MySQL Bugs: #5001: mysqld_safe kills all mysqlds when only one dies (under multiple installations)

Bug #5001	mysqld_safe kills all mysqlds when only one dies (under multiple installations)
Submitted:	11 Aug 2004 20:20	Modified:	2 Sep 2004 0:19
Reporter:	Jason Sydes	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	4.0.20	OS:	Linux (linux)
Assigned to:	Bugs System	CPU Architecture:	Any

Description:
When running multiple mysql installations (ie mysqld_multi), and one installation dies,
mysqld_safe will kill ~all~ mysql processes on that machine.

This should be relatively easy to fix.

Here's the relevant line from mysqld_safe:

PROC=`ps xa | grep $ledir/$MYSQLD | grep -v "grep" | sed -n '$p'`

which later leads to
for T in $PROC
....
if kill -9 $T

(We don't actually use mysqld_multi (we hacked up /etc/init.d/mysqld years ago before
mysqld_multi was available), but I'm 99% sure that it doesn't matter.)

On our installation, when you run the script, variable substitution results in the following line:
ps xa | grep /usr/local/dh/mysql/base/libexec/mysqld | grep -v "grep" | sed -n '$p'

Since both mysqld_multi and our /etc/init.d/mysqld actually uses mysqld_safe, and we have not
modified mysqld_safe (beyond compilation flags to set the basedir as shown above), I'm almost
positive you'll have the same problem if you are running multiple installations of mysql with the
standard mysqld_multi setup.

Our /etc/init.d/mysqld is just a mildly modified version of your mysql.server.sh script.

In any case, the "PROC" line above doesn't result in any kind of selectivitity, and the "kill" then
kills indiscriminately.

So if I'm running 5 separate installations of mysql on my machine, and one of them segfaults, the
mysqld_safe for that single dead mysql installation will then kill all the processes of the four other
(otherwise happily running) mysql installations.

Hope I'm making sense here.

We've been seeing this for years, and I just now found it.

How to repeat:
Install several installations of mysql on a single machine.

run `ps aux' to find all the processes of a single mysql installation.
maybe something like:
ps aux | grep /tmp/mysql.sock3

in a second window, do a `tail -f ' on the error logs of a few of the ~other~ mysql installations

back to the first window:
don't kill the mysqld_safe process.
instead, kill -9 one of the mysqld processes.

watch the error logs of the other mysql installations (in the other window).
you'll start seeing stuff like this:

Number of processes running now: 300
Number of processes running now: 96
Number of processes running now: 73
Number of processes running now: 53

mysqld process hanging, pid 5230 - killed
mysqld process hanging, pid 5201 - killed
mysqld process hanging, pid 5487 - killed
mysqld process hanging, pid 3809 - killed
040811 10:08:51 mysqld restarted

Suggested fix:
Quick fix for me (and others?):
Change:
KILL_MYSQLD=1;
to
KILL_MYSQLD=0;
in mysqld_safe.

Permanent fix:
Looks like you can probably just do:
PROC=`ps xa | grep $ledir/$MYSQLD | grep $pid_file | grep -v "grep" | sed -n '$p'`
and it will be fixed.

This is still the case in 4.1.3.  Maybe the safest way would be to store the socket file and use 
fuser?

fuser is not portable.

But doesn't the pid file always appear on the command line?
Isn't that the most portable way to ensure that you're killing the right processes?

I suppose you could accidentally kill another process, say `less /tmp/mysql6.pid', but that seems 
a relatively harmless consequence....

fuser is portable enough, the killing code is protected by @IS_LINUX@ anyway.
at least all our linux hosts (and many non-linux ones, *bsd excluded) have fuser.

I cannot get the failure to occur. I have started 4 servers via mysqld_multi, killed one of the servers yet all the rest of the remaining servers continue to run.

Really?
What is the "ps" line look like inside your mysqld_safe?

Actually, when you kill one of the mysqld processes, does that server restart?  Or does it just 
keep running?  When I killed mine, it would restart that entire server, but, also, it would kill all 
the mysql processes, including those of the other servers.

Are you seeing the following line in the error log of the "killed" server:
   "Number of processes now running: blah"
with "blah" replaced by some number (probably around 10 or 20 in your test case, though for me 
it was generally several hundred).  
You might see that line multiple times too.
That line is spit out by mysqld_safe.

If you're not seeing that line, you're not going to see the same behavior I saw.

If you are seeing that line, are you seeing that line in the other servers' error logs as well?  If so, 
then that's the bug that I'm reporting.

If you are seeing that line, and you're only seeing it the error log of the "killed" server, and 
you're **not** seeing it in the error logs of the other servers, then I am truly perplexed (I'll 
explain myself after hearing what you're seeing).

When you do a `SHOW STATUS' or `mysqladmin stat' on the "killed" server, does that show you 
that it had been restarted (ie does the uptime show a low number)?  How about for the other 
servers?

Thanks for your attention to this Patrick!
Jason

Jason: when I kill one process of a running server ( 1 out of 3 servers), I see this in its log:

040817 15:47:50  mysqld restarted
040817 15:47:50  Warning: Asked for 196608 thread stack, but got 126976
040817 15:47:50  Can't start server: Bind on TCP/IP port: Address already in use
040817 15:47:50  Do you already have another mysqld server running on port: 3308 ?
040817 15:47:50  Aborting

040817 15:47:50  /usr/local/mysql/bin/mysqld: Shutdown Complete

040817 15:47:50  mysqld ended

As far as processes, the other servers don't get touched. The shutdown server remains unstarted, and the others stay running, same exact PIDs before and after.

I would suggest starting your servers with mysqld_multi and using mysqld_multi settings in my.cnf.

Hope this helps;)

Patrick, the reported bug is in *mysqld_safe*
Thus to repeat it you need to use mysqld_safe as in mysqld_multi --example or with --mysqld=mysqld_safe.

And then it's perfectly repeatable

It works as intended, even with mysqld=/path/to/safe_mysqld (both 4.0.20 and 4.1.3):

rsna:/usr/local # /usr/local/mysql/bin/mysqld_multi --user=mysql --mysqld=/usr/local/mysql/bin/mysqld_safe start
krsna:/usr/local # ps aux|grep msyql
root      6053  0.0  0.1  1764  568 pts/6    R    13:15   0:00 grep msyql
krsna:/usr/local # ps aux|grep mysql
root     25261  0.0  0.0  2092    4 pts/5    S    Aug23   0:00 mysql mysql
root      5955  0.2  0.2  2676 1200 pts/6    S    13:15   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --socket=/tmp/mysql.sock2 --port=3307 --pid-file=/usr/local/mysql/data2/hostname.pid2 --datadir=/usr/local/mysql/data2
root      5958  0.3  0.2  2676 1200 pts/6    S    13:15   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --socket=/tmp/mysql.sock3 --port=3308 --pid-file=/usr/local/mysql/data3/hostname.pid3 --datadir=/usr/local/mysql/data3
root      5961  0.3  0.2  2676 1200 pts/6    S    13:15   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --socket=/tmp/mysql.sock4 --port=3309 --pid-file=/usr/local/mysql/data4/hostname.pid4 --datadir=/usr/local/mysql/data4
mysql     6032  1.4  2.0 30304 10548 pts/6   D    13:15   0:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data4 --user=mysql --log=/usr/local/mysql/data4/krsna.log --pid-file=/usr/local/mysql/data4/hostname.pid4 --skip-locking --port=3309 --socket=/tmp/mysql.sock4
mysql     6034  1.4  2.0 30304 10548 pts/6   D    13:15   0:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data2 --user=mysql --log=/usr/local/mysql/data2/krsna.log --pid-file=/usr/local/mysql/data2/hostname.pid2 --skip-locking --port=3307 --socket=/tmp/mysql.sock2
mysql     6036  1.5  2.0 30304 10548 pts/6   D    13:15   0:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data3 --user=mysql --log=/usr/local/mysql/data3/krsna.log --pid-file=/usr/local/mysql/data3/hostname.pid3 --skip-locking --port=3308 --socket=/tmp/mysql.sock3

krsna:/usr/local # kill -9 6036
krsna:/usr/local # mysql -S /tmp/mysql.sock4
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 4.0.20-standard-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> \q
Bye
krsna:/usr/local # mysql -S /tmp/mysql.sock2
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2 to server version: 4.0.20-standard-log

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> \q
Bye
krsna:/usr/local # mysql -S /tmp/mysql.sock3
ERROR 2002: Can't connect to local MySQL server through socket '/tmp/mysql.sock3' (2)
krsna:/usr/local # ps aux|grep mysql
root     25261  0.0  0.0  2092    4 pts/5    S    Aug23   0:00 mysql mysql
root      5955  0.0  0.2  2676 1200 pts/6    S    13:15   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --socket=/tmp/mysql.sock2 --port=3307 --pid-file=/usr/local/mysql/data2/hostname.pid2 --datadir=/usr/local/mysql/data2
root      5961  0.0  0.2  2676 1200 pts/6    S    13:15   0:00 /bin/sh /usr/local/mysql/bin/mysqld_safe --socket=/tmp/mysql.sock4 --port=3309 --pid-file=/usr/local/mysql/data4/hostname.pid4 --datadir=/usr/local/mysql/data4
mysql     6032  0.1  2.0 30352 10644 pts/6   S    13:15   0:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data4 --user=mysql --log=/usr/local/mysql/data4/krsna.log --pid-file=/usr/local/mysql/data4/hostname.pid4 --skip-locking --port=3309 --socket=/tmp/mysql.sock4
mysql     6034  0.1  2.0 30352 10644 pts/6   S    13:15   0:00 /usr/local/mysql/bin/mysqld --basedir=/usr/local/mysql --datadir=/usr/local/mysql/data2 --user=mysql --log=/usr/local/mysql/data2/krsna.log --pid-file=/usr/local/mysql/data2/hostname.pid2 --skip-locking --port=3307 --socket=/tmp/mysql.sock2

krsna:/usr/local # tail -7 /usr/local/mysql/data2/krsna.err
040817 17:29:09  mysqld ended

040825 13:15:23  mysqld started
040825 13:15:23  Warning: Asked for 196608 thread stack, but got 126976
040825 13:15:30  InnoDB: Started
/usr/local/mysql/bin/mysqld: ready for connections.
Version: '4.0.20-standard-log'  socket: '/tmp/mysql.sock2'  port: 3307

rsna:/usr/local # tail -10 /usr/local/mysql/data3/krsna.err
040825 13:16:33  mysqld restarted
040825 13:16:33  Warning: Asked for 196608 thread stack, but got 126976
040825 13:16:33  Can't start server: Bind on TCP/IP port: Address already in use
040825 13:16:33  Do you already have another mysqld server running on port: 3308 ?
040825 13:16:33  Aborting

040825 13:16:33  /usr/local/mysql/bin/mysqld: Shutdown Complete

040825 13:16:33  mysqld ended
krsna:/usr/local # tail -7 /usr/local/mysql/data4/krsna.err
040817 17:29:08  mysqld ended

040825 13:15:23  mysqld started
040825 13:15:23  Warning: Asked for 196608 thread stack, but got 126976
040825 13:15:30  InnoDB: Started
/usr/local/mysql/bin/mysqld: ready for connections.
Version: '4.0.20-standard-log'  socket: '/tmp/mysql.sock4'  port: 3309

So, you kill -9 a threat of server #3 (we'll call it #3 in this case), and mysqld_safe tries to restart, but I assume the port is still in use because it's in the process of dying, so it gives up, and also doesn't try to kill any other mysqld process on any other port. My my.conf file is this:

[mysqld_multi]
mysqld     = /usr/local/mysql/bin/mysqld_safe
mysqladmin = /usr/local/mysql/bin/mysqladmin

[mysqld2]
socket     = /tmp/mysql.sock2
port       = 3307
pid-file   = /usr/local/mysql/data2/hostname.pid2
datadir    = /usr/local/mysql/data2

[mysqld3]
socket     = /tmp/mysql.sock3
port       = 3308
pid-file   = /usr/local/mysql/data3/hostname.pid3
datadir    = /usr/local/mysql/data3

[mysqld4]
socket     = /tmp/mysql.sock4
port       = 3309
pid-file   = /usr/local/mysql/data4/hostname.pid4
datadir    = /usr/local/mysql/data4

With help from Serg, we found that there was an error in the regular expression that caused safe_mysqld to kill itself and not each mysqld thread:

 PROC=`ps xa | grep $ledir/$MYSQLD | grep -v "grep" | sed -n '$p'`

changed to

PROC=`ps xa | grep "$ledir/$MYSQLD\>" | grep -v "grep" | sed -n '$p'`

This allowed it to manifest the original error that this bug has been logged for, which I've fixed by having logic test to see if $mysql_tcp_port set, and grepping to see if the port is in the ps list:
if test -n "$mysql_tcp_port"
    then
      numofproces=`ps xa | grep -v "grep" | grep $ledir/$MYSQLD| grep -c "port=$mysql_tcp_port"`
    else
      numofproces=`ps xa | grep -v "grep" | grep -c $ledir/$MYSQLD`
    fi

    echo -e "\nNumber of processes running now: $numofproces" | tee -a $err_log
    I=1
    while test "$I" -le "$numofproces"
    do
      if test -n "$mysql_tcp_port"
      then
        PROC=`ps xa | grep "$ledir/$MYSQLD\>" | grep -v "grep" | grep "port=$mysql_tcp_port" | sed -n '$p'`
      else
        PROC=`ps xa | grep "$ledir/$MYSQLD\>" | grep -v "grep" | sed -n '$p'`
      fi

NOTE: mysqld_multi always starts mysqld_safe with a port set, so this is a good means of checking this