Bug #23621 Slave io_thread hangs after master crash
Submitted: 25 Oct 2006 11:03 Modified: 30 Nov 2006 12:58
Reporter: Mikhail Petrov
Status: No Feedback
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:4.1.15-standart OS:Linux (SLES 9 (x64))
Assigned to: CPU Architecture:Any
Tags: io_thread hangs, master crash, replication

[25 Oct 2006 11:03] Mikhail Petrov
Description:
I have a simple replication setup: one master and three slaves. The MySQL version is 4.1.15, downloaded from mysql.com. The OS is SuSE 9 on x64 (AMD Opteron).
From time to time I see the following situation:
The MySQL master crashes for various reasons (sometimes it is simply a kill -9). After that, the slave io_thread hangs with the status "Reconnecting after a failed master event read".
The command "Stop slave" hangs too. When I try to kill the slave threads manually, this is what I see:
| 428340 | system user | | NULL | Killed | 325011 | Reconnecting after a failed master event read | NULL |
| 428341 | system user | | NULL | Killed | 2506 | Waiting for slave mutex on exit | NULL |
| 531281 | root | somehost.local:60094 | NULL | Query | 1053 | Killing slave | slave stop |
A plain kill of mysqld on the slave does not work; only `kill -9` does.
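
For anyone triaging a similar hang, the stuck replication threads can be picked out of the process list mechanically. The following is only a minimal sketch: it assumes table-formatted `SHOW PROCESSLIST` rows with the usual column order (Id, User, Host, db, Command, Time, State, Info), and the function name `stuck_slave_threads` is hypothetical.

```shell
# Print "Id: State" for replication threads (User = "system user") whose
# Command column is "Killed". Reads table-formatted SHOW PROCESSLIST rows
# from standard input.
stuck_slave_threads() {
  awk -F'|' '
    {
      # Fields are split on "|"; field 1 is the empty text before the first "|".
      # Trim the padding around every field before comparing.
      for (i = 1; i <= NF; i++) gsub(/^[ \t]+|[ \t]+$/, "", $i)
      # $2 = Id, $3 = User, $6 = Command, $8 = State
      if ($3 == "system user" && $6 == "Killed") print $2 ": " $8
    }'
}
```

It could be fed from the client with something like `mysql --table -e 'SHOW PROCESSLIST' | stuck_slave_threads`, assuming credentials are already configured; `--table` is needed because the client switches to tab-separated output when its output is piped.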

How to repeat:
Run `kill -9` on mysqld on the master server. Sometimes the slave hangs, sometimes it does not.
I have noticed that if I run `slave stop` before net_read_timeout seconds have passed, the slave stops normally. Perhaps this is unrelated, but maybe it will be helpful.
If you need more information, mail me and I'll try to help. This error is really serious for me.
[25 Oct 2006 11:50] Valeriy Kravchuk
Thank you for the problem report. Please try to repeat it with a newer version, 4.1.21, on both master and slave, and let us know the results.

Send the results of:

getconf GNU_LIBPTHREAD_VERSION;

from master and slaves.
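
To compare hosts quickly, the getconf output can be reduced to a single word. This is only a sketch: the helper name `thread_library` is hypothetical, and the assumption is that NPTL systems print a string starting with "NPTL" while LinuxThreads systems print one starting with "linuxthreads".

```shell
# Classify the string printed by `getconf GNU_LIBPTHREAD_VERSION`.
thread_library() {
  case "$1" in
    NPTL*)         echo nptl ;;
    linuxthreads*) echo linuxthreads ;;
    *)             echo unknown ;;   # empty or unrecognised output
  esac
}

# Classify the current host; errors from getconf are hidden and fall
# through to "unknown".
thread_library "$(getconf GNU_LIBPTHREAD_VERSION 2>/dev/null)"
```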
[25 Oct 2006 12:07] Mikhail Petrov
NPTL 2.3.5 on both servers.
[25 Oct 2006 12:11] Mikhail Petrov
Sorry, but I can't use 4.1.21 until this bug is fixed: http://bugs.mysql.com/bug.php?id=21456 :(
However, I may be able to test 4.1.21 on a slave. Would that help?
[25 Oct 2006 13:14] Valeriy Kravchuk
Can you, please, try to set:

export LD_ASSUME_KERNEL=2.4.1

in the script used to start MySQL server (on slave), and check for the same behaviour? The idea is to check if LinuxThreads vs. NPTL makes any difference.

Testing 4.1.21 on the slave will also be useful.
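
One way to keep the setting confined to the MySQL start script is sketched below. The function name `use_linuxthreads` and the mysqld_safe path in the commented example are assumptions, not part of any MySQL distribution; adjust them to the local install.

```shell
# Export the loader hint so that every process started afterwards from this
# shell inherits it. Call it only inside the MySQL start script or a
# subshell, because it affects all dynamically linked programs started later.
use_linuxthreads() {
  LD_ASSUME_KERNEL=2.4.1
  export LD_ASSUME_KERNEL
}

# Example (commented out; the path is an assumption):
# ( use_linuxthreads && exec /usr/bin/mysqld_safe ) &
```

Running the start in a subshell means the variable does not leak into the rest of the init script or into an interactive session.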
[31 Oct 2006 6:14] David Hillman
FYI, this bug is also present in 4.1.18 on 32-bit Linux.  Happened to me this evening, with, coincidentally, exactly the same setup, single master with three slaves.  All three slaves were completely hung, and wouldn't even shut down following a master crash.
[31 Oct 2006 6:18] David Hillman
Oh yeah, one additional point.  When you issue "stop slave", if you let it sit long enough, you will fill up all your connections with "show slave status" commands.

I also can't use 4.1.21 because of existing bugs.  I do have it installed on a few test servers, and will try to test this bug when I have a chance.
[31 Oct 2006 8:25] Mikhail Petrov
David, FYI, I see this bug even in a simple 'master-slave' scheme.
I think it is not a bug on the master, and it is not related to the number of slaves.
[31 Oct 2006 12:58] Valeriy Kravchuk
Mikhail,

Can you, please, try to set:

export LD_ASSUME_KERNEL=2.4.1

in the script used to start MySQL server (on slave), and check for the same
behaviour? The idea is to check if LinuxThreads vs. NPTL makes any difference.
[1 Dec 2006 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".