Bug #10448 Slave I/O thread hanging
Submitted: 8 May 2005 14:22 Modified: 2 May 2006 6:17
Reporter: [ name withheld ]
Status: No Feedback
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:4.1.18, 4.1.11 OS:Linux (RHEL4, Debian Linux)
Assigned to: CPU Architecture:Any

[8 May 2005 14:22] [ name withheld ]
Description:
When the slave I/O thread is in the "Connecting to master" state, it blocks completely, and any attempt to kill the server fails because the slave thread will not die.

How to repeat:
Set up replication. Point slave to some master. Stop mysql on the master, and start mysql on the slave.

Suggested fix:
The slave thread must not use a blocking connect call, and it should check its kill flag at intervals.
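
To illustrate the idea only (this is not MySQL's code, which is C/C++), here is a minimal Python sketch of a non-blocking connect that polls a kill flag while it waits. The names connect_with_kill_check, kill_flag and poll_interval are hypothetical, and the sketch assumes Linux-style socket behaviour and an IP address rather than a hostname (name resolution could still block).

```python
import errno
import os
import select
import socket
import threading

# Error codes meaning "the non-blocking connect is still in progress".
IN_PROGRESS = (errno.EINPROGRESS, errno.EWOULDBLOCK, errno.EALREADY)


def connect_with_kill_check(addr, kill_flag, poll_interval=0.2):
    """Connect to addr, an (ip, port) tuple, while polling kill_flag.

    Returns the connected socket, or None if kill_flag was set first.
    Raises OSError if the connection attempt itself fails.
    """
    if kill_flag.is_set():
        return None
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # connect_ex returns an error code instead of raising for connect() errors.
    err = sock.connect_ex(addr)
    while err in IN_PROGRESS:
        if kill_flag.is_set():
            # Asked to stop: abandon the pending connect.
            sock.close()
            return None
        # Wait at most poll_interval seconds for the connect to finish.
        _, writable, _ = select.select([], [sock], [], poll_interval)
        if writable:
            # 0 means connected; any other value is the connect error.
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
    if err != 0:
        sock.close()
        raise OSError(err, os.strerror(err))
    # Hand back an ordinary blocking socket to the caller.
    sock.setblocking(True)
    return sock
```

With this shape, another thread can call kill_flag.set() (a threading.Event) and the connecting thread gives up within roughly poll_interval seconds instead of staying stuck in connect().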
[2 Jul 2005 11:04] Aleksey Kishkin
Hi! I tested against 4.1.12 and was not able to reproduce it. Both the slave thread and the server were terminated without problems. If you have any ideas on how to reproduce it, please let us know.
[14 Mar 2006 19:35] David Jennings
I am experiencing the same problem running 4.1.18 on Red Hat. I cannot reproduce it, but I have experienced the same bug on multiple servers.

Interestingly enough, I have 2 servers, identical in hardware, software and SQL data. The only difference between them is that server A gets its slave turned on and off hourly, whereas server B has its slave on all the time. This problem occurs on server A approximately once a day. It never occurs on server B.
[27 Mar 2006 9:42] Valeriy Kravchuk
To all the reporters:

Please specify the exact MySQL server binaries (URL), and the glibc and kernel versions, for all the servers where this bug is observed.

Can any of you repeat it every time?
[2 Apr 2006 6:17] Valeriy Kravchuk
Sorry, but I was not able to repeat the described behaviour with 4.1.19-BK on Linux. I set up replication, stopped the slave, and stopped the master. Then I started the slave, performed STOP SLAVE (successfully), and ran the following commands:

openxs@suse:~/dbs/4.1> a=1
openxs@suse:~/dbs/4.1> while [ $a -le 100 ]; do echo "Step $a"; let a=a+1; bin/mysql -uroot test -e "start slave; show processlist; stop slave; show processlist;"; done;

I've got:

...
Step 100
+-----+-------------+-----------+------+---------+------+-----------------------------------------+------------------+
| Id  | User        | Host      | db   | Command | Time | State                                   | Info             |
+-----+-------------+-----------+------+---------+------+-----------------------------------------+------------------+
| 615 | root        | localhost | test | Query   |    0 | NULL                                    | show processlist |
| 616 | system user |           | NULL | Connect |    0 | Connecting to master                    | NULL             |
| 617 | system user |           | NULL | Connect |    0 | Waiting for the next event in relay log | NULL             |
+-----+-------------+-----------+------+---------+------+-----------------------------------------+------------------+
+-----+------+-----------+------+---------+------+-------+------------------+
| Id  | User | Host      | db   | Command | Time | State | Info             |
+-----+------+-----------+------+---------+------+-------+------------------+
| 615 | root | localhost | test | Query   |    0 | NULL  | show processlist |
+-----+------+-----------+------+---------+------+-------+------------------+

So, the server was successfully stopped 99 times, with no hang.

openxs@suse:~/dbs/4.1> uname -a
Linux suse 2.6.11.4-20a-default #1 Wed Mar 23 21:52:37 UTC 2005 i686 i686 i386 GNU/Linux

It may be a problem specific to certain distributions and architectures. So, all reporters, please send the uname -a results and answer the question from my previous comment: specify the exact glibc version.
[5 Apr 2006 15:01] [ name withheld ]
I've been running earlier versions 4.1.16 & 4.1.15(or 14) for several months (more than a year?) w/o any slave issues (master-master config) under heavy load conditions. I recently upgraded to 4.1.18 and have experienced the following:
* Fedora FC1 (latest legacy release): Slave would not stop; kill -9 did not work. The only fix was to reboot the machine (300+ days of no issues on previous MySQL versions).
* Received a "too many connections" error on the primary master (expected behavior due to heavy load) and was not able to stop/start the slave. Reboot required.
* Replaced FC1 w/ CentOS 4.3 (like Red Hat 4, 2.6.9-34.ELsmp) w/ MySQL 4.1.18 in a much less intensive load configuration. Under minor MySQL activity, I have seen both my primary and secondary (master-master config) servers hang to the point where I had to do a hard reboot to regain access to services (e.g. mysql, ssh, http, etc). The slave system (master now rebooted and fixed), while still serving requests, could not kill its slave (it hangs); however, I am able to do a "kill -9" on mysql and start it again. I am only writing to one master at a time.
Bottom line: I have two issues. (1) MySQL 4.1.18 appears to be flaky on two different OSs (Linux 2.4 & 2.6): I often cannot kill the slave; however, it's not predictable behavior. After killing & restarting mysql I can restart the slave. (2) The server dies after 3-4 days, at which point a reboot is required (at this time I cannot definitively attribute this to the MySQL service, but I absolutely cannot rule it out).
[2 May 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".