MySQL Bugs: #10448: Slave I/O thread hanging

Bug #10448	Slave I/O thread hanging
Submitted:	8 May 2005 14:22	Modified:	2 May 2006 6:17
Reporter:	[ name withheld ]
Status:	No Feedback
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	4.1.18, 4.1.11	OS:	Linux (RHEL4, Debian Linux)
Assigned to:		CPU Architecture:	Any

Description:
When the slave thread is in the "Connecting to master" state, it blocks completely, and any attempt to kill the server fails because the slave thread will not die.

How to repeat:
Set up replication. Point the slave to some master. Stop mysql on the master, then start mysql on the slave.

Suggested fix:
The slave thread must not use a blocking connect call, and it must check its kill flag at intervals.
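The suggested fix can be illustrated with a minimal sketch. This is not MySQL's code (the server is written in C/C++); it only shows the general pattern of a non-blocking connect that polls a kill flag. The function name `connect_with_kill_check`, the `kill_flag` event, and the `poll_interval` and `timeout` parameters are all hypothetical, and `host` is assumed to be an IP address so that no blocking DNS lookup occurs.

```python
import errno
import os
import select
import socket
import threading


def connect_with_kill_check(host, port, kill_flag, poll_interval=0.5, timeout=30.0):
    """Return a connected socket, or None if killed or timed out.

    kill_flag is a threading.Event that another thread sets to request abort.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setblocking(False)
    # connect_ex returns an errno instead of raising; EINPROGRESS is the
    # normal result for a non-blocking connect that has not finished yet.
    err = sock.connect_ex((host, port))
    if err not in (0, errno.EINPROGRESS, errno.EWOULDBLOCK):
        sock.close()
        raise OSError(err, os.strerror(err))

    waited = 0.0
    while waited < timeout:
        # Check the kill flag on every pass so a shutdown request is
        # noticed within roughly poll_interval seconds.
        if kill_flag.is_set():
            sock.close()
            return None
        # The socket becomes writable once the connect attempt completes,
        # whether it succeeded or failed.
        _, writable, _ = select.select([], [sock], [], poll_interval)
        if writable:
            err = sock.getsockopt(socket.SOL_SOCKET, socket.SO_ERROR)
            if err != 0:
                sock.close()
                raise OSError(err, os.strerror(err))
            sock.setblocking(True)
            return sock
        waited += poll_interval

    sock.close()
    return None
```

A caller would keep a reference to `kill_flag` and call `kill_flag.set()` from the thread that handles shutdown; the connecting thread then gives up at its next poll instead of waiting for the operating system's connect timeout.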

Hi! I tested against 4.1.12 and was not able to reproduce it. Both the slave thread and the server were terminated without problems. If you have any ideas on how to reproduce it, please let us know.

I am experiencing the same problem running 4.1.18 on Red Hat. I cannot reproduce it on demand, but I have experienced the same bug on multiple servers.

Interestingly enough, I have 2 servers, identical in hardware, software and SQL data. The only difference between them is that server A gets its slave turned on/off hourly, whereas server B has its slave on all the time. This problem occurs on server A approximately once a day. It never occurs on server B.

To all the reporters:

Please specify the exact MySQL server binaries (URL), glibc version and kernel version for every server where this bug is observed.

Can any of you repeat it each and every time?

Sorry, but I was not able to repeat the described behaviour with 4.1.19-BK on Linux. I set up replication, stopped the slave, and stopped the master. Then I started the slave, performed STOP SLAVE (successfully), and ran the following commands:

openxs@suse:~/dbs/4.1> a=1
openxs@suse:~/dbs/4.1> while [ $a -le 100 ]; do echo "Step $a"; let a=a+1; bin/mysql -uroot test -e "start slave; show processlist; stop slave; show processlist;"; done;

I got:

...
Step 100
+-----+-------------+-----------+------+---------+------+-----------------------------------------+------------------+
| Id  | User        | Host      | db   | Command | Time | State                                   | Info             |
+-----+-------------+-----------+------+---------+------+-----------------------------------------+------------------+
| 615 | root        | localhost | test | Query   |    0 | NULL                                    | show processlist |
| 616 | system user |           | NULL | Connect |    0 | Connecting to master                    | NULL             |
| 617 | system user |           | NULL | Connect |    0 | Waiting for the next event in relay log | NULL             |
+-----+-------------+-----------+------+---------+------+-----------------------------------------+------------------+
+-----+------+-----------+------+---------+------+-------+------------------+
| Id  | User | Host      | db   | Command | Time | State | Info             |
+-----+------+-----------+------+---------+------+-------+------------------+
| 615 | root | localhost | test | Query   |    0 | NULL  | show processlist |
+-----+------+-----------+------+---------+------+-------+------------------+

So, the server was successfully stopped 99 times, with no hang.
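When a loop like this runs for many iterations, reading each processlist by eye is tedious. The following is a rough sketch of how the mysql client's tabular output could be parsed to pick out the replication threads. The helper names `parse_mysql_table` and `slave_threads` are hypothetical, and the parser assumes no cell contains a `|` character, which is not guaranteed for the Info column.

```python
def parse_mysql_table(text):
    """Parse the mysql client's ASCII table output into a list of dicts."""
    rows = []
    header = None
    for line in text.splitlines():
        line = line.strip()
        # Skip +-----+ border lines, blank lines and anything that is not a row.
        if not (line.startswith("|") and line.endswith("|")):
            continue
        cells = [cell.strip() for cell in line[1:-1].split("|")]
        if header is None:
            header = cells  # the first row of the table holds the column names
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def slave_threads(rows):
    """Return the rows that belong to replication threads ("system user")."""
    return [row for row in rows if row.get("User") == "system user"]
```

Applied to output captured after `stop slave`, an empty result from `slave_threads` would indicate that both replication threads have exited.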

openxs@suse:~/dbs/4.1> uname -a
Linux suse 2.6.11.4-20a-default #1 Wed Mar 23 21:52:37 UTC 2005 i686 i686 i386 GNU/Linux

It may be a problem specific to certain distributions and architectures. So, all reporters, please send the output of uname -a and answer the question from my previous comment: specify the exact glibc version.
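For reporters who want to collect this information in one step, here is a small sketch using only the Python standard library. The function name `collect_environment` is hypothetical. `platform.libc_ver()` inspects the Python executable, so the reported libc is assumed (not guaranteed) to be the same one mysqld is linked against. The script does not cover the URL of the MySQL binaries, which still has to be supplied by hand.

```python
import platform


def collect_environment():
    """Gather the kernel and libc details requested in this report."""
    uname = platform.uname()
    # libc_ver() returns ('', '') when it cannot determine the C library.
    libc_name, libc_version = platform.libc_ver()
    return {
        "system": uname.system,
        "release": uname.release,   # kernel release, as in `uname -r`
        "version": uname.version,   # kernel build string, as in `uname -v`
        "machine": uname.machine,   # hardware name, as in `uname -m`
        "libc": f"{libc_name} {libc_version}".strip() or "unknown",
    }


if __name__ == "__main__":
    for key, value in collect_environment().items():
        print(f"{key}: {value}")
```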

I had been running earlier versions, 4.1.16 and 4.1.15 (or 14), for several months (more than a year?) without any slave issues (master-master config) under heavy load. I recently upgraded to 4.1.18 and have experienced the following:
* Fedora FC1 (latest legacy release): the slave would not stop, and kill -9 did not work. The only fix was to reboot the machine (300+ days of no issues on previous MySQL versions).
* Received a "too many connections" error on the primary master (expected behavior due to heavy load) and was not able to stop/start the slave. Reboot required.
* Replaced FC1 with CentOS 4.3 (like Red Hat 4, 2.6.9-34.ELsmp) with MySQL 4.1.18 in a much less load-intensive configuration. Under minor MySQL activity, I have seen both my primary and secondary (master-master config) servers hang to the point where I had to do a hard reboot to regain access to services (e.g. mysql, ssh, http, etc.). On the slave system (with the master now rebooted and fixed), while it was still serving requests, I could not kill its slave thread (the kill hangs); however, I am able to do a "kill -9" on mysql and start it again. I am only writing to one master at a time.
Bottom line: I have two issues. (1) MySQL 4.1.18 appears to be flaky on two different OSs (Linux 2.4 and 2.6): I often cannot kill the slave, although the behavior is not predictable. After killing and restarting MySQL, I can restart the slave. (2) The server dies after 3-4 days, at which point a reboot is required (at this time I cannot definitively attribute this to the MySQL service, but I absolutely cannot rule it out).

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".