MySQL Bugs: #21781: Replication slave io thread hangs

Bug #21781	Replication slave io thread hangs
Submitted:	22 Aug 2006 12:23	Modified:	15 Mar 2007 16:41
Reporter:	Andrew Tulloch
Status:	Closed
Category:	MySQL Server	Severity:	S2 (Serious)
Version:	5.0.24	OS:	FreeBSD (FreeBSD 6.1-p3, Linux, all)
Assigned to:	Magnus Blåudd	CPU Architecture:	Any
Tags:	bfsm_2007_02_15, openssl, SSL

Description:
After building MySQL 5.0.24 from the FreeBSD port on two AMD64 machines (one UP with 1GB RAM, one SMP with 6GB RAM) and configuring them for simple master->slave replication, the slave I/O thread hangs after a short time in Slave_IO_State "Waiting to reconnect after a failed master event read". Issuing a "slave stop" or "mysqladmin shutdown" will also hang.

Issuing a "flush logs" or "mysqladmin shutdown" on the master unfreezes the slave, and the slave will start replicating again (or the "slave stop" or "mysqladmin shutdown" commands will complete).

This configuration with MySQL 5.0.22 built again from the FreeBSD port worked correctly.

How to repeat:
Build the port with "make WITH_OPENSSL=yes" on two FreeBSD machines.

I have a /etc/libmap.conf to use libthr as below:
[/usr/local/libexec/mysqld]
libpthread.so.2 libthr.so.2
libpthread.so libthr.so

I've also tested without those lines in libmap.conf (so using libpthread) and got the same results.

I'm not certain SSL has anything to do with it, but my builds are using it. Configure for replication with replication users having REQUIRE SSL, obviously setting up certs for mysqld so that SSL works.

Start both servers and let the slave connect to the master, then issue a "slave stop" command on the slave. The mysql prompt should hang there. Another mysql cli should reveal that the thread running "slave stop" is in state "Killing slave", where it will stay indefinitely (tested up to 13 minutes).

Open a mysql cli on the master and issue "flush logs", the "slave stop" command on the slave will now complete.

Thank you for the report.

I can not repeat the problem using current BK sources. Could you please provide your ktrace file?

I can reproduce on FreeBSD 4.10 and 4.11 using 4.0.26 and 4.1.18 (both of which are older versions, but these are in production).

easy test scenario:
install MySQL on machine 'A'
  ensure log-bin is set in my.cnf
  grant all on *.* to repl@'%' identified by 'repl' (for convenience)

install MySQL on machine 'B'
  change master to
    master_host = 'MachineA',
    master_user = 'repl',
    master_log_file = 'machinea-bin.000001',
    master_password = 'repl';

on Machine B:
stop slave;
(gets stuck w/ status 'Killing Slave', see processlist below)

Once you try to kill the slave, anything else slave-related (like show slave status) also hangs, as demonstrated by this 'show full processlist' output taken after I shut the slave down on Machine 'B'.

Any event that writes to the binary log on Machine 'A' will end the slave i/o thread, such as 'flush logs'.  Killing the binlog dump process on the master will also stop the i/o thread on the slave.

------------------
mysql> show full processlist \G
*************************** 1. row ***************************
     Id: 28
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 254
  State: Killing slave  <----------- STUCK IN KILLING SLAVE
   Info: stop slave
*************************** 2. row ***************************
     Id: 31
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 263
  State: Waiting for master to send event
   Info: NULL
*************************** 3. row ***************************
     Id: 32
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 263
  State: Has read all relay log; waiting for the slave I/O thread to update it
   Info: NULL
*************************** 4. row ***************************
     Id: 33
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 153
  State: NULL
   Info: show slave status        <-- ALSO STUCK AFTER STOP SLAVE ISSUE
*************************** 5. row ***************************
     Id: 34
   User: root
   Host: localhost
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show full processlist
5 rows in set (0.00 sec)
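
The stuck thread in the listing above can also be spotted by a script. What follows is a minimal sketch in Python, assuming the vertical (\G) output of 'show full processlist' has already been captured as text. The function names and the 60-second threshold are illustrative choices, not part of MySQL.

```python
import re

# Matches the row separator lines printed by the mysql client in \G mode.
ROW_SEPARATOR = re.compile(r"^\*+ \d+\. row \*+$")


def parse_processlist(text):
    """Turn vertical processlist output into a list of dicts, one per row."""
    rows = []
    for line in text.splitlines():
        if ROW_SEPARATOR.match(line.strip()):
            rows.append({})
            continue
        name, sep, value = line.partition(":")
        if sep and rows:
            rows[-1][name.strip()] = value.strip()
    return rows


def threads_stuck_killing_slave(rows, min_seconds=60):
    """Return rows whose State starts with 'Killing slave' for at least min_seconds."""
    return [
        row for row in rows
        if row.get("State", "").startswith("Killing slave")
        and int(row.get("Time", "0") or 0) >= min_seconds
    ]
```

Using startswith rather than an exact match means hand-added annotations after the state, like the arrows in the listing above, do not break the check.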

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

I attached the ktrace file as requested, but have seen no response since. I would've attached it sooner, but I was on holiday for a week.

I was about to submit a bug for the Linux OS, but this appears to be the same issue.  If I do:
    mysql> start slave;
    mysql> stop slave;
The mysql client will hang indefinitely attempting to stop the slave.  The only options at that point are (A) "killall -9 mysqld" or (B) log into the master machine and kill the slave's replication process.

Additionally, rotating logs seems to break the connection between master and slave.  A processlist on the slave shows it attempting to reconnect to the master, yet on the master, the slave process is still in existence (waiting for some data to send back).

This is on a (Slackware 10.2) Linux platform running kernel 2.6.18, glibc 2.3.5, mysql 5.0.24a compiled against openssl 0.9.8d and with a slave user that connects via SSL.  MySQL is compiled by hand with configure patched to correctly find the location of the openssl libs.

I see the same bug on Debian unstable, running MySQL 5.0.24a compiled from source with OpenSSL enabled. STOP SLAVE hangs unless:

a) The master is connected to the slave and
b) Either the master or the slave executes a SQL query such as INSERT or FLUSH LOGS

Both a) and b) must occur in that order for STOP SLAVE to return.

Can we have an ETA for a fix?

I have tested again on MySQL 5.0.24a compiled with OpenSSL support, Debian unstable. This bug does not occur if SSL is disabled.

Connect to master via SSL and STOP SLAVE hangs. Connect to master without SSL and STOP SLAVE returns immediately. Conclusion: something is wrong with MySQL's implementation of replication using SSL.

Thank you for the feedback and comments.

Could you all please try using 5.0.26 version accessible from http://dev.mysql.com/downloads/mysql/5.0.html?

I've reproduced this bug with 5.0.26 on FreeBSD 5.4.
MySQL was compiled from source with linuxthreads.

Specifically, on the slave, "mysqladmin shutdown" or STOP SLAVE hangs until FLUSH LOGS is issued on the master or the master is shut down.

When using native (KSE) threads on FreeBSD 5.4, STOP SLAVE works correctly.

Tested MySQL 5.0.26 compiled with openssl support.

STOP SLAVE still hangs as before, unless an SQL statement such as FLUSH LOGS is executed on the master.

Running Debian Unstable (Etch) with NPTL 2.3.6 on a 2.6.18 Linux kernel.

I can confirm that the same problem exists with the following builds:

Red Hat EL 4, MySQL 5.0.24a
Solaris 9 (SPARC), MySQL 5.0.24a, 5.0.27
Solaris 10 (x86), MySQL 5.0.24a (64bit), 5.0.27 (64bit)

On my Solaris systems, I've tried MySQL linked to OpenSSL 0.9.8c and 0.9.8d.

If the user on the slave server used to perform the replication doesn't use
SSL, then the slave server can be shut down without having to flush the logs
on the master server.

Liam, please provide your configure options and the name of the compiler you use for Solaris 10 x86 builds.

My options for building on Solaris 10 x86, using Sun Studio 11:

setenv CFLAGS -xarch=amd64
setenv CXXFLAGS -xarch=amd64
setenv LDFLAGS "-R/usr/local/openssl/lib -xarch=amd64"

./configure --prefix=/usr/local/mysql --enable-thread-safe-client --with-openssl=/usr/local/openssl

OpenSSL build is obviously also 64bit, built with the same compiler.

On Solaris 9 I only build in 32bits. Even using the same compiler, it's necessary to add a couple of other options to CFLAGS and CXXFLAGS:

setenv CFLAGS "-D_POSIX_C_SOURCE=199506L -D__EXTENSIONS__"
setenv CXXFLAGS "-D_POSIX_C_SOURCE=199506L -D__EXTENSIONS__"

Hi all
I have two observations:

1) Tried with various versions on various platforms. Replication over SSL works well with 5.0.18, broken since 5.0.24.

2) I noticed that only the IO thread seems to hang:
STOP SLAVE SQL_THREAD returns with no errors, while
STOP SLAVE IO_THREAD locks up the slave server.

In the course of testing I had one case (not reproducible right now, though) where the slave server locked up during normal operation, without the replication slave thread being explicitly stopped, and the master server locked up shortly afterwards.

hth

Could someone (Sveta?) please change the OS tag of this bug? It's really cross-platform, and I think it's quite a showstopper.

Hi
This bug is still in the status "Need Feedback". Is there anything we can do to get it accepted and worked on?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Lots of feedback has been provided. What other information could we provide that can help?

Thank you, Liam, for the comments and configuration string. I was away from my computer last week and therefore didn't change status of the report in time.

The automatic "No feedback" message is generated if the status of the bug is "Need feedback" and the original reporter does not provide feedback.

No problem Sveta, let us know if there's anything else we can do to help.

I just had the same kind of problem using 5.0.27 and no SSL replication. For some reason the slave replication hangs with no further explanation.

It started with a lot of errors like this in the slave server:

061122 14:21:35 [ERROR] Got error 134 when reading table './dbname/table'
061122 14:21:36 [ERROR] Got error 134 when reading table './dbname/table'
061122 14:21:36 [ERROR] Got error 134 when reading table './dbname/table'
061122 14:21:37 [ERROR] Got error -1 when reading table './dbname/table'
061122 14:21:37 [ERROR] Got error -1 when reading table './dbname/table'

Then the error:

mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=524288000
read_buffer_size=1044480
max_used_connections=701
max_connections=700
threads_connected=682
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 3376394 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

You seem to be running 32-bit Linux and have 682 concurrent connections.
If you have not changed STACK_SIZE in LinuxThreads and built the binary
yourself, LinuxThreads is quite likely to steal a part of the global heap for
the thread stack. Please read http://www.mysql.com/doc/en/Linux.html

thd=0x9ac6ed98
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xbe61f088, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x80a3877
0x82f21d8
0x82bb1af
0x815d5cb
0x80b4e70
0x80b3af4
0x80b3044
0x82ef98c
0x83192ca
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at (nil)  is invalid pointer
thd->thread_id=1154060
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Number of processes running now: 0
061122 14:16:19  mysqld restarted
061122 14:16:19 [Warning] Asked for 196608 thread stack, but got 126976
061122 14:16:19 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.0.27-standard'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Edition - Standard (GPL)
061122 14:16:19 [Note] Slave SQL thread initialized, starting replication in log 'mysql-bin.000130' at position 102123129,
relay log './v6-relay-bin.000006' position: 102123266
                                values (
                                        "1169895826",
                                        "1994",
                                        "3ade68b7g5d465ea3",
                                        "18971238",
                                        now()
                                        )', Error_code: 126

=========================================
After that, it crashed again...

mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=524288000
read_buffer_size=1044480
max_used_connections=608
max_connections=700
threads_connected=520
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 3376394 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

You seem to be running 32-bit Linux and have 520 concurrent connections.
If you have not changed STACK_SIZE in LinuxThreads and built the binary
yourself, LinuxThreads is quite likely to steal a part of the global heap for
the thread stack. Please read http://www.mysql.com/doc/en/Linux.html

thd=0x9a9cbd38
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xbcc9d1f8, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x80a3877
0x82f21d8
0x8285f67
0x1
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/Using_stack_trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort...
thd->query at 0xa15ab50 = SELECT  p.id
                            FROM l,p,b
                           WHERE b.id  = p.bID
                             AND p.id = l.pID
                             AND MATCH(content)
                                 AGAINST ("search string here")
                           LIMIT 100
thd->thread_id=380
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.

Number of processes running now: 21
mysqld process hanging, pid 8822 - killed
mysqld process hanging, pid 8557 - killed

I appreciate your attention.

Best Regards,
Daniel
http://www.webcit.com.br/

test case

Attachment: rpl_bug21781.test (application/octet-stream, text), 579 bytes.

The bug is not repeatable with the latest BK sources compiled using the BUILD/compile-ppc-debug-max script on an Intel Mac.

For testing I used the attached test case.

OK, this seems to be stalling.

The problem does not appear when using the bundled yaSSL libs, or when using any source prior to 5.0.23 (5.0.22 and below are, contrary to my earlier posting, not affected).

The SETUP.sh script called from compile-ppc-debug-max uses yaSSL, so it will not reproduce the error:
[...]
# SSL library to use.
SSL_LIBRARY=--with-yassl
[...]

Please try to reproduce the bug with openssl. Various people here have done so, on various platforms, and they all report the same symptom.
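
Before retesting, it may help to confirm which SSL library a given mysqld binary actually uses. A minimal sketch in Python, assuming the output of `ldd` on the mysqld binary has been captured as text; the function name is hypothetical. An empty result does not prove a yaSSL build, since OpenSSL could also be linked statically, so treat it only as a hint.

```python
def dynamic_ssl_libraries(ldd_output):
    """Return the OpenSSL-looking shared library names found in ldd output."""
    found = []
    for line in ldd_output.splitlines():
        parts = line.split()
        if not parts:
            continue
        # The first field is the library name, sometimes given as a full path.
        name = parts[0].rsplit("/", 1)[-1]
        if name.startswith(("libssl.so", "libcrypto.so")):
            found.append(name)
    return found
```

A non-empty list (for example libssl and libcrypto entries) suggests the binary is dynamically linked against OpenSSL and is therefore a candidate for reproducing the hang.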

Also please note that most of us used two distinct machines for testing. As far as I understand from the few lines of the test case (Sveta, please provide the files included in the first two lines as well), it was run on one single machine. 

Please ask for any further information you might need to work on this. (I will add an attachment with a step-by-step description of how to reproduce - which does not differ from earlier posts, though)

Hope this helps
/markus

Description of compilation and how to reproduce:

http://xfer.ch/files/reproduce_bug21781.txt (couldn't add file to bug)

Thank you for the feedback.

Please upgrade to current 5.0.33 server and try with our example certificates located in the source-dir/mysql-test/std_data directory.

We have a bug report (Bug #25189) about unforgiving behaviour when certificates contain leading whitespace characters. I want to check whether your case is related to that.

I have the same (or a very similar) bug with MySQL 5.0.33 on Linux.  I set up a replicating master/slave pair using the same compile-time and run-time configuration which I have used successfully with MySQL 5.0.15.

But the slave stops replicating after a few minutes, with nothing in the logs.

At that point, connections to MySQL which attempt to "STOP SLAVE", or "SHOW SLAVE STATUS" will also hang -- including the mysql command line client.

If I leave it "hung" like this, after about 7 minutes it spontaneously (?) unfreezes itself and this appears in the slave's mysql.err log:

070131 12:00:36 [Note] Slave I/O thread killed while waiting to reconnect after a failed read
070131 12:00:36 [Note] Slave I/O thread exiting, read up to log 'frodo.000002', position 98
070131 12:00:36 [Note] Error reading relay log event: slave SQL thread was killed

My mysql configuration file includes this:

# How many seconds to wait for master before deciding the connection is broken and retrying, default is 3600
slave-net-timeout=600
# Seconds to wait between reconnect retries. Default is 60.
master-connect-retry=10

Having read through the comments on this bug, I'm concerned that it is still marked as "Need Feedback".  Many people have provided feedback.  This is a showstopper bug for us as well.   What additional feedback is required?

Master/slave replication not working seems like it ought to be a very high priority to have fixed as soon as possible!  Can we get an ETA on a fix for this?  Or at least a status update?

Hi Torrey,

the bug is in the "Need feedback" status because nobody from mysql.com has been able to repeat it yet. But because many people outside can repeat it, the bug remains open.

I tried to repeat this bug on at least 5 different machines without success. To check my guess that the problem could be in certificate handling, I asked you to try the certificates we use for tests. That is why it is in the "Need feedback" status.

Another update:

I have reproduced this problem again using MySQL 5.0.33, and the SSL keys which came with the MySQL distribution package in the mysql-test/std_data directory.

I set the system up as I always have before.  A database snapshot was obtained on the master using mysqldump, copied to the slave machine, customized to include the MASTER_HOST, MASTER_USER, MASTER_PASSWORD, and MASTER_SSL=1 information, and then installed.

I then started slave replication with 'START SLAVE'.  At that point, replication appeared to be working -- for a few seconds at least.  The output of 'SHOW SLAVE STATUS' included:

             Slave_IO_State: Waiting for master to send event
                Master_Host: frodo.lockdownnetworks.com
                Master_User: ha
                Master_Port: 3306
              Connect_Retry: 15
            Master_Log_File: frodo.000002
        Read_Master_Log_Pos: 98
             Relay_Log_File: sam-relay-bin.000002
              Relay_Log_Pos: 231
      Relay_Master_Log_File: frodo.000002
           Slave_IO_Running: Yes
          Slave_SQL_Running: Yes
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 98
            Relay_Log_Space: 231
            Until_Condition: None
         Master_SSL_Allowed: Yes
         Master_SSL_CA_File:
         Master_SSL_CA_Path: /etc/mysql
            Master_SSL_Cert: /etc/mysql/client-cert.pem
          Master_SSL_Cipher:
             Master_SSL_Key: /etc/mysql/client-key.pem
      Seconds_Behind_Master: 0

However, a few seconds later, replication stopped:  there were no messages in the mysql.err log on either the master or the slave, but the output of 'SHOW SLAVE STATUS' changed as follows (showing the changed lines -- in particular, all the *_Log_Pos lines were the same):

             Slave_IO_State: Waiting to reconnect after a failed master event read
      Relay_Master_Log_File: frodo.000002
           Slave_IO_Running: No
          Slave_SQL_Running: Yes
      Seconds_Behind_Master: NULL
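
The stalled state shown in these lines can be checked for mechanically. A minimal sketch in Python, assuming the 'SHOW SLAVE STATUS' output has been captured as "Name: value" text like the excerpts above; the function names are illustrative and not part of any MySQL client library.

```python
STALLED_IO_STATE = "Waiting to reconnect after a failed master event read"


def parse_slave_status(text):
    """Parse 'Name: value' lines of SHOW SLAVE STATUS output into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        name, sep, value = line.partition(":")
        if sep:
            fields[name.strip()] = value.strip()
    return fields


def io_thread_stalled(fields):
    """True when the I/O thread reports the state described in this bug."""
    return (
        fields.get("Slave_IO_State") == STALLED_IO_STATE
        and fields.get("Slave_IO_Running") == "No"
    )
```

Because 'SHOW SLAVE STATUS' itself can hang once the slave is in this state, a real monitor would also need a timeout around the query. That part is left out of this sketch.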

At this point I tried to 'SHOW SLAVE STATUS' again, and the command hung.  With another connection, I tried 'STOP SLAVE' and that hung too.  'Ctrl-C' at the mysql prompt results in 'Query aborted by Ctrl+C' but it does not actually return to a prompt, it is still hung.  A second 'Ctrl-C' returns me to the shell command prompt.

I followed the tip reported by Justin Swanhart and issued 'FLUSH LOGS' on the master.  That un-froze the slave database. 

I then discovered that if the slave database replication stops, with the "Slave_IO_State: Waiting to reconnect after a failed master event read" state, I can issue a 'FLUSH LOGS' on the master and it will correct the problem... for a few seconds at least, it will go back to "Slave_IO_State: Waiting for master to send event".

It gets more interesting as I continue to experiment...

If I repeatedly issue an UPDATE command -- even if it doesn't change anything in the database -- the slave system will maintain the good "Slave_IO_State: Waiting for master to send event" state, and the output of 'SHOW SLAVE STATUS' will show the Read_Master_Log_Pos counter incrementing.

But if I stop issuing that do-nothing UPDATE command, within a few seconds the slave will revert to "Slave_IO_State: Waiting to reconnect after a failed master event read".  This is completely reliable... database activity -- at least, any activity which might modify the database -- keeps the slave running, but if nothing happens on the master, replication stops!

So... it seems like a "workaround" is to keep issuing do-nothing update commands or flush logs on the master machine every second!
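
The workaround described here could be scripted roughly as follows. This is only a sketch in Python under stated assumptions: `execute` is a caller-supplied callable that runs one SQL statement on the master (how it connects is deliberately left open), and the function name, default statement, and one-second interval are illustrative. It hides the symptom and is no substitute for a fix.

```python
import time


def keep_master_active(execute, statement="FLUSH LOGS", interval_seconds=1.0,
                       iterations=None, sleep=time.sleep):
    """Run `statement` on the master every `interval_seconds`.

    `execute` is a hypothetical callable supplied by the caller.
    With iterations=None the loop runs until interrupted.
    Returns the number of statements issued.
    """
    issued = 0
    while iterations is None or issued < iterations:
        execute(statement)
        issued += 1
        sleep(interval_seconds)
    return issued
```

Note that each FLUSH LOGS starts a new binary log file on the master, so issuing it every second produces many small log files. Passing a do-nothing UPDATE as `statement` may be the gentler variant.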

Thank you all for the feedback.

Please try to create core file as described at http://dev.mysql.com/doc/refman/5.0/en/using-gdb-on-mysqld.html and attach your configuration files for master and slave.

I have maintained a 4.0 master / multi-slave setup for 4 years.  It has intra- and inter-colo hops.  I don't think I have ever seen this problem.  Of note is that my system is NOT using SSL.

(FreeBSD, Many 4.0 Mysql versions, currently 4.0.26; often mixed master-slave versions.)

Suggestion: Downgrade to ports/linuxthreads-2.2.3_19
(fwd from Jay J.)

Please also provide output of the command getconf GNU_LIBPTHREAD_VERSION

My machines are based on Debian Sarge.  I will attach the mysql configuration files.  Sveta Smirnova asked for the output of "getconf GNU_LIBPTHREAD_VERSION", it is:

NPTL 0.60

Here is a backtrace from GDB.  This is on the slave system ("sam"), after it gets into the bad state with "Slave_IO_State: Waiting to reconnect after a failed master event read" and "Slave_IO_Running: No".  The system has not crashed, if an event comes in from the master, it will unfreeze and carry on.
So, I attached gdb to the running process.  It is simply waiting in select().

root@sam:~# gdb /usr/sbin/mysqld  2881
GNU gdb 6.3-debian
...

Attaching to program: /usr/sbin/mysqld, process 2881
(no debugging symbols found)
`system-supplied DSO at 0xffffe000' has disappeared; keeping its symbols.
Reading symbols from /lib/tls/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/librt.so.1
Reading symbols from /usr/lib/libz.so.1...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/libz.so.1
Reading symbols from /lib/tls/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libdl.so.2
Reading symbols from /usr/lib/i686/cmov/libssl.so.0.9.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libssl.so.0.9.7
Reading symbols from /usr/lib/i686/cmov/libcrypto.so.0.9.7...(no debugging symbols found)...done.
Loaded symbols for /usr/lib/i686/cmov/libcrypto.so.0.9.7
Reading symbols from /lib/tls/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread -1211911072 (LWP 2881)]
[New Thread -1304106064 (LWP 2898)]
[New Thread -1303974992 (LWP 2897)]
[New Thread -1303843920 (LWP 2894)]
[New Thread -1267905616 (LWP 2892)]
[New Thread -1267774544 (LWP 2891)]
[New Thread -1267643472 (LWP 2890)]
[New Thread -1295455312 (LWP 2889)]
[New Thread -1287066704 (LWP 2888)]
[New Thread -1278678096 (LWP 2887)]
[New Thread -1257116752 (LWP 2885)]
[New Thread -1248728144 (LWP 2884)]
[New Thread -1240339536 (LWP 2883)]
[New Thread -1231950928 (LWP 2882)]
Loaded symbols for /lib/tls/libpthread.so.0
Reading symbols from /lib/tls/libcrypt.so.1...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libcrypt.so.1
Reading symbols from /lib/tls/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnsl.so.1
Reading symbols from /lib/tls/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libm.so.6
Reading symbols from /lib/tls/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libc.so.6
Reading symbols from /lib/ld-linux.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/ld-linux.so.2
Reading symbols from /lib/tls/libnss_compat.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_compat.so.2
Reading symbols from /lib/tls/libnss_nis.so.2...
(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_nis.so.2
Reading symbols from /lib/tls/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib/tls/libnss_files.so.2
Reading symbols from /lib/libgcc_s.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib/libgcc_s.so.1
0xb7d0da27 in select () from /lib/tls/libc.so.6
(gdb) bt
#0  0xb7d0da27 in select () from /lib/tls/libc.so.6
#1  0x0816efbe in handle_connections_sockets ()
#2  0x0816ea78 in main ()

I am not allowed to attach files to this bug.  Sorry for the very long comment, but here is the slave configuration file.  The master configuration file is identical, except that it has a different server_id, and has "frodo" wherever the slave has "sam".

# slave

[client]
port		= 3306
socket		= /var/run/mysqld/mysqld.sock

[mysqld]
user		= mysql
pid-file	= /var/run/mysqld/mysqld.pid
socket		= /var/run/mysqld/mysqld.sock
port		= 3306
basedir		= /usr
datadir		= /var/lib/mysql
tmpdir		= /var/tmp
language	= /usr/share/mysql/english

default-character-set=utf8
init-connect=SET NAMES utf8 

ssl-capath		= /etc/mysql
ssl-cert		= /etc/mysql/server-cert.pem
ssl-key			= /etc/mysql/server-key.pem
master-ssl-capath	= /etc/mysql
master-ssl-cert		= /etc/mysql/client-cert.pem
master-ssl-key		= /etc/mysql/client-key.pem

skip-external-locking
skip-bdb

log-bin         = sam
log-bin-index   = sam
log-error       = sam
report-host     = sam
relay-log       = sam-relay-bin
server_id       = 1040428527
master-user     = secret
master-password = secret

slave-net-timeout=45
master-connect-retry=15
relay-log-purge=1

key_buffer		= 16M
max_allowed_packet	= 1M
thread_stack		= 128K
set-variable 		= key_buffer=2M
set-variable 		= myisam_sort_buffer_size=8M
set-variable 		= join_buffer=1M
set-variable 		= record_buffer=1M
set-variable 		= sort_buffer=2M
set-variable 		= thread_cache_size=256
set-variable 		= max_connect_errors=4294967295
set-variable 		= max_connections=500
query_cache_limit 	= 1M
query_cache_size 	= 8M
query_cache_type 	= 1M

old_passwords	= 1

log-slow-queries 	= /var/log/mysql/slow.log
long_query_time		= 5
log-slow-admin-statements
log-warnings

[mysqldump]
quick
quote-names
max_allowed_packet	= 1M

[isamchk]
key_buffer		= 16M

Hello

I do agree that Torrey's bug is similar, yet definitely different from the one we are all experiencing. It might even warrant its own report.

As for the output of getconf GNU_LIBPTHREAD_VERSION:

That variable does not exist on the Solaris 9 systems I reproduced the bug on. The only "thread" relevant sysvars are:

POSIX_THREAD_ATTR_STACKADDR:    1
POSIX_THREAD_ATTR_STACKSIZE:    1
POSIX_THREAD_PRIORITY_SCHEDULING:       1
POSIX_THREAD_PRIO_INHERIT:      1
POSIX_THREAD_PRIO_PROTECT:      1
POSIX_THREAD_PROCESS_SHARED:    1
POSIX_THREAD_SAFE_FUNCTIONS:    1
PTHREAD_DESTRUCTOR_ITERATIONS:  undefined
PTHREAD_KEYS_MAX:               undefined
PTHREAD_STACK_MIN:              undefined
PTHREAD_THREADS_MAX:            undefined
_POSIX_THREADS:                 1
_XOPEN_REALTIME_THREADS:        1

mysqld is linked against /usr/lib/libthread.so.1, which appears to be the pthread library shipped with Solaris.

And: I've tried 5.0.32 with the keys provided in src/mysql-test/std_data, and - as expected - the behaviour doesn't change. The slave still hangs when issuing a "STOP SLAVE" command. And it hangs forever, or until a "flush logs" or similar is issued on the master.

As to Rick's post: The bug seems to appear in 5.0.23, so 4.0 versions will not be affected. And it only bites when using SSL encryption with openssl (not even yaSSL).

Related to http://bugs.mysql.com/bug.php?id=25203

Also related to http://bugs.mysql.com/bug.php?id=24148

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21111

ChangeSet@1.2458, 2007-03-05 10:07:22+01:00, msvensson@pilot.blaudden +2 -0
  Bug#21781 Replication slave io thread hangs
   - Add test case that shows how slave server hangs in "STOP SLAVE"
     when run on MySQL version 5.0.33 compiled with OpenSSL.
     Works fine with latest version of MySQL since that problem
     has been fixed by patch for bug#24148. The fix has been noted in
     the changelog for MySQL 5.0.36

pushed to 5.0.38, 5.1.17

I can confirm that the bug is no longer present in 5.0.37. Thanks!

Noted in 5.0.36, 5.1.15 changelogs.

SSL connections could hang at connection shutdown.