MySQL Bugs: #22993: Master hangs in SSL replication when the slave runs out of disk space

Bug #22993	Master hangs in SSL replication when the slave runs out of disk space
Submitted:	4 Oct 2006 19:59	Modified:	13 Jul 2007 16:58
Reporter:	Harrison Fisk
Status:	Duplicate
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.42	OS:	Linux (Linux 2.6.15 (Ubuntu 6.06 LTS))
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	bfsm_2007_03_01

Description:
When replication is set up to use SSL connections and the slave gets a disk full error while writing to the relay log, the master server hangs after a short time.  The master will not accept any new connections, nor will it execute any queries from existing connections.  Once the slave is given more space, the master resumes processing everything as normal.

This does not seem reproducible with non-SSL replication.

This is similar to issue #22082, but it is more severe because the slave can lock up the master.

How to repeat:
1.  Set up replication using SSL
2.  On the slave execute: STOP SLAVE SQL_THREAD;
3.  Fill up the slave's partition that stores the relay logs so that it is completely full
4.  Issue a lot of replicated commands on the master (quite a few are needed due to space allocation; loading world.sql worked for me)
5.  Error 28 is received in the error log on the slave for the IO_THREAD writing relay logs
6.  Wait a while and the master will become unresponsive: all queries will hang and all new connections will hang (it can sometimes take a few minutes to reach this state)
7.  Free up some space on the slave and the master becomes responsive again

Suggested fix:
Allow the server to continue handling queries and connections even when the IO_THREAD connection is hung.

A workaround is to avoid running out of disk space on the slave, or to not use SSL replication.

While I was debugging the issue I got the following two backtraces; I am not sure whether either of them is helpful:

(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7ee09f8 in send () from /lib/tls/i686/cmov/libpthread.so.0
#2  0x083f00a4 in yaSSL::Socket::send (this=0x8c3512c,
    buf=0x8c6aee8 "\027\003\001", sz=181, flags=0) at socket_wrapper.cpp:122
#3  0x083e59d1 in yaSSL::SSL::Send (this=0x8c347c8,
    buffer=0x8c6aee8 "\027\003\001", sz=181) at yassl_int.cpp:1013
#4  0x083ef5fc in yaSSL::sendData (ssl=@0x8c347c8, buffer=0x8c63030, sz=142)
    at handshake.cpp:892
#5  0x083d8685 in yaSSL_write (ssl=0x8c347c8, buffer=0x8c63030, sz=142)
    at ssl.cpp:211
#6  0x0839a852 in vio_ssl_write (vio=0x8c4d0b0, buf=0x8c63030 "\212", size=142)
    at viossl.c:104
#7  0x08172fa0 in net_real_write (net=0x8c4bf0c, packet=0x8c63030 "\212",
    len=142) at net_serv.cc:608
#8  0x081729f5 in net_flush (net=0x8c4bf0c) at net_serv.cc:333
#9  0x082715b7 in mysql_binlog_send (thd=0x8c4b700,
    log_ident=0x8c6cb20 "hfisk-desktop-bin.000008", pos=4057, flags=0)
    at sql_repl.cc:574
#10 0x0819069a in dispatch_command (command=COM_BINLOG_DUMP, thd=0x8c4b700,
    packet=0x8c63031 "", packet_length=35) at sql_class.h:725
#11 0x0818f903 in do_command (thd=0x8c4b700) at sql_parse.cc:1538
#12 0x0818ec72 in handle_one_connection (arg=0xfffffe00) at sql_parse.cc:1175
#13 0xb7edb341 in start_thread () from /lib/tls/i686/cmov/libpthread.so.0
#14 0xb7e084ee in clone () from /lib/tls/i686/cmov/libc.so.6

(gdb) bt
#0  0xffffe410 in __kernel_vsyscall ()
#1  0xb7ee02ae in __lll_mutex_lock_wait ()
   from /lib/tls/i686/cmov/libpthread.so.0
#2  0xb7edcfbb in _L_mutex_lock_33 () from /lib/tls/i686/cmov/libpthread.so.0
#3  0xbfb11308 in ?? ()
#4  0x00000010 in ?? ()
#5  0x083b49bb in safe_mutex_lock (mp=0x85ed520,
    file=0x841a4b3 "mysql_priv.h", line=1534) at thr_mutex.c:116
#6  0x0816bc13 in THD (this=0x8c67040) at mysql_priv.h:1534
#7  0x0817e25b in handle_connections_sockets (arg=0x0) at sql_list.h:421
#8  0x0817d824 in main (argc=2, argv=0xbfb11574) at mysqld.cc:3523
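
The first backtrace shows the thread serving COM_BINLOG_DUMP sitting in send(). As a rough illustration of that general behaviour only (this is not MySQL or yaSSL code), the Python sketch below shows a sender stalling once the socket buffers fill because the peer has stopped reading, and resuming once the peer drains the data, much like steps 5 to 7 above. The use of socketpair(), the 4096-byte chunk and the 0.2 second timeout (a stand-in for "blocks indefinitely") are assumptions chosen for the demonstration.

```python
import socket


def stall_and_resume(chunk_size=4096, timeout_s=0.2):
    """Fill a socket whose peer is not reading, then let the peer drain it.

    Returns (bytes_sent_before_stall, stalled, resumed).
    """
    sender, receiver = socket.socketpair()
    sender.settimeout(timeout_s)    # stand-in for a send() that never returns
    receiver.settimeout(timeout_s)
    chunk = b"x" * chunk_size
    sent = 0
    stalled = False
    try:
        # The peer never calls recv(), so the kernel buffers eventually fill
        # and send() can make no further progress.
        try:
            while True:
                sent += sender.send(chunk)
        except socket.timeout:
            stalled = True

        # The peer starts reading again (compare: space is freed on the slave).
        drained = 0
        try:
            while drained < sent:
                drained += len(receiver.recv(65536))
        except socket.timeout:
            pass

        # With the buffers empty again, the sender can make progress.
        try:
            resumed = sender.send(chunk) > 0
        except socket.timeout:
            resumed = False
    finally:
        sender.close()
        receiver.close()
    return sent, stalled, resumed


if __name__ == "__main__":
    print(stall_and_resume())
```

This only models the blocked write itself. It does not explain why other connections on the master hang as well, which is what the second backtrace relates to.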

This bug also appears to be triggered when you run out of space due to relay_log_space_limit.  You can repeat it very easily by setting relay_log_space_limit to a small size and then turning off the SQL_THREAD.

I've been unable to reproduce this manually. I'm checking to see if a test case can be written to reproduce the problem.

Test case to reproduce the hang

Attachment: rpl_ssl_hang.test (application/octet-stream, text), 1.24 KiB.

opt file for test case

Attachment: rpl_ssl_hang-slave.opt (application/octet-stream, text), 29 bytes.

I have uploaded a mysql-test case to this issue.

I have run it against MySQL 5.0.42 as:

/usr/local/mysql/mysql-test$ ./mysql-test-run rpl_ssl_hang

In another shell, I then connected to the master database and listed the processlist over and over as:

mysqladmin -i 1 -P 9306 -h 127.0.0.1 pro

After a while, the display would freeze and stop producing output.  The time this took varied from about 60 seconds to 130 seconds, but it always happened for me.

The process state on the master would be 'Writing to net', and on the slave it would be 'Waiting for the slave SQL thread to free enough relay log space'.
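
The manual check above (watching mysqladmin output until it freezes) could be automated along the following lines. This is a minimal sketch, not part of the uploaded test case: the function names, the 5 second probe timeout and the probe count are hypothetical, and only the port and host are taken from the mysqladmin command shown earlier. Note that a probe that fails quickly (for example, connection refused) still counts as "responding" here, since only a timeout is treated as a hang.

```python
import subprocess
import time

# Port and host as used in the comment above; "processlist" is the full
# spelling of the "pro" command.
MASTER_PROBE = ["mysqladmin", "-P", "9306", "-h", "127.0.0.1", "processlist"]


def probe_responds(cmd, timeout_s):
    """Return True if cmd exits within timeout_s seconds, False if it hangs."""
    try:
        subprocess.run(
            cmd,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
        return True
    except subprocess.TimeoutExpired:
        # subprocess.run() kills the child before raising this exception.
        return False


def seconds_until_hang(cmd, timeout_s=5.0, interval_s=1.0, max_probes=300):
    """Probe repeatedly; return elapsed seconds at the first hung probe,
    or None if every probe returned in time."""
    start = time.monotonic()
    for _ in range(max_probes):
        if not probe_responds(cmd, timeout_s):
            return time.monotonic() - start
        time.sleep(interval_s)
    return None


if __name__ == "__main__":
    print(seconds_until_hang(MASTER_PROBE))
```

Running this alongside ./mysql-test-run rpl_ssl_hang would give a rough figure for how long the master takes to stop responding, without having to watch the terminal.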

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/29784

ChangeSet@1.2507, 2007-06-27 16:46:23-04:00, dkatz@damien-katzs-computer.local +1 -0
  Bug #22993  	Master hangs in SSL replication when the slave runs out of disk space
  
  Removed unused close_notify "alert" that was causing hangs when the connection was paused or slow.

Duplicate of Bug #29579  Clients using SSL can hang the server