Bug #34305 show slave status handling segfaults when slave io is about to leave
Submitted: 5 Feb 2008 11:35 Modified: 16 Feb 2008 10:45
Reporter: Andrei Elkin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.x, x>=0 OS:Any
Assigned to: Andrei Elkin CPU Architecture:Any

[5 Feb 2008 11:35] Andrei Elkin
Description:
Failures on pb with rpl_packet

rpl_packet                     [ fail ]

mysqltest: In included file "./include/wait_for_slave_io_to_stop.inc": At line 21: Error running query ' SHOW SLAVE STATUS': 2013 Lost connection to MySQL server during query

The result from queries just before the failure was:
< snip >
reset slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
start slave;
drop database if exists DB_NAME_OF_MAX_LENGTH_AKA_NAME_LEN_64_BYTES_____________________;

and slave.err having

080204 18:34:49 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:11310',  replication started in log 'master-bin.000001' at position 2028
080204 18:34:49 [ERROR] Error reading packet from server: Got packet bigger than 'max_allowed_packet' bytes ( server_errno=2020)
080204 18:34:49 [ERROR] Log entry on master is longer than max_allowed_packet (1485) on slave. If the entry is correct, restart the server with a higher value of max_allowed_packet
080204 18:34:49 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 2138
080204 18:34:49 - mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_connections=100
threads_connected=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 39424 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0xef6ec8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x408801c0, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x614da7
0x607d5a
0x797174
0x62f1f6
0x6369bc
0x639499
0x63ac33
0x63bfed

seem to be due to a flaw in concurrent accessing the slave's io thd by
the io itself and a handling show slave status thread.

How to repeat:
./mysql-test-run rpl_packet

still it's not so straightforward and requires "adjusting" timing. 
I did it this way:
When the io passed 
   sql_print_information("Slave I/O thread exiting,...)
and reached 
   delete thd;
i stop the thread and let the handling show slave status to reach

    protocol->store(mi->io_thd ? mi->io_thd->proc_info : "", &my_charset_bin);

Now if we let the io to go `thd' will be destroyed and in the following the show threads segfaults because of mi->io_thd->proc_info is not allocated anymore.

Suggested fix:
To implement a TODO left in show_master_info to acquire mi->run_lock mutex prior
protocol->store().
[5 Feb 2008 15:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41718

ChangeSet@1.2576, 2008-02-05 17:36:26+02:00, aelkin@koti.dsl.inet.fi +2 -0
  Bug #34305 show slave status handling segfaults when slave io is about
      to leave
  
  The artifact was caused by
  a flaw in concurrent accessing the slave's io thd by
  the io itself and a handling show slave status thread.
  Namely, show_master_info did not acquire mi->run_lock mutex that is
  specified for mi->io_thd member.
  
  Fixed with deploying the mutex locking and unlocking. The mutex is kept
  short time and without interleaving with mi->data_lock mutex.
  
  Todo: to report and fix an issue with 
      sys_var_slave_skip_counter::{methods} 
  seem to acquire incorrectly
       active_mi->rli.run_lock
  instead of the specified
       active_mi->rli.data_lock
  
  A test case is difficult to compose, so rpl_packet should continue serving
  as the indicator.
[5 Feb 2008 15:50] Andrei Elkin
Pushed to mysql-5.0-rpl.
[15 Feb 2008 13:40] Bugs System
Pushed into 5.0.58
[15 Feb 2008 13:40] Bugs System
Pushed into 5.1.24-rc
[15 Feb 2008 13:41] Bugs System
Pushed into 6.0.5-alpha
[16 Feb 2008 10:45] Jon Stephens
Documented in the 5.0.48, 5.1.24, and 6.0.5 changelogs as follows:

        SHOW SLAVE STATUS failed when slave I/O was
        about to terminate.
[2 Apr 2008 19:41] Jon Stephens
Also noted in the 5.1.23-ndb-6.2.14 and 5.1.23-ndb-6.3.11 changelogs.