MySQL Bugs: #34305: show slave status handling segfaults when slave io is about to leave

Bug #34305	show slave status handling segfaults when slave io is about to leave
Submitted:	5 Feb 2008 11:35	Modified:	16 Feb 2008 10:45
Reporter:	Andrei Elkin	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.x, x>=0	OS:	Any
Assigned to:	Andrei Elkin	CPU Architecture:	Any

Description:
Failures on pb with rpl_packet

rpl_packet                     [ fail ]

mysqltest: In included file "./include/wait_for_slave_io_to_stop.inc": At line 21: Error running query ' SHOW SLAVE STATUS': 2013 Lost connection to MySQL server during query

The result from queries just before the failure was:
< snip >
reset slave;
drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
start slave;
drop database if exists DB_NAME_OF_MAX_LENGTH_AKA_NAME_LEN_64_BYTES_____________________;

and slave.err having

080204 18:34:49 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:11310',  replication started in log 'master-bin.000001' at position 2028
080204 18:34:49 [ERROR] Error reading packet from server: Got packet bigger than 'max_allowed_packet' bytes ( server_errno=2020)
080204 18:34:49 [ERROR] Log entry on master is longer than max_allowed_packet (1485) on slave. If the entry is correct, restart the server with a higher value of max_allowed_packet
080204 18:34:49 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 2138
080204 18:34:49 - mysqld got signal 11;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=1048576
read_buffer_size=131072
max_used_connections=2
max_connections=100
threads_connected=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 39424 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd=0xef6ec8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0x408801c0, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x614da7
0x607d5a
0x797174
0x62f1f6
0x6369bc
0x639499
0x63ac33
0x63bfed

seem to be due to a flaw in concurrent accessing the slave's io thd by
the io itself and a handling show slave status thread.

How to repeat:
./mysql-test-run rpl_packet

still it's not so straightforward and requires "adjusting" timing. 
I did it this way:
When the io passed 
   sql_print_information("Slave I/O thread exiting,...)
and reached 
   delete thd;
i stop the thread and let the handling show slave status to reach

    protocol->store(mi->io_thd ? mi->io_thd->proc_info : "", &my_charset_bin);

Now if we let the io to go `thd' will be destroyed and in the following the show threads segfaults because of mi->io_thd->proc_info is not allocated anymore.

Suggested fix:
To implement a TODO left in show_master_info to acquire mi->run_lock mutex prior
protocol->store().

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/41718

ChangeSet@1.2576, 2008-02-05 17:36:26+02:00, aelkin@koti.dsl.inet.fi +2 -0
  Bug #34305 show slave status handling segfaults when slave io is about
      to leave
  
  The artifact was caused by
  a flaw in concurrent accessing the slave's io thd by
  the io itself and a handling show slave status thread.
  Namely, show_master_info did not acquire mi->run_lock mutex that is
  specified for mi->io_thd member.
  
  Fixed with deploying the mutex locking and unlocking. The mutex is kept
  short time and without interleaving with mi->data_lock mutex.
  
  Todo: to report and fix an issue with 
      sys_var_slave_skip_counter::{methods} 
  seem to acquire incorrectly
       active_mi->rli.run_lock
  instead of the specified
       active_mi->rli.data_lock
  
  A test case is difficult to compose, so rpl_packet should continue serving
  as the indicator.

Pushed to mysql-5.0-rpl.

Pushed into 5.0.58

Pushed into 5.1.24-rc

Pushed into 6.0.5-alpha

Documented in the 5.0.48, 5.1.24, and 6.0.5 changelogs as follows:

        SHOW SLAVE STATUS failed when slave I/O was
        about to terminate.

Also noted in the 5.1.23-ndb-6.2.14 and 5.1.23-ndb-6.3.11 changelogs.