Bug #68229 MTS may get SIGSEGV because of partial deletes on rli->curr_group_da
Submitted: 31 Jan 2013 1:34 Modified: 31 Jan 2013 14:03
Reporter: Yoshinori Matsunobu (OCA)
Status: Closed
Category: MySQL Server: Replication
Severity: S2 (Serious)
Version: 5.6.9
OS: Any
CPU Architecture: Any

[31 Jan 2013 1:34] Yoshinori Matsunobu
Description:
Around sql/log_event.cc#apply_event() line 3070:
-----------
err:
  if (thd->is_error())
  {
    DBUG_ASSERT(!worker);

    // destroy buffered events of the current group prior to exit
    for (uint k= 0; k < rli->curr_group_da.elements; k++)
    {
      delete *(Log_event**) dynamic_array_ptr(&rli->curr_group_da, k);
    }
  }
-----------

rli->curr_group_da.elements is not reset here, so if a delete loop over the
same array is entered again, mysqld frees the same Log_event objects a second
time (a double free), which results in SIGSEGV.
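The failure mode can be modelled outside mysqld. The sketch below is a hypothetical simulation, not MySQL code: a std::vector<int*> stands in for rli->curr_group_da, and instead of calling delete (a real double delete is undefined behaviour) each "free" is recorded in a std::set so that freeing the same pointer again is counted. The names FreeTracker, apply_event_cleanup() and stop_workers_cleanup() are made up for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <set>
#include <vector>

// Hypothetical stand-in for the allocator: it records each "free"
// instead of performing it, so freeing the same pointer twice is
// counted rather than invoking undefined behaviour.
struct FreeTracker {
  std::set<const void*> freed;
  int double_frees = 0;

  void release(const void* p) {
    assert(p != nullptr);
    if (!freed.insert(p).second)  // insert fails: p was already freed
      ++double_frees;
  }
};

// Models the error path in apply_event(): every buffered event is
// freed, but the array and its element count are left untouched.
void apply_event_cleanup(FreeTracker& t, const std::vector<int*>& group_da) {
  for (std::size_t k = 0; k < group_da.size(); k++)
    t.release(group_da[k]);
}

// Models slave_stop_workers(): frees whatever the array still claims
// to hold, then drops the array itself.
void stop_workers_cleanup(FreeTracker& t, std::vector<int*>& group_da) {
  for (std::size_t i = 0; i < group_da.size(); i++)
    t.release(group_da[i]);
  group_da.clear();
}
```

Running apply_event_cleanup() and then stop_workers_cleanup() on a three-element array counts three double frees, one per buffered event: the second loop still sees the stale element count and walks over pointers that were already released.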

How to repeat:
In my test environment, I hit a crash loop with the following steps:

1. One master and one slave, multi-threaded slave enabled, 100 databases and 100 workers
2. Run heavy concurrent inserts on the master
3. kill -9 the slave mysqld during the load
4. Restart mysqld
5. START SLAVE (the slave keeps crashing)

Here is the backtrace from a core dump of the crashed slave:

Program terminated with signal 11, Segmentation fault.
#0  0x0000003b1280b122 in pthread_kill () from /lib64/libpthread.so.0
#1  0x000000000066a70a in handle_fatal_signal (sig=11)
    at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/sql/signal_handler.cc:248
#2  <signal handler called>
#3  0x00000000008b9ea3 in slave_stop_workers (rli=0x7f1594069c50,
    mts_inited=0x4099d0ef)
    at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/sql/rpl_slave.cc:5256
#4  0x00000000008c6889 in handle_slave_sql (arg=<optimized out>)
    at /export/home/pb2/build/sb_0-7655600-1353595193.21/mysql-5.6.9-rc/sql/rpl_slave.cc:5592
#5  0x0000003b128062f7 in start_thread () from /lib64/libpthread.so.0
#6  0x0000003b120d1e3d in clone () from /lib64/libc.so.6
#7  0x0000000000000000 in ?? ()

At rpl_slave.cc:5256, in slave_stop_workers(), there is another delete loop:
----
  for (uint i= 0; i < rli->curr_group_da.elements; i++)
    delete *(Log_event**) dynamic_array_ptr(&rli->curr_group_da, i);
  delete_dynamic(&rli->curr_group_da);             // GCDA
----
I'm confident that the entries of rli->curr_group_da had already been deleted in log_event.cc#apply_event() (verified with a debugger), so this SIGSEGV is caused by a double free.

Adding the line below right after the delete loop in apply_event() (log_event.cc) made the crash loop go away:
---
delete_dynamic(&rli->curr_group_da);
---
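The effect of that one-line fix can be sketched with a similar model. This is again hypothetical, not MySQL code: std::vector<CountedEvent*> stands in for the DYNAMIC_ARRAY, and discard_group() plays the role of delete_dynamic(), which (as far as I understand mysys) releases the array's buffer and resets its element count to zero. Once the count is zero, the loop in slave_stop_workers() has nothing left to delete. CountedEvent, discard_group(), apply_event_cleanup_fixed() and stop_workers_cleanup() are illustrative names only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical event type that counts how often it is destroyed.
struct CountedEvent {
  static int destroyed;
  ~CountedEvent() { ++destroyed; }
};
int CountedEvent::destroyed = 0;

// Hypothetical analogue of delete_dynamic(): drop the contents and the
// storage so that the element count becomes zero.
void discard_group(std::vector<CountedEvent*>& group_da) {
  std::vector<CountedEvent*>().swap(group_da);  // also releases capacity
  assert(group_da.empty());
}

// Error path in apply_event() with the fix applied: delete each
// buffered event, then reset the array.
void apply_event_cleanup_fixed(std::vector<CountedEvent*>& group_da) {
  for (std::size_t k = 0; k < group_da.size(); k++)
    delete group_da[k];
  discard_group(group_da);
}

// slave_stop_workers(): the loop is now a no-op because the array is
// already empty, so no event is deleted twice.
void stop_workers_cleanup(std::vector<CountedEvent*>& group_da) {
  for (std::size_t i = 0; i < group_da.size(); i++)
    delete group_da[i];
  discard_group(group_da);
}
```

Calling apply_event_cleanup_fixed() and then stop_workers_cleanup() on three heap-allocated events destroys each event exactly once. Resetting the array at the point where its contents are freed, rather than relying on every later cleanup path to know the contents are gone, is what removes the double free.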
[31 Jan 2013 14:03] Erlend Dahl
This has been fixed in 5.6.10, together with the fix for Bug #67798.