MySQL Bugs: #68506: Got SIGSEGV on MTS recovery + SQL thread error

Bug #68506	Got SIGSEGV on MTS recovery + SQL thread error
Submitted:	27 Feb 2013 7:23	Modified:	16 May 2013 17:14
Reporter:	Yoshinori Matsunobu (OCA)
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6.10	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
When I followed the "How to repeat" steps below, mysqld crashed with SIGSEGV.

----
07:03:04 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=101
max_threads=5024
thread_count=313
connection_count=1
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 2007704 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0xf290e80
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 593db0b0 thread_stack 0x40000
/data/mysql5610/bin/mysqld(my_print_stacktrace+0x35)[0x8f0c35]
/data/mysql5610/bin/mysqld(handle_fatal_signal+0x3e8)[0x66b0f8]
/lib64/libpthread.so.0[0x3b1280de70]
/data/mysql5610/bin/mysqld(_Z26apply_event_and_update_posPP9Log_eventP3THDP14Relay_log_info+0xee)[0x8c595e]
/data/mysql5610/bin/mysqld[0x8c6518]
/data/mysql5610/bin/mysqld(handle_slave_sql+0xca9)[0x8c7829]
/data/mysql5610/bin/mysqld(pfs_spawn_thread+0x13b)[0x932cbb]
/lib64/libpthread.so.0[0x3b128062f7]
/lib64/libc.so.6(clone+0x6d)[0x3b120d1e3d]
-----

From digging into the core file, mysqld crashed here:
---
apply_event_and_update_pos(Log_event** ptr_ev, THD* thd, Relay_log_info* rli)
  if (!(rli->is_mts_recovery() && bitmap_is_set(&rli->recovery_groups,
                                                rli->mts_recovery_index)))
---

At step 7 of "How to repeat" below, rli->is_mts_recovery() was true but
rli->recovery_groups.bitmap was 0x0, so bitmap_is_set() dereferenced a null pointer and raised SIGSEGV.
$8 = {bitmap = 0x0, n_bits = 524280, last_word_mask = 4278190080, last_word_ptr = 0x34ce18c, mutex = 0x0}
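
To illustrate why a freed bitmap leads to SIGSEGV, here is a minimal sketch. ToyBitmap and the toy_* functions are hypothetical stand-ins written for this illustration, not the real MY_BITMAP API; the assumption is that the real bitmap_is_set() indexes the word array directly without a null check, which matches the crash above.

```cpp
#include <cassert>

// Hypothetical, simplified stand-in for MY_BITMAP: only the two fields
// needed for this illustration.
struct ToyBitmap {
  unsigned char *bitmap;  // bit storage; null once the bitmap has been freed
  unsigned n_bits;        // number of valid bits
};

// Unchecked test: indexes the storage directly, so a null `bitmap`
// pointer is dereferenced (the failure mode described in this report).
inline bool toy_bitmap_is_set(const ToyBitmap *map, unsigned bit) {
  return (map->bitmap[bit / 8] & (1u << (bit & 7))) != 0;
}

// Defensive variant: treats a freed bitmap or an out-of-range bit as
// "not set" instead of touching memory.
inline bool toy_bitmap_is_set_checked(const ToyBitmap *map, unsigned bit) {
  assert(map != nullptr);
  if (map->bitmap == nullptr || bit >= map->n_bits) return false;
  return toy_bitmap_is_set(map, bit);
}
```

With a ToyBitmap of {nullptr, 524280}, mirroring the gdb output above, the checked variant returns false, whereas the unchecked one would read through a null pointer. Note that n_bits alone says nothing about whether the storage is still allocated.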

When the SQL thread terminates (including on error), rli->recovery_groups is freed:
handle_slave_sql()
  if (rli->recovery_groups_inited)
  {
    bitmap_free(&rli->recovery_groups);
    rli->recovery_groups_inited= false;
  }

rli->recovery_groups appears to be allocated when the global Relay_log_info instance is created, but it appears never to be re-initialized after "bitmap_free(&rli->recovery_groups)".

How to repeat:
1. Enable MTS on the slave (set slave_parallel_workers large enough)
2. Insert into master databases from multiple clients (I tested 100 databases from 100 clients)
3. Kill the slave mysqld while the slave is lagging behind the master
4. Restart the slave with --skip-slave-start
5. Manually insert the missing rows on the slave (to intentionally cause the error in step 6)
6. START SLAVE. The SQL thread then stops with a duplicate key error
7. START SLAVE again.

I mistyped the title :)

After some effort I was able to crash 5.6.10 here:

mysqld.exe!apply_event_and_update_pos()[rpl_slave.cc:3306]
mysqld.exe!exec_relay_log_event()[rpl_slave.cc:3707]
mysqld.exe!handle_slave_sql()[rpl_slave.cc:5516]
mysqld.exe!pfs_spawn_thread()[pfs.cc:1856]
mysqld.exe!pthread_start()[my_winthread.c:63]
mysqld.exe!_callthreadstartex()[threadex.c:314]
mysqld.exe!_threadstartex()[threadex.c:292]

rli was 0x00000000 here:

if (!(rli->is_mts_recovery() && bitmap_is_set(&rli->recovery_groups,
                                                rli->mts_recovery_index)))
  {
    reason= ev->shall_skip(rli);
  }

And on a debug build, I hit the exact crash reported:

mysqld-debug.exe!bitmap_is_set()[my_bitmap.h:101]
mysqld-debug.exe!apply_event_and_update_pos()[rpl_slave.cc:3306]
mysqld-debug.exe!exec_relay_log_event()[rpl_slave.cc:3701]
mysqld-debug.exe!handle_slave_sql()[rpl_slave.cc:5516]
mysqld-debug.exe!pfs_spawn_thread()[pfs.cc:1855]
mysqld-debug.exe!pthread_start()[my_winthread.c:62]
mysqld-debug.exe!_callthreadstartex()[threadex.c:314]
mysqld-debug.exe!_threadstartex()[threadex.c:297]

Also repeatable on the latest 5.6.11 from internal bzr.

Hi Yoshinori,
 Will the attached patch fix your problem? Let me know if
 it does not.  Thanks!

=== modified file 'sql/rpl_slave.cc'
--- sql/rpl_slave.cc	revid:saikumar.v@oracle.com-20130318054358-77zqaztvuroujo5s
+++ sql/rpl_slave.cc	revid:manish.4.kumar@oracle.com-20130318070143-ec45rxdkg37be0u2
@@ -5622,6 +5622,7 @@
   if (rli->recovery_groups_inited)
   {
     bitmap_free(&rli->recovery_groups);
+    rli->mts_recovery_group_cnt= 0;
     rli->recovery_groups_inited= false;
   }
 
Regards,
Manish Kumar
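
The effect of the one-line patch above can be sketched with a toy model. Everything here (ToyRli and the toy_* functions) is hypothetical and heavily simplified; in particular, it assumes that is_mts_recovery() is derived from mts_recovery_group_cnt being non-zero, which is what the patch suggests but is not shown in this report.

```cpp
#include <cassert>
#include <cstdlib>

// Toy model of the relevant Relay_log_info state (hypothetical; the real
// class has many more members).
struct ToyRli {
  unsigned char *recovery_groups = nullptr;  // stands in for recovery_groups.bitmap
  bool recovery_groups_inited = false;
  unsigned long mts_recovery_group_cnt = 0;

  // Assumption: recovery mode means "there are still groups to recover".
  bool is_mts_recovery() const { return mts_recovery_group_cnt != 0; }
};

// Models recovery set-up after a slave crash: allocate the bitmap and
// record how many groups need to be checked.
void toy_begin_recovery(ToyRli &rli, unsigned long groups) {
  rli.recovery_groups =
      static_cast<unsigned char *>(std::calloc(groups / 8 + 1, 1));
  assert(rli.recovery_groups != nullptr);
  rli.recovery_groups_inited = true;
  rli.mts_recovery_group_cnt = groups;
}

// Models the clean-up when the SQL thread exits. `apply_fix` toggles the
// line added by the patch.
void toy_sql_thread_exit(ToyRli &rli, bool apply_fix) {
  if (rli.recovery_groups_inited) {
    std::free(rli.recovery_groups);
    rli.recovery_groups = nullptr;
    if (apply_fix) rli.mts_recovery_group_cnt = 0;  // the patched line
    rli.recovery_groups_inited = false;
  }
}

// True when the next START SLAVE would consult a freed bitmap.
bool toy_would_crash(const ToyRli &rli) {
  return rli.is_mts_recovery() && rli.recovery_groups == nullptr;
}
```

In this model, calling toy_begin_recovery() and then toy_sql_thread_exit() without the fix leaves the state inconsistent (recovery flag still on, bitmap gone), so toy_would_crash() returns true. With the fix, the count is reset together with the bitmap and the second START SLAVE no longer takes the recovery path.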

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Fixed in 5.6+. Documented as follows in the 5.6.12 and 5.7.2 changelogs:

        An SQL thread error during MTS slave recovery caused the slave
        to fail.

Closed.

http://bugs.mysql.com/bug.php?id=69126 marked as duplicate of this one.