Bug #54509 rpl_show_slave_running crashes the server sporadically
Submitted: 15 Jun 2010 9:15 Modified: 15 Nov 2010 13:12
Reporter: Alexander Nozdrin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:M3 (Celosia), M4 (Dahlia), 6.0 OS:Any
Assigned to: Luis Soares CPU Architecture:Any
Tags: pb2, sporadic, test failure

[15 Jun 2010 9:15] Alexander Nozdrin
Description:
Log extract:

Server [mysqld.2 - pid: 10061, winpid: 10061, exit: 256] failed during test run
Server log from this test:
100611 12:23:27 [Note] Buffered information: Performance schema enabled.

100611 12:23:27 [Note] Plugin 'InnoDB' is disabled.
100611 12:23:27 [Note] Plugin 'FEDERATED' is disabled.
100611 12:23:28 [Note] Event Scheduler: Loaded 0 events
100611 12:23:28 [Note] /export/home2/pb2/test/sb_1-1918902-1276245569.49/mysql-5.5.5-m3-linux-x86_64-test/bin/mysqld: ready for connections.
Version: '5.5.5-m3-debug-log'  socket: '/export/home2/pb2/test/sb_1-1918902-1276245569.49/tmp/pDLD9TNpe1/mysqld.2.sock'  port: 13011  Source distribution
100611 12:23:28 [Note] 'CHANGE MASTER TO executed'. Previous state master_host='', master_port='3306', master_log_file='', master_log_pos='4'. New state master_host='127.0.0.1', master_port='13010', master_log_file='', master_log_pos='4'.
100611 12:23:28 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:13010',replication started in log 'FIRST' at position 4
100611 12:23:28 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
100611 12:23:29 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
100611 12:23:29 [Note] Slave I/O thread killed while reading event
100611 12:23:29 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 107
100611 12:23:29 [Note] Error reading relay log event: slave SQL thread was killed
100611 12:23:29 [Note] Slave I/O thread: connected to master 'root@127.0.0.1:13010',replication started in log 'master-bin.000001' at position 107
100611 12:23:29 [Note] Slave SQL thread initialized, starting replication in log 'master-bin.000001' at position 107, relay log './slave-relay-bin.000003' position: 254
100611 12:23:29 - mysqld got signal 11 ;

How to repeat:
http://tinyurl.com/39af276
[15 Jun 2010 16:44] Sveta Smirnova
Thank you for the report.

Verified as described based on pushbuid logs. Not repeatable in my environment.
[16 Jun 2010 11:27] Luis Soares
All threads bt.

Attachment: threads-bt.txt (text/plain), 7.73 KiB.

[24 Jun 2010 17:56] Luis Soares
1. Looking at the stack traces we get:

Thread 2 (process 8686):
#0  0x0000003c3cecb417 in fdatasync () from /lib64/libc.so.6
#1  0x0000000000a4cb5b in my_sync (fd=51, my_flags=16) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/my_sync.c:62
#2  0x00000000007def15 in inline_mysql_file_sync (src_file=0xb9e9a8 "/export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/log.cc", 
    src_line=2276, fd=51, flags=16) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/include/mysql/psi/mysql_file.h:1389
#3  0x00000000007df66c in MYSQL_LOG::close (this=0x674cbe8, exiting=3) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/log.cc:2276
#4  0x00000000007df9c1 in MYSQL_BIN_LOG::close (this=0x674cbe0, exiting=3)
#5  0x00000000007e5326 in MYSQL_BIN_LOG::new_file_impl (this=0x674cbe0, need_lock=true)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/log.cc:4061
#6  0x00000000007e6eec in MYSQL_BIN_LOG::new_file (this=0x674cbe0) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/log.cc:3970
#7  0x000000000054bf96 in rotate_relay_log (mi=0x674b320) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/slave.cc:4808
#8  0x000000000054edd1 in process_io_rotate (mi=0x674b320, rev=0x4149ce60)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/slave.cc:3606
#9  0x0000000000551918 in queue_event (mi=0x674b320, buf=0x67df941 "", event_len=44)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/slave.cc:3861
#10 0x0000000000556b44 in handle_slave_io (arg=0x674b320) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/slave.cc:2958
#11 0x00000000009fec5f in pfs_spawn_thread (arg=0x67cb600)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/storage/perfschema/pfs.cc:1011
#12 0x0000003c3da062f7 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003c3ced1b6d in clone () from /lib64/libc.so.6

Thread 1 (process 8616):
#0  0x0000003c3da0b122 in pthread_kill () from /lib64/libpthread.so.0
#1  0x0000000000a4f97b in my_write_core (sig=6) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/stacktrace.c:326
#2  0x0000000000534403 in handle_segfault (sig=6) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/mysqld.cc:2786
#3  <signal handler called>
#4  0x0000003c3ce30155 in raise () from /lib64/libc.so.6
#5  0x0000003c3ce31bf0 in abort () from /lib64/libc.so.6
#6  0x0000000000a567b3 in safe_mutex_lock (mp=0x674cf98, try_lock=0 '\0', 
    file=0xbf4380 "/export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/mf_iocache.c", line=1724)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/thr_mutex.c:159
#7  0x0000000000a2e81c in inline_mysql_mutex_lock (that=0x674cf98, 
    src_file=0xbf4380 "/export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/mf_iocache.c", src_line=1724)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/include/mysql/psi/mysql_thread.h:636
#8  0x0000000000a305f1 in my_b_flush_io_cache (info=0x674cf38, need_append_buffer_lock=1)
    at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/mf_iocache.c:1724
#9  0x0000000000a30ef4 in my_b_seek (info=0x674cf38, pos=0) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/mf_iocache2.c:138
#10 0x000000000069dcb2 in init_master_info (mi=0x674b320, master_info_fname=0xaf634b "master.info", slave_info_fname=0xaf6357 "relay-log.info", 
    abort_if_no_master_info_file=false, thread_mask=2) at /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/sql/rpl_mi.cc:164

2. And when checking the output from the server we see:

   Got error 22 when trying to lock mutex at
   /export/home3/pb2/build/sb_2-1856869-1274975242.48/mysql-5.6.99-m4/mysys/mf_iocache.c,
   line 1724

From #1 and #2, seems that both SQL and IO thread are racing for
the IO_CACHE mutex. This is in fact a very similar scenario to the one
analyzed in BUG#50364, however, it's not the same cause. This one is
caused by an implicit flush of the IO_CACHE when my_b_seek is called
by SQL thread (at init_master_info time).

So, it seems that:

 1. IO thread rotates the log
 2. SQL thread does not check the log_lock and reaches for the
    IO_CACHE
 3. IO thread closes the log (after calling end_io_cache)
 4. SQL reaches out for an invalid IO_CACHE.

I guess we need to protect this seek with log_lock.
[24 Jun 2010 18:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/112134

3096 Luis Soares	2010-06-24
      BUG#54509: rpl_show_slave_running crashes the server sporadically
      
      Problem: SQL and IO thread were racing for the IO_CACHE. The former to
      flush it, the latter to close it. In some cases this would cause the
      SQL thread to lock an invalid IO_CACHE mutex (it had been destroyed by
      IO thread). This would happen when SQL thread was initializing the
      master.info
      
      Solution: We solve this by locking the log and checking if it is
      hot. If it is we keep the log while seeking. Otherwise we release it
      right away, because a log can get from hot to cold, but not from cold
      to hot.
[8 Jul 2010 9:29] Luis Soares
Queued in mysql-trunk-bugfixing and merged to mysql-next-mr-bugfixing:
- http://pb2.norway.sun.com/web.py?template=push_details&push=1402801
- http://pb2.norway.sun.com/web.py?template=push_details&push=1402747
[23 Jul 2010 12:27] Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:34] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[6 Aug 2010 14:39] Jon Stephens
Need to know for docs -- Was this only a test issue, or could users hit this bug, and if so, how?

Thanks!
[2 Sep 2010 11:23] Luis Soares
Hi Jon,

There could be a potential server crash when SQL thread was
started and at the same time IO thread rotates the hot relay log.
They would race for the same IO cache (one thread would close it
and the other would try to flush it after it was closed,
effectively grabbing an invalide mutex).

So, Jon, answering your questions:

  1. No, it's not a test only issue.

  2. Yes, users could hit this bug (seldomly), if they were
     unlucky enough to:

      a. start the SQL thread while the IO thread was rotating
         the hot relay log;

      b. IO thread races SQL thread and destroys the IO cache and
         mutex before SQL thread flushes it.

Cheers,
Luís
[2 Sep 2010 15:26] Jon Stephens
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at

  http://www.mysql.com/downloads/
[2 Sep 2010 15:27] Jon Stephens
Documented bugfix in the 5.5.6 changelog as follows:

        If the SQL thread was started while the I/O thread rotated the
        hot relay log, the 2 threads could race for the same I/O cache,
        leading to a crash of the server.

Closed.
[18 Oct 2010 11:44] Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101018114405-6v4zrv5vjx9amgnn) (version source revid:alexander.nozdrin@oracle.com-20101018114405-6v4zrv5vjx9amgnn) (pib:21)
[13 Nov 2010 16:04] Bugs System
Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)
[15 Nov 2010 13:12] Jon Stephens
Set back to Closed. See previous comments.