Bug #50364 FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic fails in PB sporadically)
Submitted: 15 Jan 2010 12:49 Modified: 14 Oct 2010 12:27
Reporter: Alexander Nozdrin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:Betony (M2), Celosia (M3) OS:Any
Assigned to: Luis Soares CPU Architecture:Any
Tags: pb2, sporadic, test failure

[15 Jan 2010 12:49] Alexander Nozdrin
Description:
rpl.rpl_heartbeat_basic fails in PB sporadically due to server crash
after FLUSH LOGS query:

CURRENT_TEST: rpl.rpl_heartbeat_basic
mysqltest: At line 390: query 'FLUSH LOGS' failed: 2013: Lost connection to MySQL server during query

Log extract will be attached.

How to repeat:
XRef2: http://tinyurl.com/y94qr8b
[15 Jan 2010 12:50] Alexander Nozdrin
Log extract

Attachment: log (application/octet-stream, text), 22.89 KiB.

[17 Jan 2010 7:33] Sveta Smirnova
Thank you for the report.

Set to "Verified" as this documented in pushbuild. Not repeatable on my test box.
[22 Jan 2010 10:09] Luis Soares
I have come accross this stack trace while checking out this
failure:

Got error 22 when trying to lock mutex at mf_iocache.c, line 1724

[...]

Thread 3 (process 8088):
#0  0x0000003a232cb417 in fdatasync () from /lib64/libc.so.6
#1  0x0000000000f02f5b in my_sync (fd=31, my_flags=16) at my_sync.c:62
#2  0x00000000008bb5cf in MYSQL_BIN_LOG::sync_purge_index_file (this=0x7c5a7d0)
    at log.cc:3498
#3  0x00000000008c2b2c in MYSQL_BIN_LOG::open (this=0x7c5a7d0, 
    log_name=0x2aaab0005008 "slave-relay-bin", log_type_arg=LOG_BIN, 
    new_name=0x4203a000 "./slave-relay-bin.000052", 
    io_cache_type_arg=SEQ_READ_APPEND, no_auto_events_arg=false, 
    max_size_arg=1073741824, null_created_arg=true, need_mutex=false)
    at log.cc:2683
#4  0x00000000008c473d in MYSQL_BIN_LOG::new_file_impl (this=0x7c5a7d0, 
    need_lock=true) at log.cc:3943
#5  0x00000000008c7270 in MYSQL_BIN_LOG::new_file (this=0x7c5a7d0)
    at log.cc:3834
#6  0x0000000000a2c20b in rotate_relay_log (mi=0x7c58f10) at slave.cc:4724
#7  0x0000000000758a35 in reload_acl_and_cache (thd=0x7c66b78, options=2, 
    tables=0x0, write_to_binlog=0x4203afff) at sql_parse.cc:6872
#8  0x000000000076acac in mysql_execute_command (thd=0x7c66b78)
    at sql_parse.cc:3879
#9  0x0000000000770841 in mysql_parse (thd=0x7c66b78, 
    inBuf=0x7ca77d8 "FLUSH LOGS", length=10, found_semicolon=0x4203beb0)
    at sql_parse.cc:5987
#10 0x0000000000772776 in dispatch_command (command=COM_QUERY, thd=0x7c66b78, 
    packet=0x7c9f749 "FLUSH LOGS", packet_length=10) at sql_parse.cc:1140
#11 0x0000000000774811 in do_command (thd=0x7c66b78) at sql_parse.cc:810
#12 0x00000000007548dc in do_handle_one_connection (thd_arg=0x7c66b78)
    at sql_connect.cc:1172
#13 0x0000000000754a9b in handle_one_connection (arg=0x7c66b78)
    at sql_connect.cc:1112
#14 0x0000000000ecf4b6 in pfs_spawn_thread (arg=0x7c82be8) at pfs.cc:1011
#15 0x0000003a23e062f7 in start_thread () from /lib64/libpthread.so.0
#16 0x0000003a232d1b6d in clone () from /lib64/libc.so.6

[...] 

Thread 1 (process 8441):
#0  0x0000003a23e0b122 in pthread_kill () from /lib64/libpthread.so.0
#1  0x0000000000f197c2 in my_write_core (sig=6) at stacktrace.c:326
#2  0x000000000074219e in handle_segfault (sig=6) at mysqld.cc:2712
#3  <signal handler called>
#4  0x0000003a23230155 in raise () from /lib64/libc.so.6
#5  0x0000003a23231bf0 in abort () from /lib64/libc.so.6
#6  0x0000000000f2ed11 in safe_mutex_lock (mp=0x7c5ab88, try_lock=0 '\0', 
    file=0x11c126e "mf_iocache.c", line=1724) at thr_mutex.c:159
#7  0x0000000000edd16f in inline_mysql_mutex_lock (that=0x7c5ab88, 
    src_file=0x11c126e "mf_iocache.c", src_line=1724)
    at ../include/mysql/psi/mysql_thread.h:519
#8  0x0000000000ee00fd in my_b_flush_io_cache (info=0x7c5ab28, 
    need_append_buffer_lock=1) at mf_iocache.c:1724
#9  0x0000000000a5734b in flush_master_info (mi=0x7c58f10, 
    flush_relay_log_cache=true) at rpl_mi.cc:422
#10 0x0000000000a3e1fa in handle_slave_io (arg=0x7c58f10) at slave.cc:2895
#11 0x0000000000ecf4b6 in pfs_spawn_thread (arg=0x2aaab0002cd8) at pfs.cc:1011
#12 0x0000003a23e062f7 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003a232d1b6d in clone () from /lib64/libc.so.6

Full stack trace at: http://tinyurl.com/y8mbwxk

It seems that there is some concurrent access to the IO_CACHE and
probably there should not.

Looking at the source code, I find that:

  1. Thread #3 closes and reopens the log file with LOCK_log
     taken. See log.cc:MYSQL_LOG_BIN::new_file_impl: 

    open(old_name, log_type, new_name_ptr,
         io_cache_type, no_auto_events, max_size, 1, FALSE);

  2. Thread #3, during closing and reopening of log file, it also
     closes and reopens the respective IO_CACHE.  Eventually, the
     IO_CACHE mutex is re-initialized (a call to mysql_mutex_init
     is performed down the stack - in
     mysys/mf_iocache.c:init_io_cache):

#ifdef THREAD
  mysql_mutex_init(key_IO_CACHE_append_buffer_lock,
                   &info->append_buffer_lock, MY_MUTEX_INIT_FAST);
#endif
  }
#if defined(SAFE_MUTEX) && defined(THREAD)
  else
  {
    /* Clear mutex so that safe_mutex will notice that it's not initialized */
    bzero((char*) &info->append_buffer_lock, sizeof(info));
  }
#endif

   3. During/After #1 and #2, Thread #1 tries to flush master
      info. Now, seems it does so without grabbing a lock on
      LOCK_log. This results in a race between the threads, to
      the IO_CACHE. This happens when IO thread calls
      flush_master_info after it has queued an event in the relay
      log (see slave.cc:handle_slave_io):

       if (flush_master_info(mi, 1))
       {
         sql_print_error("Failed to flush master info file");
         goto err;
       }
      
      Inside flush_master_info, there is the following piece of
      code:

      if (flush_relay_log_cache)
      {
        IO_CACHE *log_file= mi->rli.relay_log.get_log_file();
        if (flush_io_cache(log_file))
          DBUG_RETURN(2);
      }

      Now, if doing this without the mutex taken, then log_file
      mutex can be silently re-initialized (for instance, by
      Thread #3) while this thread is using it (for instance,
      while calling safe_mutex_lock).

      Before the stack trace above there is a message output by
      the server:
       
      "Got error 22 when trying to lock mutex at mf_iocache.c,
       line 1724"

      Error 22 means:
      EINVAL		22	/* Invalid argument */

      and for example, checking the manpages, for
      pthread_mutex_lock, EINVAL return value means:

      "EINVAL the mutex has not been properly initialized."

So seems that this is indeed a race between the IO thread and the
user thread that issues the FLUSH LOGS (and while rotating the
relay log).
[22 Jan 2010 10:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/97836

2967 Luis Soares	2010-01-22
      BUG#50364: FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic
      fails in PB sporadically)
      
      The IO thread can concurrently access the relay log IO_CACHE
      while another thread is performing an FLUSH LOGS procedure.
      
      FLUSH LOGS closes and reopens the relay log and while doing so it
      (re)initializes its IO_CACHE. During this procedure the IO_CACHE
      mutex is also reinitialized, which can cause problems if some
      other thread (namely the IO THREAD) is concurrently accessing it
      at the time .
      
      This patch fixes the problem by extending the interface of the
      flush_master_info function to also include a flag to state
      whether it should grab the LOCK_log or not before actually
      flushing the relay log. Also, IO thread now calls
      flush_master_info with the NEED_LOG_LOCK flag set when it flushes
      master info with FLUSH_RELAY_LOG_CACHE flag.
     @ sql/rpl_mi.cc
        Changed flush_master_info so that the flag NEED_LOG_LOCK is 
        checked before actually flushing the relay log IO_CACHE.
     @ sql/rpl_mi.h
        Changed the interface of flush_master_info so that it takes 
        an integer instead of a boolean because an extra flag was
        added: NEED_LOG_LOCK. This way, flags are packed in an integer,
        and can be decoded using bitwise operators.
     @ sql/slave.cc
        Made flush_master_info to be called with NEED_LOG_LOCK when
        IO Thread is flushing it without having the lock grabbed.
     @ sql/sql_repl.cc
        Changed flush_master_info call to have FLUSH_MI_EMPTY_FLAGS
        instead of 0 (it's the same thing, except that it is more
        compliant with the flags idea now).
[29 Jan 2010 0:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/98524

2968 Luis Soares	2010-01-29
      BUG#50364: FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic
      fails in PB sporadically)
      
      Incremental commit addressing reviews comments:
      
        - Increased loop time in rpl_heartbeat_basic test case.
        - Reverted bitwise flags and added a second parameter instead:
          need_lock_relay_log.
     @ mysql-test/suite/rpl/t/rpl_heartbeat_basic.test
        Doubled the number of iterations on the FLUSH LOGS loop by
        doubling the time available to perform all iterations.
     @ sql/rpl_mi.cc
        Updating flush_master_info call so that it uses two parameters
        instead of the originally proposed encoded flags.
     @ sql/rpl_mi.h
        Changed flush_master_info interface. Now takes a second parameter
        instead of just one. The second parameter is: need_lock_relay_log.
     @ sql/rpl_rli.cc
        Small fix in comment.
     @ sql/slave.cc
        Updating flush_master_info call so that it uses two parameters
        instead of the originally proposed encoded flags.
     @ sql/sql_repl.cc
        Updating flush_master_info call so that it uses two parameters
        instead of the originally proposed encoded flags.
[3 Feb 2010 16:56] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99121

2933 Luis Soares	2010-02-03
      BUG#50364: FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic
      fails in PB sporadically)
            
      The IO thread can concurrently access the relay log IO_CACHE
      while another thread is performing an FLUSH LOGS procedure.
            
      FLUSH LOGS closes and reopens the relay log and while doing so it
      (re)initializes its IO_CACHE. During this procedure the IO_CACHE
      mutex is also reinitialized, which can cause problems if some
      other thread (namely the IO THREAD) is concurrently accessing it
      at the time .
            
      This patch fixes the problem by extending the interface of the
      flush_master_info function to also include a second paramater, 
      "need_relay_log_lock", stating whether the thread should grab the 
      relay log lock or not before actually flushing the relay log. 
      Also, IO thread now calls flush_master_info with this flag set 
      when it flushes master info with in the event read_event loop.
      
      Finally, we also increase loop time in rpl_heartbeat_basic test 
      case, so that the number of calls to flush logs doubles, stressing
      this part of the code a little more.
     @ mysql-test/suite/rpl/t/rpl_heartbeat_basic.test
        Doubled the number of iterations on the FLUSH LOGS loop by
        doubling the time available to perform all iterations.
     @ sql/repl_failsafe.cc
        Updating flush_master_info call so that it uses two parameters
        instead of one.
     @ sql/rpl_mi.cc
        Updating flush_master_info call so that it uses two parameters
        instead of one.
     @ sql/rpl_mi.h
        Changed flush_master_info interface. Now takes a second parameter
        instead of just one. The second parameter is: need_lock_relay_log.
     @ sql/rpl_rli.cc
        Small fix in comment.
     @ sql/slave.cc
        Updating flush_master_info call so that it uses two parameters
        instead of one.
     @ sql/sql_repl.cc
        Updating flush_master_info call so that it uses two parameters
        instead of one.
[3 Feb 2010 17:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/99127

2981 Luis Soares	2010-02-03 [merge]
      BUG#50364: manual merge to mysql-next-mr-bugfixing.
                 
      Conflicts
      =========
      Text conflict in sql/repl_failsafe.cc
      
      Additional changes
      ==================
      Replace references to pthread_mutex with mysql_mutex
[12 Feb 2010 17:41] Bugs System
Pushed into 5.5.2-m2 (revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (version source revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (merge vers: 5.5.2-m2) (pib:16)
[13 Feb 2010 8:36] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100213083436-9pesg4h55w1mekxc) (version source revid:luis.soares@sun.com-20100211135109-t63avry9fqpgyh78) (merge vers: 6.0.14-alpha) (pib:16)
[13 Feb 2010 8:38] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100213083327-cee4ao3jpg33eggv) (version source revid:luis.soares@sun.com-20100211135018-1f9dbghg0itszigo) (pib:16)
[13 Feb 2010 10:21] Jon Stephens
Documented bugfix int he 5.5.2 and 6.0.14 changelogs as follows:

      FLUSH LOGS could in some circumstances crash the server. This 
      occurred because the I/O thread could concurrently access the 
      relay log I/O cache while another thread was performing the FLUSH 
      LOGS, which closes and reopens the relay log and, while doing so, 
      (re)initializes its I/O cache. This could cause problems if some 
      other thread (in this case, the I/O thread) is accessing it at the 
      same time.
      
      Now the thread performing the FLUSH LOGS takes a lock on the 
      relay log before actually flushing it.

Closed.
[22 Feb 2010 0:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101011

2982 Luis Soares	2010-02-22
      Post-push fix for BUG#50364.
      
      There was an erroneous parameter when calling flush_master_info
      from write_ignored_events_info_to_relay_log which could lead to a
      server crash. This happens because the I/O thread releases the
      log_lock before calling the flush_master_info.
      
      Set the function to call flush_master_info with third parameter
      to true, so that the mutex is properly taken.
[22 Feb 2010 0:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101012

2974 Luis Soares	2010-02-22 [merge]
      BUG#50364: Automerge from mysql-trunk-bugfixing.
[22 Feb 2010 0:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101013

2942 Luis Soares	2010-02-22 [merge]
      BUG#50364: Manual merge (empty) from mysql-next-mr-bugfixing.
      
      In 6.0 codebase, there was no issue.
[25 Feb 2010 19:47] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100225194420-p60r4u90tszj8q2x) (version source revid:luis.soares@sun.com-20100222003533-27k9jj9100kc6atv) (merge vers: 6.0.14-alpha) (pib:16)
[25 Feb 2010 19:47] Bugs System
Pushed into 5.5.3-m2 (revid:alik@sun.com-20100225194201-ybqxv77ofcxib3or) (version source revid:luis.soares@sun.com-20100222002629-0cijwqk6baxhj7gr) (merge vers: 5.5.3-m2) (pib:16)
[25 Feb 2010 19:49] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100225194305-h49uyjrlfl3mwo60) (version source revid:luis.soares@sun.com-20100222002746-a2t4cikxmml1h4cm) (pib:16)
[25 Feb 2010 20:51] Paul DuBois
Noted in 5.5.3, 6.0.14 changelogs.

On a replication slave, a race condition between the I/O thread and
FLUSH LOGS could crash the server.

Setting report to Need Merge pending push of Celosia to release tree.
[6 Mar 2010 10:51] Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:alik@sun.com-20100225195857-farb6yvy8x06bylj) (merge vers: 5.5.99-m3) (pib:16)
[17 Mar 2010 11:52] Luis Soares
See also: BUG#38240.
[19 Jul 2010 14:34] Bugs System
Pushed into 5.1.49 (revid:build@mysql.com-20100719143034-omcma40sblwmay3x) (version source revid:luis.soares@sun.com-20100628220900-yduk34aymksoboy1) (merge vers: 5.1.48) (pib:16)
[20 Jul 2010 2:25] Paul DuBois
Noted in 5.1.49 changelog.
[22 Jul 2010 13:40] Paul DuBois
The 5.1.49 push message was spurious. This fix was not pushed to 5.1.49. See Bug#53657, which is a similar fix, but pushed to 5.1.49.
[23 Jul 2010 12:21] Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:29] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[3 Aug 2010 13:19] Jon Stephens
Already documented for 5.5.3. Closed.
[14 Oct 2010 8:26] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:41] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:56] Bugs System
Pushed into mysql-5.1-telco-6.2 5.1.51-ndb-6.2.19 (revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 12:27] Jon Stephens
No new changelog entry required. Setting back to Closed state.