Bug #53657 Slave crashed with error 22 when trying to lock mutex at mf_iocache.c, line 1722
Submitted: 14 May 2010 16:32 Modified: 14 Oct 2010 12:31
Reporter: Elena Stepanova Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.1.47-debug OS:Any
Assigned to: Luis Soares CPU Architecture:Any
Triage: Triaged: D1 (Critical) / R2 (Low) / E1 (None/Negligible)

[14 May 2010 16:32] Elena Stepanova
Description:
#5  0x0000003520c31bf0 in abort () from /lib64/libc.so.6
#6  0x00000000009f29b4 in safe_mutex_lock (mp=0x20a0ba70, try_lock=0 '\0', file=0xbcef2e "mf_iocache.c", line=1722) at thr_mutex.c:156
#7  0x00000000009c99a8 in my_b_flush_io_cache (info=0x20a0ba10, need_append_buffer_lock=1) at mf_iocache.c:1722
#8  0x00000000008155ac in flush_master_info (mi=0x20a09e40, flush_relay_log_cache=true) at rpl_mi.cc:361
#9  0x00000000008077c5 in handle_slave_io (arg=0x20a09e40) at slave.cc:2734
#10 0x00000035218062f7 in start_thread () from /lib64/libpthread.so.0
#11 0x0000003520cd1b6d in clone () from /lib64/libc.so.6

Backtrace for all threads will be attached

There was a bug#50364 which might be related, but it is closed by now, and there is no indication the bugfix was pushed into 5.1

How to repeat:
No repeatable test case yet, the crash happened in SysQA PB2 tests.
[14 May 2010 16:42] Elena Stepanova
all threads bt

Attachment: 53657_threads.out (application/octet-stream, text), 14.88 KiB.

[14 May 2010 16:42] Elena Stepanova
all threads bt full

Attachment: 53657_threads_full.out (application/octet-stream, text), 32.99 KiB.

[24 May 2010 16:08] Luis Soares
Analysis
========

  Looking at the core dumps, I find that two threads are racing
for the same mutex, one tries to lock it, after the other one has
destroyed it.

  First, inspecting thread #1 stack:

  #4  0x0000003520c30155 in raise () from /lib64/libc.so.6
  #5  0x0000003520c31bf0 in abort () from /lib64/libc.so.6
  #6  0x00000000009f29b4 in safe_mutex_lock (mp=0x20a0ba70, try_lock=0 '\0', file=0xbcef2e "mf_iocache.c", line=1722) at thr_mutex.c:156
  #7  0x00000000009c99a8 in my_b_flush_io_cache (info=0x20a0ba10, need_append_buffer_lock=1) at mf_iocache.c:1722
  #8  0x00000000008155ac in flush_master_info (mi=0x20a09e40, flush_relay_log_cache=true) at rpl_mi.cc:361
  #9  0x00000000008077c5 in handle_slave_io (arg=0x20a09e40) at slave.cc:2734
  #10 0x00000035218062f7 in start_thread () from /lib64/libpthread.so.0

  we find that it aborts in (mysys/thr_mutex.c:156):

  if (error || (error=pthread_mutex_lock(&mp->global)))
  {
    fprintf(stderr,"Got error %d when trying to lock mutex at %s, line %d\n",
            error, file, line);
    fflush(stderr);
    abort();
  }

  The actual error code in safe_mutex_lock is 22 which is
  EINVAL:

  (gdb) p error
  $15 = 22

  Inspecting stack from thread #2, 

  #0  0x0000003520ccb417 in fdatasync () from /lib64/libc.so.6
  #1  0x00000000009db92c in my_sync (fd=40, my_flags=16) at my_sync.c:62
  #2  0x000000000071d615 in MYSQL_LOG::close (this=0x20a0b6c8, exiting=3) at log.cc:2050
  #3  0x000000000071d924 in MYSQL_BIN_LOG::close (this=0x20a0b6c0, exiting=3) at log.cc:4897
  #4  0x0000000000722dc7 in MYSQL_BIN_LOG::new_file_impl (this=0x20a0b6c0, need_lock=true) at log.cc:3814
  #5  0x0000000000724848 in MYSQL_BIN_LOG::new_file (this=0x20a0b6c0) at log.cc:3723
  #6  0x00000000007fdb47 in rotate_relay_log (mi=0x20a09e40) at slave.cc:4443
  #7  0x0000000000650ae9 in reload_acl_and_cache (thd=0x20a22c60, options=2, tables=0x0, write_to_binlog=0x49192ecf) at sql_parse.cc:6865
  #8  0x0000000000661b13 in dispatch_command (command=COM_REFRESH, thd=0x20a22c60, packet=0x20b27211 "\002", packet_length=1) at sql_parse.cc:1476
  #9  0x00000000006625ed in do_command (thd=0x20a22c60) at sql_parse.cc:874
  #10 0x000000000064e943 in handle_one_connection (arg=0x20a22c60) at sql_connect.cc:1134
  #11 0x00000035218062f7 in start_thread () from /lib64/libpthread.so.0

  we find that this thread is actually doing an fdatasync at
  MYSQL_LOG::close(uint exiting). Looking inside that member
  function, we find that before calling my_sync, thread #2 calls
  end_io_cache (&log_file). Now, inspecting the body of
  end_io_cache, we find that it can destroy the mutex associated
  with the IO_CACHE (info->append_buffer_lock), see
  mysys/mf_iocache.c:1840:

    if (info->type == SEQ_READ_APPEND)
    {
      /* Destroy allocated mutex */
      info->type= TYPE_NOT_SET;
  #ifdef THREAD
      pthread_mutex_destroy(&info->append_buffer_lock);
  #endif
    }

   Looking at the crash we see that it happens in relay log's
   io_cache and searching the code we find that the cache is
   initialised with SEQ_READ_APPEND. So it all adds up and seems
   a race: one thread is destroyed a mutex and the other is
   trying to lock it before the mutex is reinitialised.

Conclusions
===========

  C1. Thread #1 aborts with error #22 which maps to EINVAL. From
      pthread_mutex_lock manpages we find that EINVAL can be
      returned in the case that: "The value specified by mutex
      does not refer to an initialized mutex object."

  C2. Thread #2, is actually doing a my_sync of the relay log, as
      consequence of a rotate operation. Furthermore, my_sync
      follows end_io_cache, which inside destroys the
      info->append_buffer_lock.

  C3. Thread #1, calls LOCK_APPEND_BUFFER macro, which is
      ultimately translated to:
      
      pthread_mutex_lock(&(info)->append_buffer_lock)

  C4. Now, Given C2 and C3, it is somewhat clear that C1 (error=
      EINVAL) is a consequence of a race between thread #1 and
      thread #2.

  C5. From C4, we can conclude that this is in fact a
      manifestation of BUG#50364 in the 5.1 codebase.
[24 May 2010 16:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/109081

3395 Luis Soares	2010-05-24
      BUG#53657: Slave crashed with error 22 when trying to lock mutex
                 at mf_iocache.c, line 1722
      
      The slave crashed while two threads: IO thread and user thread
      raced for the same mutex (the append_buffer_lock protecting the
      relay log's IO_CACHE). The IO thread was trying to flush the
      cache, and for that was grabbing the append_buffer_lock. 
      
      However, the other thread was closing and reopening the relay log
      when the IO thread tried to lock. Closing and reopening the log
      includes destroying and reinitialising the IO_CACHE
      mutex. Therefore, the IO thread tried to lock a destroyed mutex.
      
      We fix this by backporting patch for BUG#50364 which fixed this
      bug in mysql server 5.5+. The patch deploys missing
      synchronization when flush_master_info is called and the relay
      log is flushed by the IO thread. In detail the patch backports
      revision (from mysql-trunk):
      - luis.soares@sun.com-20100203165617-b1yydr0ee24ycpjm
      
      This patch already includes the post-push fix also in BUG#50364:
      - luis.soares@sun.com-20100222002629-0cijwqk6baxhj7gr
[28 Jun 2010 22:45] Luis Soares
Queued in mysql-5.1-bugteam and empty merged it to mysql-trunk-merge:
- http://pb2.norway.sun.com/web.py?template=push_details&push=1377797
- http://pb2.norway.sun.com/web.py?template=push_details&push=1377800
[19 Jul 2010 14:37] Bugs System
Pushed into 5.1.49 (revid:build@mysql.com-20100719143034-omcma40sblwmay3x) (version source revid:luis.soares@sun.com-20100628220900-yduk34aymksoboy1) (merge vers: 5.1.48) (pib:16)
[22 Jul 2010 13:25] Jon Stephens
See BUG#50364, the fix for which also fixes this bug. This appears to be a duplicate.
[22 Jul 2010 13:41] Paul Dubois
Noted in 5.1.49 changelog.

See Bug#50364, which fixes a similar problem but for 5.5+ rather than for 5.1.x.
[23 Jul 2010 12:23] Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:30] Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[4 Aug 2010 12:24] Jon Stephens
Closing without further action per previous comment from Paul (this appears to be a null merge of the 5.1 version of the fix).
[14 Oct 2010 8:28] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:43] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:58] Bugs System
Pushed into mysql-5.1-telco-6.2 5.1.51-ndb-6.2.19 (revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 12:31] Jon Stephens
No new changelog entry required. Setting back to Closed state.