Bug #53657 | Slave crashed with error 22 when trying to lock mutex at mf_iocache.c, line 1722 | ||
---|---|---|---|
Submitted: | 14 May 2010 16:32 | Modified: | 14 Oct 2010 12:31 |
Reporter: | Elena Stepanova | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.1.47-debug | OS: | Any |
Assigned to: | Luis Soares | CPU Architecture: | Any |
[14 May 2010 16:32]
Elena Stepanova
[14 May 2010 16:42]
Elena Stepanova
all threads bt
Attachment: 53657_threads.out (application/octet-stream, text), 14.88 KiB.
[14 May 2010 16:42]
Elena Stepanova
all threads bt full
Attachment: 53657_threads_full.out (application/octet-stream, text), 32.99 KiB.
[24 May 2010 16:08]
Luis Soares
Analysis ======== Looking at the core dumps, I find that two threads are racing for the same mutex, one tries to lock it, after the other one has destroyed it. First, inspecting thread #1 stack: #4 0x0000003520c30155 in raise () from /lib64/libc.so.6 #5 0x0000003520c31bf0 in abort () from /lib64/libc.so.6 #6 0x00000000009f29b4 in safe_mutex_lock (mp=0x20a0ba70, try_lock=0 '\0', file=0xbcef2e "mf_iocache.c", line=1722) at thr_mutex.c:156 #7 0x00000000009c99a8 in my_b_flush_io_cache (info=0x20a0ba10, need_append_buffer_lock=1) at mf_iocache.c:1722 #8 0x00000000008155ac in flush_master_info (mi=0x20a09e40, flush_relay_log_cache=true) at rpl_mi.cc:361 #9 0x00000000008077c5 in handle_slave_io (arg=0x20a09e40) at slave.cc:2734 #10 0x00000035218062f7 in start_thread () from /lib64/libpthread.so.0 we find that it aborts in (mysys/thr_mutex.c:156): if (error || (error=pthread_mutex_lock(&mp->global))) { fprintf(stderr,"Got error %d when trying to lock mutex at %s, line %d\n", error, file, line); fflush(stderr); abort(); } The actual error code in safe_mutex_lock is 22 which is EINVAL: (gdb) p error $15 = 22 Inspecting stack from thread #2, #0 0x0000003520ccb417 in fdatasync () from /lib64/libc.so.6 #1 0x00000000009db92c in my_sync (fd=40, my_flags=16) at my_sync.c:62 #2 0x000000000071d615 in MYSQL_LOG::close (this=0x20a0b6c8, exiting=3) at log.cc:2050 #3 0x000000000071d924 in MYSQL_BIN_LOG::close (this=0x20a0b6c0, exiting=3) at log.cc:4897 #4 0x0000000000722dc7 in MYSQL_BIN_LOG::new_file_impl (this=0x20a0b6c0, need_lock=true) at log.cc:3814 #5 0x0000000000724848 in MYSQL_BIN_LOG::new_file (this=0x20a0b6c0) at log.cc:3723 #6 0x00000000007fdb47 in rotate_relay_log (mi=0x20a09e40) at slave.cc:4443 #7 0x0000000000650ae9 in reload_acl_and_cache (thd=0x20a22c60, options=2, tables=0x0, write_to_binlog=0x49192ecf) at sql_parse.cc:6865 #8 0x0000000000661b13 in dispatch_command (command=COM_REFRESH, thd=0x20a22c60, packet=0x20b27211 "\002", packet_length=1) at sql_parse.cc:1476 #9 0x00000000006625ed in do_command (thd=0x20a22c60) at sql_parse.cc:874 #10 0x000000000064e943 in handle_one_connection (arg=0x20a22c60) at sql_connect.cc:1134 #11 0x00000035218062f7 in start_thread () from /lib64/libpthread.so.0 we find that this thread is actually doing an fdatasync at MYSQL_LOG::close(uint exiting). Looking inside that member function, we find that before calling my_sync, thread #2 calls end_io_cache (&log_file). Now, inspecting the body of end_io_cache, we find that it can destroy the mutex associated with the IO_CACHE (info->append_buffer_lock), see mysys/mf_iocache.c:1840: if (info->type == SEQ_READ_APPEND) { /* Destroy allocated mutex */ info->type= TYPE_NOT_SET; #ifdef THREAD pthread_mutex_destroy(&info->append_buffer_lock); #endif } Looking at the crash we see that it happens in relay log's io_cache and searching the code we find that the cache is initialised with SEQ_READ_APPEND. So it all adds up and seems a race: one thread is destroyed a mutex and the other is trying to lock it before the mutex is reinitialised. Conclusions =========== C1. Thread #1 aborts with error #22 which maps to EINVAL. From pthread_mutex_lock manpages we find that EINVAL can be returned in the case that: "The value specified by mutex does not refer to an initialized mutex object." C2. Thread #2, is actually doing a my_sync of the relay log, as consequence of a rotate operation. Furthermore, my_sync follows end_io_cache, which inside destroys the info->append_buffer_lock. C3. Thread #1, calls LOCK_APPEND_BUFFER macro, which is ultimately translated to: pthread_mutex_lock(&(info)->append_buffer_lock) C4. Now, Given C2 and C3, it is somewhat clear that C1 (error= EINVAL) is a consequence of a race between thread #1 and thread #2. C5. From C4, we can conclude that this is in fact a manifestation of BUG#50364 in the 5.1 codebase.
[24 May 2010 16:44]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/109081 3395 Luis Soares 2010-05-24 BUG#53657: Slave crashed with error 22 when trying to lock mutex at mf_iocache.c, line 1722 The slave crashed while two threads: IO thread and user thread raced for the same mutex (the append_buffer_lock protecting the relay log's IO_CACHE). The IO thread was trying to flush the cache, and for that was grabbing the append_buffer_lock. However, the other thread was closing and reopening the relay log when the IO thread tried to lock. Closing and reopening the log includes destroying and reinitialising the IO_CACHE mutex. Therefore, the IO thread tried to lock a destroyed mutex. We fix this by backporting patch for BUG#50364 which fixed this bug in mysql server 5.5+. The patch deploys missing synchronization when flush_master_info is called and the relay log is flushed by the IO thread. In detail the patch backports revision (from mysql-trunk): - luis.soares@sun.com-20100203165617-b1yydr0ee24ycpjm This patch already includes the post-push fix also in BUG#50364: - luis.soares@sun.com-20100222002629-0cijwqk6baxhj7gr
[28 Jun 2010 22:45]
Luis Soares
Queued in mysql-5.1-bugteam and empty merged it to mysql-trunk-merge: - http://pb2.norway.sun.com/web.py?template=push_details&push=1377797 - http://pb2.norway.sun.com/web.py?template=push_details&push=1377800
[19 Jul 2010 14:37]
Bugs System
Pushed into 5.1.49 (revid:build@mysql.com-20100719143034-omcma40sblwmay3x) (version source revid:luis.soares@sun.com-20100628220900-yduk34aymksoboy1) (merge vers: 5.1.48) (pib:16)
[22 Jul 2010 13:25]
Jon Stephens
See BUG#50364, the fix for which also fixes this bug. This appears to be a duplicate.
[22 Jul 2010 13:41]
Paul DuBois
Noted in 5.1.49 changelog. See Bug#50364, which fixes a similar problem but for 5.5+ rather than for 5.1.x.
[23 Jul 2010 12:23]
Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:30]
Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[4 Aug 2010 12:24]
Jon Stephens
Closing without further action per previous comment from Paul (this appears to be a null merge of the 5.1 version of the fix).
[14 Oct 2010 8:28]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:43]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:58]
Bugs System
Pushed into mysql-5.1-telco-6.2 5.1.51-ndb-6.2.19 (revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 12:31]
Jon Stephens
No new changelog entry required. Setting back to Closed state.