Bug #50364 | FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic fails in PB sporadically) | ||
---|---|---|---|
Submitted: | 15 Jan 2010 12:49 | Modified: | 14 Oct 2010 12:27 |
Reporter: | Alexander Nozdrin | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | Betony (M2), Celosia (M3) | OS: | Any |
Assigned to: | Luis Soares | CPU Architecture: | Any |
Tags: | pb2, sporadic, test failure |
[15 Jan 2010 12:49]
Alexander Nozdrin
[15 Jan 2010 12:50]
Alexander Nozdrin
Log extract
Attachment: log (application/octet-stream, text), 22.89 KiB.
[17 Jan 2010 7:33]
Sveta Smirnova
Thank you for the report. Set to "Verified" as this documented in pushbuild. Not repeatable on my test box.
[22 Jan 2010 10:09]
Luis Soares
I have come accross this stack trace while checking out this failure: Got error 22 when trying to lock mutex at mf_iocache.c, line 1724 [...] Thread 3 (process 8088): #0 0x0000003a232cb417 in fdatasync () from /lib64/libc.so.6 #1 0x0000000000f02f5b in my_sync (fd=31, my_flags=16) at my_sync.c:62 #2 0x00000000008bb5cf in MYSQL_BIN_LOG::sync_purge_index_file (this=0x7c5a7d0) at log.cc:3498 #3 0x00000000008c2b2c in MYSQL_BIN_LOG::open (this=0x7c5a7d0, log_name=0x2aaab0005008 "slave-relay-bin", log_type_arg=LOG_BIN, new_name=0x4203a000 "./slave-relay-bin.000052", io_cache_type_arg=SEQ_READ_APPEND, no_auto_events_arg=false, max_size_arg=1073741824, null_created_arg=true, need_mutex=false) at log.cc:2683 #4 0x00000000008c473d in MYSQL_BIN_LOG::new_file_impl (this=0x7c5a7d0, need_lock=true) at log.cc:3943 #5 0x00000000008c7270 in MYSQL_BIN_LOG::new_file (this=0x7c5a7d0) at log.cc:3834 #6 0x0000000000a2c20b in rotate_relay_log (mi=0x7c58f10) at slave.cc:4724 #7 0x0000000000758a35 in reload_acl_and_cache (thd=0x7c66b78, options=2, tables=0x0, write_to_binlog=0x4203afff) at sql_parse.cc:6872 #8 0x000000000076acac in mysql_execute_command (thd=0x7c66b78) at sql_parse.cc:3879 #9 0x0000000000770841 in mysql_parse (thd=0x7c66b78, inBuf=0x7ca77d8 "FLUSH LOGS", length=10, found_semicolon=0x4203beb0) at sql_parse.cc:5987 #10 0x0000000000772776 in dispatch_command (command=COM_QUERY, thd=0x7c66b78, packet=0x7c9f749 "FLUSH LOGS", packet_length=10) at sql_parse.cc:1140 #11 0x0000000000774811 in do_command (thd=0x7c66b78) at sql_parse.cc:810 #12 0x00000000007548dc in do_handle_one_connection (thd_arg=0x7c66b78) at sql_connect.cc:1172 #13 0x0000000000754a9b in handle_one_connection (arg=0x7c66b78) at sql_connect.cc:1112 #14 0x0000000000ecf4b6 in pfs_spawn_thread (arg=0x7c82be8) at pfs.cc:1011 #15 0x0000003a23e062f7 in start_thread () from /lib64/libpthread.so.0 #16 0x0000003a232d1b6d in clone () from /lib64/libc.so.6 [...] Thread 1 (process 8441): #0 0x0000003a23e0b122 in pthread_kill () from /lib64/libpthread.so.0 #1 0x0000000000f197c2 in my_write_core (sig=6) at stacktrace.c:326 #2 0x000000000074219e in handle_segfault (sig=6) at mysqld.cc:2712 #3 <signal handler called> #4 0x0000003a23230155 in raise () from /lib64/libc.so.6 #5 0x0000003a23231bf0 in abort () from /lib64/libc.so.6 #6 0x0000000000f2ed11 in safe_mutex_lock (mp=0x7c5ab88, try_lock=0 '\0', file=0x11c126e "mf_iocache.c", line=1724) at thr_mutex.c:159 #7 0x0000000000edd16f in inline_mysql_mutex_lock (that=0x7c5ab88, src_file=0x11c126e "mf_iocache.c", src_line=1724) at ../include/mysql/psi/mysql_thread.h:519 #8 0x0000000000ee00fd in my_b_flush_io_cache (info=0x7c5ab28, need_append_buffer_lock=1) at mf_iocache.c:1724 #9 0x0000000000a5734b in flush_master_info (mi=0x7c58f10, flush_relay_log_cache=true) at rpl_mi.cc:422 #10 0x0000000000a3e1fa in handle_slave_io (arg=0x7c58f10) at slave.cc:2895 #11 0x0000000000ecf4b6 in pfs_spawn_thread (arg=0x2aaab0002cd8) at pfs.cc:1011 #12 0x0000003a23e062f7 in start_thread () from /lib64/libpthread.so.0 #13 0x0000003a232d1b6d in clone () from /lib64/libc.so.6 Full stack trace at: http://tinyurl.com/y8mbwxk It seems that there is some concurrent access to the IO_CACHE and probably there should not. Looking at the source code, I find that: 1. Thread #3 closes and reopens the log file with LOCK_log taken. See log.cc:MYSQL_LOG_BIN::new_file_impl: open(old_name, log_type, new_name_ptr, io_cache_type, no_auto_events, max_size, 1, FALSE); 2. Thread #3, during closing and reopening of log file, it also closes and reopens the respective IO_CACHE. Eventually, the IO_CACHE mutex is re-initialized (a call to mysql_mutex_init is performed down the stack - in mysys/mf_iocache.c:init_io_cache): #ifdef THREAD mysql_mutex_init(key_IO_CACHE_append_buffer_lock, &info->append_buffer_lock, MY_MUTEX_INIT_FAST); #endif } #if defined(SAFE_MUTEX) && defined(THREAD) else { /* Clear mutex so that safe_mutex will notice that it's not initialized */ bzero((char*) &info->append_buffer_lock, sizeof(info)); } #endif 3. During/After #1 and #2, Thread #1 tries to flush master info. Now, seems it does so without grabbing a lock on LOCK_log. This results in a race between the threads, to the IO_CACHE. This happens when IO thread calls flush_master_info after it has queued an event in the relay log (see slave.cc:handle_slave_io): if (flush_master_info(mi, 1)) { sql_print_error("Failed to flush master info file"); goto err; } Inside flush_master_info, there is the following piece of code: if (flush_relay_log_cache) { IO_CACHE *log_file= mi->rli.relay_log.get_log_file(); if (flush_io_cache(log_file)) DBUG_RETURN(2); } Now, if doing this without the mutex taken, then log_file mutex can be silently re-initialized (for instance, by Thread #3) while this thread is using it (for instance, while calling safe_mutex_lock). Before the stack trace above there is a message output by the server: "Got error 22 when trying to lock mutex at mf_iocache.c, line 1724" Error 22 means: EINVAL 22 /* Invalid argument */ and for example, checking the manpages, for pthread_mutex_lock, EINVAL return value means: "EINVAL the mutex has not been properly initialized." So seems that this is indeed a race between the IO thread and the user thread that issues the FLUSH LOGS (and while rotating the relay log).
[22 Jan 2010 10:30]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/97836 2967 Luis Soares 2010-01-22 BUG#50364: FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic fails in PB sporadically) The IO thread can concurrently access the relay log IO_CACHE while another thread is performing an FLUSH LOGS procedure. FLUSH LOGS closes and reopens the relay log and while doing so it (re)initializes its IO_CACHE. During this procedure the IO_CACHE mutex is also reinitialized, which can cause problems if some other thread (namely the IO THREAD) is concurrently accessing it at the time . This patch fixes the problem by extending the interface of the flush_master_info function to also include a flag to state whether it should grab the LOCK_log or not before actually flushing the relay log. Also, IO thread now calls flush_master_info with the NEED_LOG_LOCK flag set when it flushes master info with FLUSH_RELAY_LOG_CACHE flag. @ sql/rpl_mi.cc Changed flush_master_info so that the flag NEED_LOG_LOCK is checked before actually flushing the relay log IO_CACHE. @ sql/rpl_mi.h Changed the interface of flush_master_info so that it takes an integer instead of a boolean because an extra flag was added: NEED_LOG_LOCK. This way, flags are packed in an integer, and can be decoded using bitwise operators. @ sql/slave.cc Made flush_master_info to be called with NEED_LOG_LOCK when IO Thread is flushing it without having the lock grabbed. @ sql/sql_repl.cc Changed flush_master_info call to have FLUSH_MI_EMPTY_FLAGS instead of 0 (it's the same thing, except that it is more compliant with the flags idea now).
[29 Jan 2010 0:12]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/98524 2968 Luis Soares 2010-01-29 BUG#50364: FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic fails in PB sporadically) Incremental commit addressing reviews comments: - Increased loop time in rpl_heartbeat_basic test case. - Reverted bitwise flags and added a second parameter instead: need_lock_relay_log. @ mysql-test/suite/rpl/t/rpl_heartbeat_basic.test Doubled the number of iterations on the FLUSH LOGS loop by doubling the time available to perform all iterations. @ sql/rpl_mi.cc Updating flush_master_info call so that it uses two parameters instead of the originally proposed encoded flags. @ sql/rpl_mi.h Changed flush_master_info interface. Now takes a second parameter instead of just one. The second parameter is: need_lock_relay_log. @ sql/rpl_rli.cc Small fix in comment. @ sql/slave.cc Updating flush_master_info call so that it uses two parameters instead of the originally proposed encoded flags. @ sql/sql_repl.cc Updating flush_master_info call so that it uses two parameters instead of the originally proposed encoded flags.
[3 Feb 2010 16:56]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/99121 2933 Luis Soares 2010-02-03 BUG#50364: FLUSH LOGS crashes the server (rpl.rpl_heartbeat_basic fails in PB sporadically) The IO thread can concurrently access the relay log IO_CACHE while another thread is performing an FLUSH LOGS procedure. FLUSH LOGS closes and reopens the relay log and while doing so it (re)initializes its IO_CACHE. During this procedure the IO_CACHE mutex is also reinitialized, which can cause problems if some other thread (namely the IO THREAD) is concurrently accessing it at the time . This patch fixes the problem by extending the interface of the flush_master_info function to also include a second paramater, "need_relay_log_lock", stating whether the thread should grab the relay log lock or not before actually flushing the relay log. Also, IO thread now calls flush_master_info with this flag set when it flushes master info with in the event read_event loop. Finally, we also increase loop time in rpl_heartbeat_basic test case, so that the number of calls to flush logs doubles, stressing this part of the code a little more. @ mysql-test/suite/rpl/t/rpl_heartbeat_basic.test Doubled the number of iterations on the FLUSH LOGS loop by doubling the time available to perform all iterations. @ sql/repl_failsafe.cc Updating flush_master_info call so that it uses two parameters instead of one. @ sql/rpl_mi.cc Updating flush_master_info call so that it uses two parameters instead of one. @ sql/rpl_mi.h Changed flush_master_info interface. Now takes a second parameter instead of just one. The second parameter is: need_lock_relay_log. @ sql/rpl_rli.cc Small fix in comment. @ sql/slave.cc Updating flush_master_info call so that it uses two parameters instead of one. @ sql/sql_repl.cc Updating flush_master_info call so that it uses two parameters instead of one.
[3 Feb 2010 17:21]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/99127 2981 Luis Soares 2010-02-03 [merge] BUG#50364: manual merge to mysql-next-mr-bugfixing. Conflicts ========= Text conflict in sql/repl_failsafe.cc Additional changes ================== Replace references to pthread_mutex with mysql_mutex
[12 Feb 2010 17:41]
Bugs System
Pushed into 5.5.2-m2 (revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (version source revid:joerg@mysql.com-20100212164100-jnurxdw5z88m472s) (merge vers: 5.5.2-m2) (pib:16)
[13 Feb 2010 8:36]
Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100213083436-9pesg4h55w1mekxc) (version source revid:luis.soares@sun.com-20100211135109-t63avry9fqpgyh78) (merge vers: 6.0.14-alpha) (pib:16)
[13 Feb 2010 8:38]
Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100213083327-cee4ao3jpg33eggv) (version source revid:luis.soares@sun.com-20100211135018-1f9dbghg0itszigo) (pib:16)
[13 Feb 2010 10:21]
Jon Stephens
Documented bugfix int he 5.5.2 and 6.0.14 changelogs as follows: FLUSH LOGS could in some circumstances crash the server. This occurred because the I/O thread could concurrently access the relay log I/O cache while another thread was performing the FLUSH LOGS, which closes and reopens the relay log and, while doing so, (re)initializes its I/O cache. This could cause problems if some other thread (in this case, the I/O thread) is accessing it at the same time. Now the thread performing the FLUSH LOGS takes a lock on the relay log before actually flushing it. Closed.
[22 Feb 2010 0:17]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101011 2982 Luis Soares 2010-02-22 Post-push fix for BUG#50364. There was an erroneous parameter when calling flush_master_info from write_ignored_events_info_to_relay_log which could lead to a server crash. This happens because the I/O thread releases the log_lock before calling the flush_master_info. Set the function to call flush_master_info with third parameter to true, so that the mutex is properly taken.
[22 Feb 2010 0:20]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101012 2974 Luis Soares 2010-02-22 [merge] BUG#50364: Automerge from mysql-trunk-bugfixing.
[22 Feb 2010 0:25]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/101013 2942 Luis Soares 2010-02-22 [merge] BUG#50364: Manual merge (empty) from mysql-next-mr-bugfixing. In 6.0 codebase, there was no issue.
[25 Feb 2010 19:47]
Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20100225194420-p60r4u90tszj8q2x) (version source revid:luis.soares@sun.com-20100222003533-27k9jj9100kc6atv) (merge vers: 6.0.14-alpha) (pib:16)
[25 Feb 2010 19:47]
Bugs System
Pushed into 5.5.3-m2 (revid:alik@sun.com-20100225194201-ybqxv77ofcxib3or) (version source revid:luis.soares@sun.com-20100222002629-0cijwqk6baxhj7gr) (merge vers: 5.5.3-m2) (pib:16)
[25 Feb 2010 19:49]
Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100225194305-h49uyjrlfl3mwo60) (version source revid:luis.soares@sun.com-20100222002746-a2t4cikxmml1h4cm) (pib:16)
[25 Feb 2010 20:51]
Paul DuBois
Noted in 5.5.3, 6.0.14 changelogs. On a replication slave, a race condition between the I/O thread and FLUSH LOGS could crash the server. Setting report to Need Merge pending push of Celosia to release tree.
[6 Mar 2010 10:51]
Bugs System
Pushed into 5.5.3-m3 (revid:alik@sun.com-20100306103849-hha31z2enhh7jwt3) (version source revid:alik@sun.com-20100225195857-farb6yvy8x06bylj) (merge vers: 5.5.99-m3) (pib:16)
[17 Mar 2010 11:52]
Luis Soares
See also: BUG#38240.
[19 Jul 2010 14:34]
Bugs System
Pushed into 5.1.49 (revid:build@mysql.com-20100719143034-omcma40sblwmay3x) (version source revid:luis.soares@sun.com-20100628220900-yduk34aymksoboy1) (merge vers: 5.1.48) (pib:16)
[20 Jul 2010 2:25]
Paul DuBois
Noted in 5.1.49 changelog.
[22 Jul 2010 13:40]
Paul DuBois
The 5.1.49 push message was spurious. This fix was not pushed to 5.1.49. See Bug#53657, which is a similar fix, but pushed to 5.1.49.
[23 Jul 2010 12:21]
Bugs System
Pushed into mysql-trunk 5.5.6-m3 (revid:alik@sun.com-20100723121820-jryu2fuw3pc53q9w) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:18)
[23 Jul 2010 12:29]
Bugs System
Pushed into mysql-next-mr (revid:alik@sun.com-20100723121929-90e9zemk3jkr2ocy) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (pib:18)
[3 Aug 2010 13:19]
Jon Stephens
Already documented for 5.5.3. Closed.
[14 Oct 2010 8:26]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:martin.skold@mysql.com-20101014082627-jrmy9xbfbtrebw3c) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:41]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:martin.skold@mysql.com-20101014083757-5qo48b86d69zjvzj) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 8:56]
Bugs System
Pushed into mysql-5.1-telco-6.2 5.1.51-ndb-6.2.19 (revid:martin.skold@mysql.com-20101014084420-y54ecj85j5we27oa) (version source revid:vasil.dimov@oracle.com-20100531152341-x2d4hma644icamh1) (merge vers: 5.5.5-m3) (pib:21)
[14 Oct 2010 12:27]
Jon Stephens
No new changelog entry required. Setting back to Closed state.