Bug #68460 | blocked with FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS | ||
---|---|---|---|
Submitted: | 22 Feb 2013 6:37 | Modified: | 31 May 2013 22:02 |
Reporter: | Queen Aaa | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
Version: | 5.6.10 | OS: | Linux (CentOS release 5.8 (Final)) |
Assigned to: | CPU Architecture: | Any |
[22 Feb 2013 6:37]
Queen Aaa
[22 Feb 2013 9:03]
Valeriy Kravchuk
This is easy to repeat with MTR using the following simple test: [openxs@chief mysql-test]$ cat t/bug68460.test --source include/master-slave.inc connection master; create table t(id int); --sync_slave_with_master connection slave; flush tables with read lock; connection master; insert into test.t set id=4; connection slave; show slave status; connection master; drop table t; --sync_slave_with_master This is what one gets when run it: ... 2013-02-22 10:54:20 2665 [Note] Shutting down plugin 'CSV' MySQL Version 5.6.10 Checking supported features... - SSL connections supported Collecting tests... Checking leftover processes... Removing old var directory... Creating var directory '/home/openxs/dbs/5.6/mysql-test/var'... Installing system database... Using server port 59406 ============================================================================== TEST RESULT TIME (ms) or COMMENT -------------------------------------------------------------------------- worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009 main.bug68460 [ fail ] Test ended at 2013-02-22 10:59:30 CURRENT_TEST: main.bug68460 === SHOW MASTER STATUS === ---- 1. ---- File slave-bin.000001 Position 216 Binlog_Do_DB Binlog_Ignore_DB Executed_Gtid_Set ========================== === SHOW SLAVE STATUS === ---- 1. ---- Slave_IO_State Waiting for master to send event Master_Host 127.0.0.1 Master_User root Master_Port 13000 Connect_Retry 1 Master_Log_File master-bin.000001 Read_Master_Log_Pos 590 Relay_Log_File slave-relay-bin.000002 Relay_Log_Pos 380 Relay_Master_Log_File master-bin.000001 Slave_IO_Running Yes Slave_SQL_Running Yes Replicate_Do_DB Replicate_Ignore_DB Replicate_Do_Table Replicate_Ignore_Table Replicate_Wild_Do_Table Replicate_Wild_Ignore_Table Last_Errno 0 Last_Error Skip_Counter 0 Exec_Master_Log_Pos 216 Relay_Log_Space 927 Until_Condition None Until_Log_File Until_Log_Pos 0 Master_SSL_Allowed No Master_SSL_CA_File Master_SSL_CA_Path Master_SSL_Cert Master_SSL_Cipher Master_SSL_Key Seconds_Behind_Master 300 Master_SSL_Verify_Server_Cert No Last_IO_Errno 0 Last_IO_Error Last_SQL_Errno 0 Last_SQL_Error Replicate_Ignore_Server_Ids Master_Server_Id 1 Master_UUID 76f0ac4e-7ccd-11e2-ab57-485b3952834f Master_Info_File /home/openxs/dbs/5.6/mysql-test/var/mysqld.2/data/master .info SQL_Delay 0 SQL_Remaining_Delay Slave_SQL_Running_State Waiting for global read lock Master_Retry_Count 10 Master_Bind Last_IO_Error_Timestamp Last_SQL_Error_Timestamp Master_SSL_Crl Master_SSL_Crlpath Retrieved_Gtid_Set Executed_Gtid_Set Auto_Position 0 ========================= === SHOW PROCESSLIST === ---- 1. ---- Id 2 User root Host localhost:49873 db test Command Sleep Time 300 State Info ---- 2. ---- Id 3 User root Host localhost:49874 db test Command Sleep Time 302 State Info ---- 3. ---- Id 4 User system user Host db Command Connect Time 301 State Waiting for master to send event Info ---- 4. ---- Id 5 User system user Host db test Command Connect Time 300 State Waiting for global read lock Info insert into test.t set id=4 ---- 5. ---- Id 6 User root Host localhost:49880 db test Command Query Time 0 State init Info SHOW PROCESSLIST ---- 6. ---- Id 7 User root Host localhost:49881 db test Command Sleep Time 300 State Info ======================== analyze: sync_with_master mysqltest: At line 20: sync_slave_with_master failed: 'select master_pos_wait('m aster-bin.000001', 590, 300)' returned -1 indicating timeout after 300 seconds ...
[22 Feb 2013 9:45]
MySQL Verification Team
Looks like http://bugs.mysql.com/bug.php?id=67879 fixed in 5.7+ ?
[22 Feb 2013 10:18]
Valeriy Kravchuk
Hm, my test case is probably wrong. It hangs upon sync with master because INSERT can not proceed. SHOW SLAVE STATUS does not hand. Please, send your my.cnf file content from slave.
[22 Feb 2013 11:23]
Valeriy Kravchuk
Now, proper test case. From what I see, we have to use tables to store replication info to repeat this. So, here is the test case: [openxs@chief mysql-test]$ cat t/bug68460.test --source include/master-slave.inc connection master; create database test2; create table test.t(id int) engine=InnoDB; create table test2.t(id int) engine=InnoDB; --sync_slave_with_master connection slave; stop slave; set global slave_parallel_workers=2; set global master_info_repository='TABLE'; set global sync_master_info=1; start slave; flush tables with read lock; connection master; insert into test.t set id=4; insert into test2.t set id=3; connection slave; show slave status; show processlist; This is how to run it: ./mtr --mysqld=--relay_log_info_repository=TABLE bug68460 Then just connect to hanging slave and check processlist: [openxs@chief ~]$ mysql -uroot --host=127.0.0.1 --port=13001 Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 12 Server version: 5.6.10-log MySQL Community Server (GPL) Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved. Oracle is a registered trademark of Oracle Corporation and/or its affiliates. Other names may be trademarks of their respective owners. Type 'help;' or '\h' for help. Type '\c' to clear the current input statement. mysql> show processlist; +----+-------------+-----------------+------+---------+------+---------------------------------------+-------------------+ | Id | User | Host | db | Command | Time | State | Info | +----+-------------+-----------------+------+---------+------+---------------------------------------+-------------------+ | 2 | root | localhost:51048 | test | Sleep | 322 | | NULL | | 3 | root | localhost:51049 | test | Sleep | 323 | | NULL | | 6 | root | localhost:51055 | test | Query | 320 | init | show slave status | | 7 | root | localhost:51056 | test | Sleep | 322 | | NULL | | 8 | system user | | NULL | Connect | 321 | Waiting for commit lock | NULL | | 9 | system user | | NULL | Connect | 321 | Reading event from the relay log | NULL | | 10 | system user | | NULL | Connect | 321 | Waiting for an event from Coordinator | NULL | | 11 | system user | | NULL | Connect | 321 | Waiting for an event from Coordinator | NULL | | 12 | root | localhost:51078 | NULL | Query | 0 | init | show processlist | +----+-------------+-----------------+------+---------+------+---------------------------------------+-------------------+ 9 rows in set (0.00 sec) mtr still hangs...
[22 Feb 2013 12:13]
MySQL Verification Team
Hello Queen Aaa, Thank you for the report. Verified as described on reported version. Thanks, Umesh
[22 Mar 2013 15:35]
Mirko Conte
Deadlock can be resolved doing "unlock tables" or closing session, but if a "flush logs" is done after "flush tables with read lock", the mysqld slave server process must be killed. **When doing a mysqldump with --single-transaction and --flush-logs options and some replication traffic, often the replication threads get locked, the server must be killed and as a result the slave instance misses one GTID after restart.**
[31 May 2013 22:02]
Daniel So
Added changelog entry to Reference Manual versions 5.6 and 5.7: 'After the client thread on a slave performed a FLUSH TABLES WITH READ LOCK and was followed by some updates on the master, the slave hung when executing SHOW SLAVE STATUS.'
[18 Jun 2013 19:51]
Sveta Smirnova
Bug #69493 was marked as duplicate of this one.
[18 Dec 2013 9:56]
Laurynas Biveinis
5.6$ bzr log -r 5158 ------------------------------------------------------------ revno: 5158 committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com> branch nick: Bug16387720_mysql-5.6 timestamp: Fri 2013-05-24 06:23:14 +0530 message: Bug#16387720: BLOCKED WITH FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS Problem: ======= If a client thread on a slave does FLUSH TABLES WITH READ LOCK; then master does some updates, SHOW SLAVE STATUS in the same client will be blocked. If the blocked slave server is killed and restarted when GTID's are enabled, one GTID gets missed leaving the slave in an out of sync state. Using 'relay-log-recovery=1' should do a crash safe slave recovery but it is not happening. Analysis: ======== RPL info tables "slave_master_info" and "slave_relay_log_info" are used to store slave's thread's positions. When a FTWRL is issued on slave this command blocks all operations on tables. Hence when some update operations are done on master the slave thread will be able to open the tables but while closing, an internal commit operation will be blocked. These blocked info tables make the "SHOW SLAVE STATUS" to hang. When "relay-log-recovery=1" during crash recovery all the partial written events will be discarded the master info table is initialised with the information read from relay log info table. The "Retrieved GTID" set should be cleared so that partial read events are discarded and they are fetched once again. Since this is not happening "The Retried GTID" is considered to be executed and the actual transaction is skipped. Fix: === Info tables should be made independent of global read lock. At the time of opening the RPL info tables the following "MYSQL_OPEN_IGNORE_GLOBAL_READ_LOCK" flag is set to make the info tables not to block when the FTWRL is in progress. Hence a similar flag is introduced in "ha_commit_trans" code which will allow commit to complete even if a global read lock is active. This flag can be used to allow changes to internal tables (e.g. slave status tables). To fix the missing GTID issue using "relay-log-recovery" option, during recovery process retrieved GTID set is cleared.
[10 Feb 2014 11:26]
James Heggs
Also be advised that the Percona hotbackup tool will issue a 'FLUSH TABLES....' if you do not specify the "--no-lock" option. Be very careful if you are using a tool, such as nagios to check your slave status whilst taking hot backups.
[14 Apr 2014 21:25]
Simon Mudd
Note: this bug is closed but it's not very clear form the reference to the bzr log number at what version this was actually fixed. If possible could you confirm in what versions of MySQL 5.6 this has been resolved? Thanks.
[14 Apr 2014 21:54]
MySQL Verification Team
Fixed in 5.6.13 and 5.7.2 http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-13.html http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-2.html Replication: After the client thread on a slave performed a FLUSH TABLES WITH READ LOCK and was followed by some updates on the master, the slave hung when executing SHOW SLAVE STATUS. (Bug #68460, Bug #16387720)
[11 Sep 2014 10:33]
liu hickey
This bug still exists in 5.6.16+, but with different reason. It's repeatable in single slave sql thread, that is --slave_parallel_workers=0. The HANG case is: FTWRL, then SHOW SLAVE STATUS hang, waiting rli->data_lock worker thread hang, waiting MDL_COMMIT but holding rli->data_lock! 7431 int Xid_log_event::do_apply_event(Relay_log_info const *rli) 7442 mysql_mutex_lock(&rli_ptr->data_lock); 7495 error= do_commit(the); 7502 mysql_mutex_unlock(&rli_ptr->data_lock); Obviously, it's a serious bug, as doing backup on SLAVE is so normal.
[11 Sep 2014 10:35]
liu hickey
Backtrace: hread 27 (Thread 0x7f7198240700 (LWP 18877)): #0 0x00000038f0c0b7cb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0 #1 0x0000000000662250 in MDL_wait::timed_wait(MDL_context_owner*, timespec*, bool, PSI_stage_info_v1 const*) () at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:1188 #2 0x00000000006640c2 in MDL_context::acquire_lock(MDL_request*, unsigned long) () at /u01/osconn/dev/alisql/sql/mdl.cc:2271 #3 0x00000000005b2fdf in ha_commit_trans(THD*, bool, bool) () at /u01/osconn/dev/alisql/sql/handler.cc:1474 #4 0x000000000078ce09 in trans_commit(THD*) () #5 0x000000000089111d in Xid_log_event::do_apply_event(Relay_log_info const*) () at /u01/osconn/dev/alisql/sql/log_event.cc:7344 #6 0x0000000000896bee in Log_event::apply_event(Relay_log_info*) () at /u01/osconn/dev/alisql/sql/log_event.cc:3031 #7 0x00000000008ceed9 in apply_event_and_update_pos(Log_event**, THD*, Relay_log_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3502 #8 0x00000000008d1e45 in handle_slave_sql () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3894 #9 0x0000000000b521c0 in pfs_spawn_thread () #10 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0 #11 0x00000038f08e68ed in clone () from /lib64/libc.so.6 Thread 25 (Thread 0x7f71981be700 (LWP 18892)): #0 0x00000038f0c0e074 in __lll_lock_wait () from /lib64/libpthread.so.0 #1 0x00000038f0c0937b in _L_lock_1023 () from /lib64/libpthread.so.0 #2 0x00000038f0c092fe in pthread_mutex_lock () from /lib64/libpthread.so.0 #3 0x00000000008c40dd in inline_mysql_mutex_lock.constprop.83 () at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:673 #4 0x00000000008cda84 in show_slave_status(THD*, Master_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:2683 #5 0x00000000006fa63d in mysql_execute_command(THD*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:3195 #6 0x00000000006fd498 in mysql_parse(THD*, char*, unsigned int, Parser_state*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:6729 #7 0x00000000006ff7c0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:1634 #8 0x00000000006c391d in do_handle_one_connection(THD*) () #9 0x00000000006c3969 in handle_one_connection () #10 0x0000000000b521c0 in pfs_spawn_thread () #11 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0 #12 0x00000038f08e68ed in clone () from /lib64/libc.so.6
[11 Sep 2014 12:50]
Alexey Kopytov
http://bugs.mysql.com/bug.php?id=70307 (currently closed as Won't Fix).
[11 Sep 2014 15:34]
liu hickey
With clarified --slave_parallel_workers=0, bug70307 described another hang case, even without client2 in the given test case.
[3 Nov 2014 18:02]
monty solomon
We experience the slave hang frequently now -- already twice today. Please fix the problem in 5.6.
[19 Dec 2014 3:04]
monty solomon
We experience multiple failures due to this bug every day in version 5.6. The fix in Oracle bug 19843808 is only for version 5.7. This bug should be reopened as severity S1 since there is no available workaround for 5.6. Please backport the fix to version 5.6. Thanks.
[19 Dec 2014 4:59]
Sujatha Sivakumar
Oracle bug 19843808 is fixed in 5.6 as well. Fix will be available in upcoming 5.6 release