MySQL Bugs: #68460: blocked with FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS

Bug #68460	blocked with FLUSH TABLES WITH READ LOCK + SHOW SLAVE STATUS
Submitted:	22 Feb 2013 6:37	Modified:	31 May 2013 22:02
Reporter:	Queen Aaa	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6.10	OS:	Linux (CentOS release 5.8 (Final))
Assigned to:		CPU Architecture:	Any

Description:
If a client thread on an slave does FLUSH TABLES WITH READ LOCK; 
then master does some updates,
SHOW SLAVE STATUS in the same client will be blocked.

mysql [localhost] {msandbox} ((none)) > show processlist;
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+-------------------+
| Id | User        | Host      | db   | Command | Time | State                                                                       | Info              |
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+-------------------+
|  1 | system user |           | NULL | Connect |   36 | Slave has read all relay log; waiting for the slave I/O thread to update it | NULL              |
|  2 | system user |           | NULL | Connect |  134 | Waiting for commit lock                                                     | NULL              |
|  3 | system user |           | NULL | Connect |  134 | Waiting for an event from Coordinator                                       | NULL              |
|  4 | system user |           | NULL | Connect |   46 | Waiting for an event from Coordinator                                       | NULL              |
|  5 | msandbox    | localhost | test | Query   |    2 | init                                                                        | show slave status |
|  6 | msandbox    | localhost | NULL | Query   |    0 | init                                                                        | show processlist  |
+----+-------------+-----------+------+---------+------+-----------------------------------------------------------------------------+-------------------+
6 rows in set (0.00 sec)
id:5

How to repeat:
step 1) setup replication

step 2) in the slave:
mysql [localhost] {msandbox} (test) > flush tables with read lock;
Query OK, 0 rows affected (0.01 sec)

step 3) in the master:
mysql [localhost] {msandbox} (test) > insert into test.t set id=4;
Query OK, 1 row affected (0.01 sec)

step 4) in the SAME session  of step 2：
mysql [localhost] {msandbox} (test) > show slave status;

it will be blocked.

Suggested fix:
don't know.
but it looks it bug 10942.

This is easy to repeat with MTR using the following simple test:

[openxs@chief mysql-test]$ cat t/bug68460.test
--source include/master-slave.inc

connection master;
create table t(id int);

--sync_slave_with_master

connection slave;
flush tables with read lock;

connection master;
insert into test.t set id=4;

connection slave;
show slave status;

connection master;
drop table t;

--sync_slave_with_master

This is what one gets when run it:

...
2013-02-22 10:54:20 2665 [Note] Shutting down plugin 'CSV'
MySQL Version 5.6.10
Checking supported features...
 - SSL connections supported
Collecting tests...
Checking leftover processes...
Removing old var directory...
Creating var directory '/home/openxs/dbs/5.6/mysql-test/var'...
Installing system database...
Using server port 59406

==============================================================================

TEST                                      RESULT   TIME (ms) or COMMENT
--------------------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
main.bug68460                            [ fail ]
        Test ended at 2013-02-22 10:59:30

CURRENT_TEST: main.bug68460
=== SHOW MASTER STATUS ===
---- 1. ----
File    slave-bin.000001
Position        216
Binlog_Do_DB
Binlog_Ignore_DB
Executed_Gtid_Set
==========================

=== SHOW SLAVE STATUS ===
---- 1. ----
Slave_IO_State  Waiting for master to send event
Master_Host     127.0.0.1
Master_User     root
Master_Port     13000
Connect_Retry   1
Master_Log_File master-bin.000001
Read_Master_Log_Pos     590
Relay_Log_File  slave-relay-bin.000002
Relay_Log_Pos   380
Relay_Master_Log_File   master-bin.000001
Slave_IO_Running        Yes
Slave_SQL_Running       Yes
Replicate_Do_DB
Replicate_Ignore_DB
Replicate_Do_Table
Replicate_Ignore_Table
Replicate_Wild_Do_Table
Replicate_Wild_Ignore_Table
Last_Errno      0
Last_Error
Skip_Counter    0
Exec_Master_Log_Pos     216
Relay_Log_Space 927
Until_Condition None
Until_Log_File
Until_Log_Pos   0
Master_SSL_Allowed      No
Master_SSL_CA_File
Master_SSL_CA_Path
Master_SSL_Cert
Master_SSL_Cipher
Master_SSL_Key
Seconds_Behind_Master   300
Master_SSL_Verify_Server_Cert   No
Last_IO_Errno   0
Last_IO_Error
Last_SQL_Errno  0
Last_SQL_Error
Replicate_Ignore_Server_Ids
Master_Server_Id        1
Master_UUID     76f0ac4e-7ccd-11e2-ab57-485b3952834f
Master_Info_File        /home/openxs/dbs/5.6/mysql-test/var/mysqld.2/data/master
.info
SQL_Delay       0
SQL_Remaining_Delay
Slave_SQL_Running_State Waiting for global read lock
Master_Retry_Count      10
Master_Bind
Last_IO_Error_Timestamp
Last_SQL_Error_Timestamp
Master_SSL_Crl
Master_SSL_Crlpath
Retrieved_Gtid_Set
Executed_Gtid_Set
Auto_Position   0
=========================

=== SHOW PROCESSLIST ===
---- 1. ----
Id      2
User    root
Host    localhost:49873
db      test
Command Sleep
Time    300
State
Info
---- 2. ----
Id      3
User    root
Host    localhost:49874
db      test
Command Sleep
Time    302
State
Info
---- 3. ----
Id      4
User    system user
Host
db
Command Connect
Time    301
State   Waiting for master to send event
Info
---- 4. ----
Id      5
User    system user
Host
db      test
Command Connect
Time    300
State   Waiting for global read lock
Info    insert into test.t set id=4
---- 5. ----
Id      6
User    root
Host    localhost:49880
db      test
Command Query
Time    0
State   init
Info    SHOW PROCESSLIST
---- 6. ----
Id      7
User    root
Host    localhost:49881
db      test
Command Sleep
Time    300
State
Info
========================

analyze: sync_with_master
mysqltest: At line 20: sync_slave_with_master failed: 'select master_pos_wait('m
aster-bin.000001', 590, 300)' returned -1 indicating timeout after 300 seconds
...

Looks like http://bugs.mysql.com/bug.php?id=67879 fixed in 5.7+ ?

Hm, my test case is probably wrong. It hangs upon sync with master because INSERT can not proceed. SHOW SLAVE STATUS does not hand.

Please, send your my.cnf file content from slave.

Now, proper test case. From what I see, we have to use tables to store replication info to repeat this. So, here is the test case:

[openxs@chief mysql-test]$ cat t/bug68460.test
--source include/master-slave.inc

connection master;
create database test2;
create table test.t(id int) engine=InnoDB;
create table test2.t(id int) engine=InnoDB;

--sync_slave_with_master

connection slave;
stop slave;
set global slave_parallel_workers=2;
set global master_info_repository='TABLE';
set global sync_master_info=1;
start slave;
flush tables with read lock;

connection master;
insert into test.t set id=4;
insert into test2.t set id=3;

connection slave;
show slave status;
show processlist;

This is how to run it:

./mtr  --mysqld=--relay_log_info_repository=TABLE bug68460

Then just connect to hanging slave and check processlist:

[openxs@chief ~]$ mysql -uroot --host=127.0.0.1 --port=13001
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 12
Server version: 5.6.10-log MySQL Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show processlist;
+----+-------------+-----------------+------+---------+------+---------------------------------------+-------------------+
| Id | User        | Host            | db   | Command | Time | State                                 | Info              |
+----+-------------+-----------------+------+---------+------+---------------------------------------+-------------------+
|  2 | root        | localhost:51048 | test | Sleep   |  322 |                                       | NULL              |
|  3 | root        | localhost:51049 | test | Sleep   |  323 |                                       | NULL              |
|  6 | root        | localhost:51055 | test | Query   |  320 | init                                  | show slave status |
|  7 | root        | localhost:51056 | test | Sleep   |  322 |                                       | NULL              |
|  8 | system user |                 | NULL | Connect |  321 | Waiting for commit lock               | NULL              |
|  9 | system user |                 | NULL | Connect |  321 | Reading event from the relay log      | NULL              |
| 10 | system user |                 | NULL | Connect |  321 | Waiting for an event from Coordinator | NULL              |
| 11 | system user |                 | NULL | Connect |  321 | Waiting for an event from Coordinator | NULL              |
| 12 | root        | localhost:51078 | NULL | Query   |    0 | init                                  | show processlist  |
+----+-------------+-----------------+------+---------+------+---------------------------------------+-------------------+
9 rows in set (0.00 sec)

mtr still hangs...

Hello Queen Aaa,

Thank you for the report.

Verified as described on reported version.

Thanks,
Umesh

Deadlock can be resolved doing "unlock tables" or closing session, but if a "flush logs" is done after "flush tables with read lock", the mysqld slave server process must be killed.

**When doing a mysqldump with --single-transaction and --flush-logs options and some replication traffic, often the replication threads get locked, the server must be killed and as a result the slave instance misses one GTID after restart.**

Added changelog entry to Reference Manual versions 5.6 and 5.7:

'After the client thread on a slave performed a FLUSH TABLES WITH READ LOCK and was followed by some updates on the master, the slave hung when executing SHOW SLAVE STATUS.'

Bug #69493 was marked as duplicate of this one.

5.6$ bzr log -r 5158
------------------------------------------------------------
revno: 5158
committer: Sujatha Sivakumar <sujatha.sivakumar@oracle.com>
branch nick: Bug16387720_mysql-5.6
timestamp: Fri 2013-05-24 06:23:14 +0530
message:
  Bug#16387720: BLOCKED WITH FLUSH TABLES WITH READ LOCK + 
  SHOW SLAVE STATUS
  
  Problem:
  =======
  If a client thread on a slave does FLUSH TABLES WITH READ
  LOCK; then master does some updates, SHOW SLAVE STATUS in
  the same client will be blocked. If the blocked slave server
  is killed and restarted when GTID's are enabled, one GTID
  gets missed leaving the slave in an out of sync state.
  
  Using 'relay-log-recovery=1' should do a crash safe slave
  recovery but it is not happening.
  
  Analysis:
  ========
  
  RPL info tables "slave_master_info" and "slave_relay_log_info"
  are used to store slave's thread's positions.  When a FTWRL
  is issued on slave this command blocks all operations on
  tables. Hence when some update operations are done on master
  the slave thread will be able to open the tables but while 
  closing, an internal commit operation will be blocked. These
  blocked info tables make the "SHOW SLAVE STATUS" to hang.
  
  When "relay-log-recovery=1" during crash recovery all the
  partial written events will be discarded the master info
  table is initialised with the information read from relay
  log info table. The "Retrieved GTID" set should be cleared
  so that partial read events are discarded and they are
  fetched once again. Since this is not happening
  "The Retried GTID" is considered to be executed and the 
  actual transaction is skipped.
  
  Fix:
  ===
  
  Info tables should be made independent of global read lock.
  At the time of opening the RPL info tables the following
  "MYSQL_OPEN_IGNORE_GLOBAL_READ_LOCK" flag is set to make
  the info tables not to block when the FTWRL is in progress.
  Hence a similar flag is introduced in "ha_commit_trans"
  code which will allow commit to complete even if a global
  read lock is active. This flag can be used to allow changes
  to internal tables (e.g. slave status tables).
  
  To fix the missing GTID issue using "relay-log-recovery"
  option, during recovery process retrieved GTID set is cleared.

Also be advised that the Percona hotbackup tool will issue a 'FLUSH TABLES....' if you do not specify the "--no-lock" option.

Be very careful if you are using a tool, such as nagios to check your slave status whilst taking hot backups.

Note: this bug is closed but it's not very clear form the reference to the bzr log number at what version this was actually fixed. If possible could you confirm in what versions of MySQL 5.6 this has been resolved?

Thanks.

Fixed in 5.6.13 and 5.7.2

http://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-13.html
http://dev.mysql.com/doc/relnotes/mysql/5.7/en/news-5-7-2.html

Replication: After the client thread on a slave performed a FLUSH TABLES WITH READ LOCK and was followed by some updates on the master, the slave hung when executing SHOW SLAVE STATUS. (Bug #68460, Bug #16387720)

This bug still exists in 5.6.16+, but with different reason.

It's repeatable in single slave sql thread, that is --slave_parallel_workers=0.

The HANG case is:

FTWRL, then SHOW SLAVE STATUS hang, waiting rli->data_lock

worker thread hang, waiting MDL_COMMIT but holding rli->data_lock!

 7431 int Xid_log_event::do_apply_event(Relay_log_info const *rli)
 7442   mysql_mutex_lock(&rli_ptr->data_lock);
 7495   error= do_commit(the);
 7502   mysql_mutex_unlock(&rli_ptr->data_lock);

Obviously, it's a serious bug, as doing backup on SLAVE is so normal.

Backtrace:

hread 27 (Thread 0x7f7198240700 (LWP 18877)):
#0  0x00000038f0c0b7cb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x0000000000662250 in MDL_wait::timed_wait(MDL_context_owner*, timespec*, bool, PSI_stage_info_v1 const*) ()
    at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:1188
#2  0x00000000006640c2 in MDL_context::acquire_lock(MDL_request*, unsigned long) () at /u01/osconn/dev/alisql/sql/mdl.cc:2271
#3  0x00000000005b2fdf in ha_commit_trans(THD*, bool, bool) () at /u01/osconn/dev/alisql/sql/handler.cc:1474
#4  0x000000000078ce09 in trans_commit(THD*) ()
#5  0x000000000089111d in Xid_log_event::do_apply_event(Relay_log_info const*) () at /u01/osconn/dev/alisql/sql/log_event.cc:7344
#6  0x0000000000896bee in Log_event::apply_event(Relay_log_info*) () at /u01/osconn/dev/alisql/sql/log_event.cc:3031
#7  0x00000000008ceed9 in apply_event_and_update_pos(Log_event**, THD*, Relay_log_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3502
#8  0x00000000008d1e45 in handle_slave_sql () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:3894
#9  0x0000000000b521c0 in pfs_spawn_thread ()
#10 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0
#11 0x00000038f08e68ed in clone () from /lib64/libc.so.6

Thread 25 (Thread 0x7f71981be700 (LWP 18892)):
#0  0x00000038f0c0e074 in __lll_lock_wait () from /lib64/libpthread.so.0
#1  0x00000038f0c0937b in _L_lock_1023 () from /lib64/libpthread.so.0
#2  0x00000038f0c092fe in pthread_mutex_lock () from /lib64/libpthread.so.0
#3  0x00000000008c40dd in inline_mysql_mutex_lock.constprop.83 () at /u01/osconn/dev/alisql/include/mysql/psi/mysql_thread.h:673
#4  0x00000000008cda84 in show_slave_status(THD*, Master_info*) () at /u01/osconn/dev/alisql/sql/rpl_slave.cc:2683
#5  0x00000000006fa63d in mysql_execute_command(THD*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:3195
#6  0x00000000006fd498 in mysql_parse(THD*, char*, unsigned int, Parser_state*) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:6729
#7  0x00000000006ff7c0 in dispatch_command(enum_server_command, THD*, char*, unsigned int) () at /u01/osconn/dev/alisql/sql/sql_parse.cc:1634
#8  0x00000000006c391d in do_handle_one_connection(THD*) ()
#9  0x00000000006c3969 in handle_one_connection ()
#10 0x0000000000b521c0 in pfs_spawn_thread ()
#11 0x00000038f0c077e1 in start_thread () from /lib64/libpthread.so.0
#12 0x00000038f08e68ed in clone () from /lib64/libc.so.6

http://bugs.mysql.com/bug.php?id=70307 (currently closed as Won't Fix).

With clarified --slave_parallel_workers=0, bug70307 described another hang case, even without client2 in the given test case.

We experience the slave hang frequently now -- already twice today. Please fix the problem in 5.6.

We experience multiple failures due to this bug every day in version 5.6.

The fix in Oracle bug 19843808 is only for version 5.7.

This bug should be reopened as severity S1 since there is no available workaround for 5.6.

Please backport the fix to version 5.6.

Thanks.

Oracle bug 19843808  is fixed in 5.6 as well.  Fix will be available in upcoming 5.6 release