The bug was updated successfully. The following people were notified: the MySQL developers, the bug reporter, the assigned developer, interested observers, and nobody else.
Bug #90857 | Master dump thread stuck in 'Waiting to finalize termination' state | ||
---|---|---|---|
Submitted: | 14 May 2018 10:25 | Modified: | 5 Jun 2019 3:33 |
Reporter: | Prasad N | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
Version: | 5.7.21 | OS: | CentOS (CentOS (2.6.32-696.23.1.el6.x86_64 )) |
Assigned to: | MySQL Verification Team | CPU Architecture: | x86 |
Tags: | semi synchronous replication |
[14 May 2018 10:25]
Prasad N
[14 May 2018 10:27]
Prasad N
Master status ------------- mysql> show variables like 'rpl%'; +-------------------------------------------+------------+ | Variable_name | Value | +-------------------------------------------+------------+ | rpl_semi_sync_master_enabled | ON | | rpl_semi_sync_master_timeout | 300000 | | rpl_semi_sync_master_trace_level | 32 | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_master_wait_point | AFTER_SYNC | | rpl_semi_sync_slave_enabled | OFF | | rpl_semi_sync_slave_trace_level | 32 | | rpl_stop_slave_timeout | 31536000 | +-------------------------------------------+------------+ 9 rows in set (0.00 sec) mysql> show status like 'rpl%'; +--------------------------------------------+---------+ | Variable_name | Value | +--------------------------------------------+---------+ | Rpl_semi_sync_master_clients | 2 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 6923 | | Rpl_semi_sync_master_no_times | 0 | | Rpl_semi_sync_master_no_tx | 0 | | Rpl_semi_sync_master_status | ON | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 2058 | | Rpl_semi_sync_master_tx_wait_time | 9300188 | | Rpl_semi_sync_master_tx_waits | 4519 | | Rpl_semi_sync_master_wait_pos_backtraverse | 0 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 4560 | | Rpl_semi_sync_slave_status | OFF | +--------------------------------------------+---------+ 15 rows in set (0.00 sec) mysql> show master status; +------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ | mysql-bin.000002 | 2443391 | | | 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-8665, 770ca92e-5730-11e8-9f7a-000d3a064af1:1-2 | +------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> show processlist; +------+-----------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +------+-----------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+ | 113 | repl_user | 172.19.0.12:58174 | NULL | Binlog Dump GTID | 3189 | Waiting to finalize termination | NULL | | 593 | repl_user | 172.19.0.11:45820 | NULL | Binlog Dump GTID | 2081 | Master has sent all binlog to slave; waiting for more updates | NULL | | 814 | root | 127.0.0.1:37274 | NULL | Sleep | 1599 | | NULL | | 863 | root | 127.0.0.1:37237 | NULL | Sleep | 1479 | | NULL | | 889 | root | 127.0.0.1:54642 | NULL | Sleep | 1419 | | NULL | | 974 | root | 127.0.0.1:40517 | NULL | Sleep | 1238 | | NULL | | 1029 | root | 127.0.0.1:56677 | NULL | Sleep | 1119 | | NULL | | 1079 | root | 127.0.0.1:50274 | NULL | Sleep | 1000 | | NULL | | 1161 | root | 127.0.0.1:60835 | NULL | Sleep | 821 | | NULL | | 1245 | root | 127.0.0.1:55190 | NULL | Sleep | 639 | | NULL | | 1267 | root | 127.0.0.1:48516 | NULL | Sleep | 580 | | NULL | | 1285 | root | 127.0.0.1:48367 | NULL | Sleep | 540 | | NULL | | 1315 | admin | 13.228.235.229:59667 | demo | Sleep | 0 | | NULL | | 1316 | admin | 13.228.235.229:59668 | demo | Sleep | 0 | | NULL | | 1317 | admin | 13.228.235.229:59670 | demo | Sleep | 0 | | NULL | | 1318 | admin | 13.228.235.229:59666 | demo | Sleep | 0 | | NULL | | 1319 | admin | 13.228.235.229:59669 | demo | Sleep | 0 | | NULL | | 1343 | repl_user | 172.19.0.12:35394 | NULL | Binlog Dump GTID | 418 | starting | NULL | | 1382 | repl_user | 172.19.0.12:35572 | NULL | Binlog Dump GTID | 328 | starting | NULL | | 1391 | root | 127.0.0.1:55518 | NULL | Query | 0 | starting | show processlist | | 1408 | repl_user | 172.19.0.12:35690 | NULL | Binlog Dump GTID | 268 | starting | NULL | | 1442 | repl_user | 172.19.0.12:35802 | NULL | Binlog Dump GTID | 189 | starting | NULL | | 1467 | repl_user | 172.19.0.12:35920 | NULL | Binlog Dump GTID | 128 | starting | NULL | | 1493 | repl_user | 172.19.0.12:36044 | NULL | Binlog Dump GTID | 68 | starting | NULL | | 1514 | repl_user | 172.19.0.12:36160 | NULL | Binlog Dump GTID | 7 | starting | NULL | +------+-----------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+ 25 rows in set (0.00 sec)
[14 May 2018 10:29]
Prasad N
Slave-1 details ---------------- mysql> show slave status \G; *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 172.19.0.13 Master_User: repl_user Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000002 Read_Master_Log_Pos: 1935024 Relay_Log_File: relay-log-slave.000002 Relay_Log_Pos: 1934069 Relay_Master_Log_File: mysql-bin.000002 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 1933896 Relay_Log_Space: 1935404 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 112 Master_UUID: 75aa0a2c-5730-11e8-9eb4-000d3a068798 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-6865 Executed_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-6865, 770ca92e-5730-11e8-9f7a-000d3a064af1:1-2 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec) ERROR: No query specified mysql> show variables like 'rpl%'; +-------------------------------------------+------------+ | Variable_name | Value | +-------------------------------------------+------------+ | rpl_semi_sync_master_enabled | OFF | | rpl_semi_sync_master_timeout | 300000 | | rpl_semi_sync_master_trace_level | 32 | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_master_wait_point | AFTER_SYNC | | rpl_semi_sync_slave_enabled | ON | | rpl_semi_sync_slave_trace_level | 32 | | rpl_stop_slave_timeout | 31536000 | +-------------------------------------------+------------+ 9 rows in set (0.00 sec) mysql> show status like 'rpl%'; +--------------------------------------------+-------+ | Variable_name | Value | +--------------------------------------------+-------+ | Rpl_semi_sync_master_clients | 0 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 0 | | Rpl_semi_sync_master_no_times | 0 | | Rpl_semi_sync_master_no_tx | 0 | | Rpl_semi_sync_master_status | OFF | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 0 | | Rpl_semi_sync_master_tx_wait_time | 0 | | Rpl_semi_sync_master_tx_waits | 0 | | Rpl_semi_sync_master_wait_pos_backtraverse | 0 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 0 | | Rpl_semi_sync_slave_status | ON | +--------------------------------------------+-------+ 15 rows in set (0.00 sec)
[14 May 2018 10:29]
Prasad N
Slave-2 details --->> This is the slave that is stuck mysql> show slave status \G; *************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: 172.19.0.13 Master_User: repl_user Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000002 Read_Master_Log_Pos: 816880 Relay_Log_File: relay-log-slave.000002 Relay_Log_Pos: 817053 Relay_Master_Log_File: mysql-bin.000002 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 816880 Relay_Log_Space: 817260 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: Master_SSL_CA_Path: Master_SSL_Cert: Master_SSL_Cipher: Master_SSL_Key: Seconds_Behind_Master: 0 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 112 Master_UUID: 75aa0a2c-5730-11e8-9eb4-000d3a068798 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-2898 Executed_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-2898, 770ca92e-5730-11e8-9f7a-000d3a064af1:1-2 Auto_Position: 1 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec) ERROR: No query specified mysql> show status like 'rpl%'; +--------------------------------------------+-------+ | Variable_name | Value | +--------------------------------------------+-------+ | Rpl_semi_sync_master_clients | 0 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 0 | | Rpl_semi_sync_master_no_times | 0 | | Rpl_semi_sync_master_no_tx | 0 | | Rpl_semi_sync_master_status | OFF | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 0 | | Rpl_semi_sync_master_tx_wait_time | 0 | | Rpl_semi_sync_master_tx_waits | 0 | | Rpl_semi_sync_master_wait_pos_backtraverse | 0 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 0 | | Rpl_semi_sync_slave_status | ON | +--------------------------------------------+-------+ 15 rows in set (0.01 sec) mysql> show varibles like 'rpl%'; ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'varibles like 'rpl%'' at line 1 mysql> show variables like 'rpl%'; +-------------------------------------------+------------+ | Variable_name | Value | +-------------------------------------------+------------+ | rpl_semi_sync_master_enabled | OFF | | rpl_semi_sync_master_timeout | 300000 | | rpl_semi_sync_master_trace_level | 32 | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_master_wait_point | AFTER_SYNC | | rpl_semi_sync_slave_enabled | ON | | rpl_semi_sync_slave_trace_level | 32 | | rpl_stop_slave_timeout | 31536000 | +-------------------------------------------+------------+ 9 rows in set (0.00 sec) mysql>
[16 May 2018 5:31]
Prasad N
mysqld logs from maser
Attachment: master-mysqld.log (application/octet-stream, text), 35.62 KiB.
[16 May 2018 5:32]
Prasad N
mysqld logs from bad slave
Attachment: bad-slave-mysqld.log (application/octet-stream, text), 11.98 KiB.
[16 May 2018 5:32]
Prasad N
mysqld logs from good slave
Attachment: good-slave-mysqld.log (application/octet-stream, text), 26.70 KiB.
[17 May 2018 4:00]
Prasad N
Could this be please assigned to a developer for investigation. I have a system where I am consistently able to reproduce this, so if it is required to collect more info I can do that or try to provide access to the developer if possible.
[22 May 2018 13:01]
MySQL Verification Team
Hi, I cannot reproduce this, it appears to be related to Azure, seen similar issues reported that happen only on Azure. Are you able to reproduce this issue on any other platform (Oracle Cloud for e.g. or Proxmox/VmWare/VirtualBox or bare metal setup)? thanks Bogdan
[23 May 2018 1:37]
Prasad N
I have not tried this on other platforms as our test and prod platforms are azure. We have a replica set available where this is consistently being reproduced. We can provide access to that system. Could you directly debug on that system ?
[28 May 2018 5:01]
Prasad N
Hi - do you have further update on this ? Please let me know how you would like me to pass on the access credentials for the system where this is happening. Thanks! Prasad
[28 May 2018 16:21]
MySQL Verification Team
No further update yet. Azure is not among supported platforms. What binary are you actually using, the one from oracle.com or one from Microsoft? Did you try getting support from Microsoft, especially if you are using their version of mysql binaries that they changed so they can work on Azure? I cannot reproduce your issue using binaries supplied by oracle.com running on any of the supported platforms. all best Bogdan
[30 May 2018 16:07]
Prasad N
Hi - We are using the binaries from oracle.com Is there a pointer, link I can look up to understand the supported platforms for this? Thanks!
[15 Jun 2018 19:01]
Prasad N
Hi - Can you share an update on this ? Thanks!
[9 Aug 2018 17:28]
Prasad N
Hi - Any updates on this ? I am able to reproduce this on AWS VMs also . Thanks!
[10 Aug 2018 0:30]
MySQL Verification Team
I'm sorry but I cannot reproduce this. It is most probably related to your environment. best regards Bogdan
[10 Aug 2018 7:21]
Prasad N
Can you be more specific please ? For example - the replication connection from the salve are stuck with the status as below when I do show processlist on the master *************************** 42. row *************************** Id: 4111 User: repl_user Host: 172.31.9.126:39626 db: NULL Command: Killed Time: 1277 State: Waiting to finalize termination Info: NULL *************************** 43. row *************************** Id: 4131 User: repl_user Host: 172.31.9.126:39630 db: NULL Command: Binlog Dump GTID Time: 1217 State: Waiting to finalize termination Info: NULL *************************** 44. row *************************** Id: 4151 User: repl_user Host: 172.31.9.126:39634 db: NULL Command: Binlog Dump GTID Time: 1155 State: Waiting to finalize termination Info: NULL *************************** 45. row *************************** These processes do not go away even if I kill them - as you can see for the id 4111 - command shows Killed, but the process do not go away. Now the question is why does MySQL get into this kind of a dead-lock ? Can you investigate from the source code perspective and let me know what kind of scenarios or environments this can happen ? Thanks in advance.
[10 Aug 2018 7:23]
Prasad N
We also have a live system where this can be consistently reproduced. We can provide the access to those systems or work with you for an online session to work together. Could you please let me know if this can be done ? Thanks!
[10 Aug 2018 7:41]
Prasad N
By the way, a similar issue is reported here as well. https://www.percona.com/forums/questions-discussions/mysql-and-percona-server/percona-serv...
[10 Aug 2018 11:59]
MySQL Verification Team
Hi, > We also have a live system where this can be consistently reproduced. I noticed a number of reports on Azure and AWS where this and similar bugs are reported but as I cannot reproduce this on a bare metal I cannot verify that it's a bug we are talking about. > We can provide the access to those systems or work with you for an online > session to work together. Could you please let me know if this can be done ? No this is against Oracle policy, we are not allowed to "mess around your machines". Also the linked question on percona system also shows that they are unable to reproduce the problem on bare metal too. Thanks Bogdan
[31 May 2019 14:44]
Frederic Descamps
Hi Prasad, I've checked with development and this is what's happening: (1) one of the dump threads begins to exit. (2) it gets stuck while exiting. (3) slave tries to reconnect, and the reconnect attempts pile up on master. (1) is certainly happening because there are lots of messages about error reading communication packets, and eventually net_flush failed. (3) happens because the new dump threads will wait for the old ones to finish. So the issue is in (2) that might happen if you freeze all the IO (full snapshot in your cloud env?). A stack dumps will be required here. If you could provide one that might help. Thank you.
[1 Jun 2019 18:10]
Prasad N
Hi Frederic Thanks a lot for looking into this. Your analysis about 1, 2 and 3 are fully right and matches with my observation. However, I am not doing any operations like full snapshot that freezes the IO. Additional observations are that the master quickly recovers from this deadlock once I stop the write activity on it. It also recovers if I stop the second semi-sync slave. The first slave immediately connects to the master and starts replicating. When I bring back the second slave, it runs into deadlock issue though. I will try and collect the stack trace for this condition and share soon. Thanks again.
[5 Jun 2019 3:30]
Prasad N
stack dump for master stuck in Waiting to finalize termination state
Attachment: master_stack_dump.log (application/octet-stream, text), 335.13 KiB.
[5 Jun 2019 3:33]
Prasad N
Hi Frederic - I have attached the stack dump for master when it is stuck in state 'Waiting to finalize termination': master_stack_dump.log Could you please take a look ? Thanks!