The bug was updated successfully. The following people were notified: the MySQL developers, the bug reporter, the assigned developer, interested observers, and nobody else.
| Bug #90857 | Master dump thread stuck in 'Waiting to finalize termination' state | ||
|---|---|---|---|
| Submitted: | 14 May 2018 10:25 | Modified: | 5 Jun 2019 3:33 |
| Reporter: | Prasad N | Email Updates: | |
| Status: | Can't repeat | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S1 (Critical) |
| Version: | 5.7.21 | OS: | CentOS (CentOS (2.6.32-696.23.1.el6.x86_64 )) |
| Assigned to: | MySQL Verification Team | CPU Architecture: | x86 |
| Tags: | semi synchronous replication | ||
[14 May 2018 10:27]
Prasad N
Master status ------------- mysql> show variables like 'rpl%'; +-------------------------------------------+------------+ | Variable_name | Value | +-------------------------------------------+------------+ | rpl_semi_sync_master_enabled | ON | | rpl_semi_sync_master_timeout | 300000 | | rpl_semi_sync_master_trace_level | 32 | | rpl_semi_sync_master_wait_for_slave_count | 1 | | rpl_semi_sync_master_wait_no_slave | ON | | rpl_semi_sync_master_wait_point | AFTER_SYNC | | rpl_semi_sync_slave_enabled | OFF | | rpl_semi_sync_slave_trace_level | 32 | | rpl_stop_slave_timeout | 31536000 | +-------------------------------------------+------------+ 9 rows in set (0.00 sec) mysql> show status like 'rpl%'; +--------------------------------------------+---------+ | Variable_name | Value | +--------------------------------------------+---------+ | Rpl_semi_sync_master_clients | 2 | | Rpl_semi_sync_master_net_avg_wait_time | 0 | | Rpl_semi_sync_master_net_wait_time | 0 | | Rpl_semi_sync_master_net_waits | 6923 | | Rpl_semi_sync_master_no_times | 0 | | Rpl_semi_sync_master_no_tx | 0 | | Rpl_semi_sync_master_status | ON | | Rpl_semi_sync_master_timefunc_failures | 0 | | Rpl_semi_sync_master_tx_avg_wait_time | 2058 | | Rpl_semi_sync_master_tx_wait_time | 9300188 | | Rpl_semi_sync_master_tx_waits | 4519 | | Rpl_semi_sync_master_wait_pos_backtraverse | 0 | | Rpl_semi_sync_master_wait_sessions | 0 | | Rpl_semi_sync_master_yes_tx | 4560 | | Rpl_semi_sync_slave_status | OFF | +--------------------------------------------+---------+ 15 rows in set (0.00 sec) mysql> show master status; +------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | +------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ | mysql-bin.000002 | 2443391 | | | 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-8665, 770ca92e-5730-11e8-9f7a-000d3a064af1:1-2 | +------------------+----------+--------------+------------------+---------------------------------------------------------------------------------------+ 1 row in set (0.00 sec) mysql> show processlist; +------+-----------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+ | Id | User | Host | db | Command | Time | State | Info | +------+-----------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+ | 113 | repl_user | 172.19.0.12:58174 | NULL | Binlog Dump GTID | 3189 | Waiting to finalize termination | NULL | | 593 | repl_user | 172.19.0.11:45820 | NULL | Binlog Dump GTID | 2081 | Master has sent all binlog to slave; waiting for more updates | NULL | | 814 | root | 127.0.0.1:37274 | NULL | Sleep | 1599 | | NULL | | 863 | root | 127.0.0.1:37237 | NULL | Sleep | 1479 | | NULL | | 889 | root | 127.0.0.1:54642 | NULL | Sleep | 1419 | | NULL | | 974 | root | 127.0.0.1:40517 | NULL | Sleep | 1238 | | NULL | | 1029 | root | 127.0.0.1:56677 | NULL | Sleep | 1119 | | NULL | | 1079 | root | 127.0.0.1:50274 | NULL | Sleep | 1000 | | NULL | | 1161 | root | 127.0.0.1:60835 | NULL | Sleep | 821 | | NULL | | 1245 | root | 127.0.0.1:55190 | NULL | Sleep | 639 | | NULL | | 1267 | root | 127.0.0.1:48516 | NULL | Sleep | 580 | | NULL | | 1285 | root | 127.0.0.1:48367 | NULL | Sleep | 540 | | NULL | | 1315 | admin | 13.228.235.229:59667 | demo | Sleep | 0 | | NULL | | 1316 | admin | 13.228.235.229:59668 | demo | Sleep | 0 | | NULL | | 1317 | admin | 13.228.235.229:59670 | demo | Sleep | 0 | | NULL | | 1318 | admin | 13.228.235.229:59666 | demo | Sleep | 0 | | NULL | | 1319 | admin | 13.228.235.229:59669 | demo | Sleep | 0 | | NULL | | 1343 | repl_user | 172.19.0.12:35394 | NULL | Binlog Dump GTID | 418 | starting | NULL | | 1382 | repl_user | 172.19.0.12:35572 | NULL | Binlog Dump GTID | 328 | starting | NULL | | 1391 | root | 127.0.0.1:55518 | NULL | Query | 0 | starting | show processlist | | 1408 | repl_user | 172.19.0.12:35690 | NULL | Binlog Dump GTID | 268 | starting | NULL | | 1442 | repl_user | 172.19.0.12:35802 | NULL | Binlog Dump GTID | 189 | starting | NULL | | 1467 | repl_user | 172.19.0.12:35920 | NULL | Binlog Dump GTID | 128 | starting | NULL | | 1493 | repl_user | 172.19.0.12:36044 | NULL | Binlog Dump GTID | 68 | starting | NULL | | 1514 | repl_user | 172.19.0.12:36160 | NULL | Binlog Dump GTID | 7 | starting | NULL | +------+-----------+----------------------+------+------------------+------+---------------------------------------------------------------+------------------+ 25 rows in set (0.00 sec)
[14 May 2018 10:29]
Prasad N
Slave-1 details
----------------
mysql> show slave status \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.19.0.13
Master_User: repl_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 1935024
Relay_Log_File: relay-log-slave.000002
Relay_Log_Pos: 1934069
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 1933896
Relay_Log_Space: 1935404
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 112
Master_UUID: 75aa0a2c-5730-11e8-9eb4-000d3a068798
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-6865
Executed_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-6865,
770ca92e-5730-11e8-9f7a-000d3a064af1:1-2
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
ERROR:
No query specified
mysql> show variables like 'rpl%';
+-------------------------------------------+------------+
| Variable_name | Value |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled | OFF |
| rpl_semi_sync_master_timeout | 300000 |
| rpl_semi_sync_master_trace_level | 32 |
| rpl_semi_sync_master_wait_for_slave_count | 1 |
| rpl_semi_sync_master_wait_no_slave | ON |
| rpl_semi_sync_master_wait_point | AFTER_SYNC |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
| rpl_stop_slave_timeout | 31536000 |
+-------------------------------------------+------------+
9 rows in set (0.00 sec)
mysql> show status like 'rpl%';
+--------------------------------------------+-------+
| Variable_name | Value |
+--------------------------------------------+-------+
| Rpl_semi_sync_master_clients | 0 |
| Rpl_semi_sync_master_net_avg_wait_time | 0 |
| Rpl_semi_sync_master_net_wait_time | 0 |
| Rpl_semi_sync_master_net_waits | 0 |
| Rpl_semi_sync_master_no_times | 0 |
| Rpl_semi_sync_master_no_tx | 0 |
| Rpl_semi_sync_master_status | OFF |
| Rpl_semi_sync_master_timefunc_failures | 0 |
| Rpl_semi_sync_master_tx_avg_wait_time | 0 |
| Rpl_semi_sync_master_tx_wait_time | 0 |
| Rpl_semi_sync_master_tx_waits | 0 |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0 |
| Rpl_semi_sync_master_wait_sessions | 0 |
| Rpl_semi_sync_master_yes_tx | 0 |
| Rpl_semi_sync_slave_status | ON |
+--------------------------------------------+-------+
15 rows in set (0.00 sec)
[14 May 2018 10:29]
Prasad N
Slave-2 details --->> This is the slave that is stuck
mysql> show slave status \G;
*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: 172.19.0.13
Master_User: repl_user
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000002
Read_Master_Log_Pos: 816880
Relay_Log_File: relay-log-slave.000002
Relay_Log_Pos: 817053
Relay_Master_Log_File: mysql-bin.000002
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 816880
Relay_Log_Space: 817260
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 112
Master_UUID: 75aa0a2c-5730-11e8-9eb4-000d3a068798
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Slave has read all relay log; waiting for more updates
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-2898
Executed_Gtid_Set: 75aa0a2c-5730-11e8-9eb4-000d3a068798:1-2898,
770ca92e-5730-11e8-9f7a-000d3a064af1:1-2
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
1 row in set (0.00 sec)
ERROR:
No query specified
mysql> show status like 'rpl%';
+--------------------------------------------+-------+
| Variable_name | Value |
+--------------------------------------------+-------+
| Rpl_semi_sync_master_clients | 0 |
| Rpl_semi_sync_master_net_avg_wait_time | 0 |
| Rpl_semi_sync_master_net_wait_time | 0 |
| Rpl_semi_sync_master_net_waits | 0 |
| Rpl_semi_sync_master_no_times | 0 |
| Rpl_semi_sync_master_no_tx | 0 |
| Rpl_semi_sync_master_status | OFF |
| Rpl_semi_sync_master_timefunc_failures | 0 |
| Rpl_semi_sync_master_tx_avg_wait_time | 0 |
| Rpl_semi_sync_master_tx_wait_time | 0 |
| Rpl_semi_sync_master_tx_waits | 0 |
| Rpl_semi_sync_master_wait_pos_backtraverse | 0 |
| Rpl_semi_sync_master_wait_sessions | 0 |
| Rpl_semi_sync_master_yes_tx | 0 |
| Rpl_semi_sync_slave_status | ON |
+--------------------------------------------+-------+
15 rows in set (0.01 sec)
mysql> show varibles like 'rpl%';
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'varibles like 'rpl%'' at line 1
mysql> show variables like 'rpl%';
+-------------------------------------------+------------+
| Variable_name | Value |
+-------------------------------------------+------------+
| rpl_semi_sync_master_enabled | OFF |
| rpl_semi_sync_master_timeout | 300000 |
| rpl_semi_sync_master_trace_level | 32 |
| rpl_semi_sync_master_wait_for_slave_count | 1 |
| rpl_semi_sync_master_wait_no_slave | ON |
| rpl_semi_sync_master_wait_point | AFTER_SYNC |
| rpl_semi_sync_slave_enabled | ON |
| rpl_semi_sync_slave_trace_level | 32 |
| rpl_stop_slave_timeout | 31536000 |
+-------------------------------------------+------------+
9 rows in set (0.00 sec)
mysql>
[16 May 2018 5:31]
Prasad N
mysqld logs from maser
Attachment: master-mysqld.log (application/octet-stream, text), 35.62 KiB.
[16 May 2018 5:32]
Prasad N
mysqld logs from bad slave
Attachment: bad-slave-mysqld.log (application/octet-stream, text), 11.98 KiB.
[16 May 2018 5:32]
Prasad N
mysqld logs from good slave
Attachment: good-slave-mysqld.log (application/octet-stream, text), 26.70 KiB.
[17 May 2018 4:00]
Prasad N
Could this be please assigned to a developer for investigation. I have a system where I am consistently able to reproduce this, so if it is required to collect more info I can do that or try to provide access to the developer if possible.
[22 May 2018 13:01]
MySQL Verification Team
Hi, I cannot reproduce this, it appears to be related to Azure, seen similar issues reported that happen only on Azure. Are you able to reproduce this issue on any other platform (Oracle Cloud for e.g. or Proxmox/VmWare/VirtualBox or bare metal setup)? thanks Bogdan
[23 May 2018 1:37]
Prasad N
I have not tried this on other platforms as our test and prod platforms are azure. We have a replica set available where this is consistently being reproduced. We can provide access to that system. Could you directly debug on that system ?
[28 May 2018 5:01]
Prasad N
Hi - do you have further update on this ? Please let me know how you would like me to pass on the access credentials for the system where this is happening. Thanks! Prasad
[28 May 2018 16:21]
MySQL Verification Team
No further update yet. Azure is not among supported platforms. What binary are you actually using, the one from oracle.com or one from Microsoft? Did you try getting support from Microsoft, especially if you are using their version of mysql binaries that they changed so they can work on Azure? I cannot reproduce your issue using binaries supplied by oracle.com running on any of the supported platforms. all best Bogdan
[30 May 2018 16:07]
Prasad N
Hi - We are using the binaries from oracle.com Is there a pointer, link I can look up to understand the supported platforms for this? Thanks!
[15 Jun 2018 19:01]
Prasad N
Hi - Can you share an update on this ? Thanks!
[9 Aug 2018 17:28]
Prasad N
Hi - Any updates on this ? I am able to reproduce this on AWS VMs also . Thanks!
[10 Aug 2018 0:30]
MySQL Verification Team
I'm sorry but I cannot reproduce this. It is most probably related to your environment. best regards Bogdan
[10 Aug 2018 7:21]
Prasad N
Can you be more specific please ?
For example - the replication connection from the salve are stuck with the status as below when I do show processlist on the master
*************************** 42. row ***************************
Id: 4111
User: repl_user
Host: 172.31.9.126:39626
db: NULL
Command: Killed
Time: 1277
State: Waiting to finalize termination
Info: NULL
*************************** 43. row ***************************
Id: 4131
User: repl_user
Host: 172.31.9.126:39630
db: NULL
Command: Binlog Dump GTID
Time: 1217
State: Waiting to finalize termination
Info: NULL
*************************** 44. row ***************************
Id: 4151
User: repl_user
Host: 172.31.9.126:39634
db: NULL
Command: Binlog Dump GTID
Time: 1155
State: Waiting to finalize termination
Info: NULL
*************************** 45. row ***************************
These processes do not go away even if I kill them - as you can see for the id 4111 - command shows Killed, but the process do not go away.
Now the question is why does MySQL get into this kind of a dead-lock ? Can you investigate from the source code perspective and let me know what kind of scenarios or environments this can happen ?
Thanks in advance.
[10 Aug 2018 7:23]
Prasad N
We also have a live system where this can be consistently reproduced. We can provide the access to those systems or work with you for an online session to work together. Could you please let me know if this can be done ? Thanks!
[10 Aug 2018 7:41]
Prasad N
By the way, a similar issue is reported here as well. https://www.percona.com/forums/questions-discussions/mysql-and-percona-server/percona-serv...
[10 Aug 2018 11:59]
MySQL Verification Team
Hi, > We also have a live system where this can be consistently reproduced. I noticed a number of reports on Azure and AWS where this and similar bugs are reported but as I cannot reproduce this on a bare metal I cannot verify that it's a bug we are talking about. > We can provide the access to those systems or work with you for an online > session to work together. Could you please let me know if this can be done ? No this is against Oracle policy, we are not allowed to "mess around your machines". Also the linked question on percona system also shows that they are unable to reproduce the problem on bare metal too. Thanks Bogdan
[31 May 2019 14:44]
Frederic Descamps
Hi Prasad, I've checked with development and this is what's happening: (1) one of the dump threads begins to exit. (2) it gets stuck while exiting. (3) slave tries to reconnect, and the reconnect attempts pile up on master. (1) is certainly happening because there are lots of messages about error reading communication packets, and eventually net_flush failed. (3) happens because the new dump threads will wait for the old ones to finish. So the issue is in (2) that might happen if you freeze all the IO (full snapshot in your cloud env?). A stack dumps will be required here. If you could provide one that might help. Thank you.
[1 Jun 2019 18:10]
Prasad N
Hi Frederic Thanks a lot for looking into this. Your analysis about 1, 2 and 3 are fully right and matches with my observation. However, I am not doing any operations like full snapshot that freezes the IO. Additional observations are that the master quickly recovers from this deadlock once I stop the write activity on it. It also recovers if I stop the second semi-sync slave. The first slave immediately connects to the master and starts replicating. When I bring back the second slave, it runs into deadlock issue though. I will try and collect the stack trace for this condition and share soon. Thanks again.
[5 Jun 2019 3:30]
Prasad N
stack dump for master stuck in Waiting to finalize termination state
Attachment: master_stack_dump.log (application/octet-stream, text), 335.13 KiB.
[5 Jun 2019 3:33]
Prasad N
Hi Frederic - I have attached the stack dump for master when it is stuck in state 'Waiting to finalize termination': master_stack_dump.log Could you please take a look ? Thanks!

Description: master and 2 slaves setup in semi synchronous replication mode. There is minimal data load on the master with an application with 5 threads inserting values into a test table every 100 ms. After some time, the dump thread for one of the slaves enters 'Waiting to finalize termination' state and gets stuck. On the other side, the slave stops receiving any more data from the master and keeps retrying the connection after time out of 60 s. { slave_net_timeout=60] there are new dump threads getting created on the master for the slave, that are themselves stuck in 'Starting' state. This continues forever. How to repeat: master and 2 slaves setup with the following my.cnf. Only server_id differs on each node. [mysqld] port=3306 skip_name_resolve=1 bind_address=0.0.0.0 user=mysql pid_file=/var/run/mysqld/mysqld.pid socket=/var/lib/mysql/mysql.sock server_id=112 require_secure_transport=OFF log_bin=mysql-bin expire_logs_days=21 sync_binlog=1 innodb_log_files_in_group=2 innodb_flush_log_at_trx_commit=1 max_connect_errors=1000000 max_allowed_packet=16777216 max_heap_table_size=33554432 max_connections=60 max_user_connections=48 thread_cache_size=50 open_files_limit=65535 table_open_cache=2048 table_definition_cache=2048 relay_log=relay-log-slave gtid_mode=ON enforce_gtid_consistency=ON binlog_format=MIXED log_slave_updates=true slave_net_timeout=360 master_info_repository=TABLE relay_log_info_repository=TABLE sync_master_info=10000 sync_relay_log=10000 relay_log_recovery=1 slave_parallel_workers=30 slave_preserve_commit_order=1 slave_parallel_type=LOGICAL_CLOCK rpl_semi_sync_master_timeout=300000 relay_log_space_limit=0 max_relay_log_size=0 max_binlog_size=1073741824 datadir=/mysql_data/mysql general_log=ON general_log_file=/var/log/mysql/general.log log_error=/var/log/mysql/mysqld.log default_storage_engine=innodb innodb_flush_method=O_DIRECT innodb_file_per_table=ON innodb_log_file_size=134217728 innodb_buffer_pool_size=134217728 innodb_io_capacity=200 innodb_adaptive_hash_index=ON innodb_lock_wait_timeout=50 log_queries_not_using_indexes=OFF log_slow_admin_statements=OFF log_throttle_queries_not_using_indexes=0 long_query_time=10 slow_query_log=ON slow_query_log_file=/var/log/mysql/mysql-slowquery.log symbolic_links=0 interactive_timeout=28800 div_precision_increment=4 sql_mode="ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION" event_scheduler=OFF have an application with 5 writer threads inserting simple data into a test table every 100 ms. Hitting this issue very soon. Suggested fix: master should be continuously able to replicate to both the slaves.