Bug #107008 Contribution by Tencent:Shutdown hang due to work thread wait for ack
Submitted: 13 Apr 2022 11:44 Modified: 5 May 2022 7:11
Reporter: Xiaodong Huang (OCA) Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.37 OS:Any
Assigned to: CPU Architecture:Any
Tags: Contribution

[13 Apr 2022 11:44] Xiaodong Huang
Description:
After opening semi-sync, if rpl_semi_sync_master_wait_for_slave_count > 1 , rpl_semi_sync_master_wait_no_slave= on and rpl_semi_sync_master_timeout is set very long. Executing “shutdown” command will hang when Rpl_semi_sync_master_clients < rpl_semi_sync_master_wait_for_slave_count  - 1.

How to repeat:
We can reproduce the problem in 5.7.37 with following testcase.

The master opt file:

“

$SEMISYNC_PLUGIN_OPT
--force-restart

“

The slave opt file:

“

$SEMISYNC_PLUGIN_OPT

“

The cnf file:

“

!include ../my.cnf

[mysqld.1]
log-slave-updates
server-id= 1

[mysqld.2]
log-slave-updates
server-id= 2
slave_net_timeout=8

[mysqld.3]
log-slave-updates
server-id= 3
slave_net_timeout=8

[ENV]
SERVER_MYPORT_3=    @mysqld.3.port
SERVER_MYSOCK_3=    @mysqld.3.socket

”

The test file:

“

--source include/not_valgrind.inc
--source include/not_group_replication_plugin.inc
--source include/have_innodb.inc
--source include/have_binlog_format_row.inc

--echo #
--echo # prepare
--echo #

# server_1 is master, others are slaves.
--let $rpl_topology= 1->2, 1->3
--source include/rpl_init.inc

--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
# Suppress warning:
# "Semi-sync master failed on net_flush() before waiting for slave reply"
CALL mtr.add_suppression("Semi-sync master failed on net_flush().*");
CALL mtr.add_suppression('Timeout waiting for reply of binlog');
CALL mtr.add_suppression('SEMISYNC: Forced shutdown. Some updates might not be replicated');
CALL mtr.add_suppression('of the transaction may not be synchronized to slave when server shutdown');

--echo # Enable semisync on master
--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
--let $semisync_master_enabled= ON
--source include/install_semisync_master.inc
SET GLOBAL rpl_semi_sync_master_enabled= ON;

--echo # Change master variables
SET GLOBAL rpl_semi_sync_master_timeout= 20000000;
SET GLOBAL rpl_semi_sync_master_wait_no_slave= ON;
SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 2;
SET GLOBAL rpl_semi_sync_master_trace_level = 255;
--let $MYSQL_SOCKET= `SELECT @@socket`
--let $MYSQL_PORT= `SELECT @@port`

--echo # Start slave : server_2
--let $rpl_connection_name= server_2
--source include/rpl_connection.inc
--source include/install_semisync_slave.inc
SET GLOBAL rpl_semi_sync_slave_trace_level = 255;
show variables like "slave_net_timeout";

--echo # Start slave : server_3
--let $rpl_connection_name= server_3
--source include/rpl_connection.inc
--source include/install_semisync_slave.inc
SET GLOBAL rpl_semi_sync_slave_trace_level = 255;
show variables like "slave_net_timeout";

--let $rpl_connection_name= server_1
--source include/rpl_connection.inc

--source include/rpl_sync.inc
--echo # Verify ack_receiver thread is created
--let $assert_text= ack receiver thread is created;
--let $assert_cond= count(*) = 1 FROM performance_schema.threads WHERE name LIKE "%Ack_receiver"
--source include/assert.inc

CREATE TABLE t1(c1 INT);
--source include/rpl_sync.inc

--echo # Verify semisync replication works well
--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
--source include/assert_semisync_master_status_on.inc

--echo #
--echo # Run
--echo #

--echo # shutdown slave : server_2
--let $rpl_connection_name= server_2
--source include/rpl_connection.inc
--source include/stop_slave_io.inc
--source include/wait_for_slave_io_to_stop.inc
--echo # Server2 io_thread is down

--echo # shutdown slave : server_3
--let $rpl_connection_name= server_3
--source include/rpl_connection.inc
--source include/stop_slave_io.inc
--source include/wait_for_slave_io_to_stop.inc
--echo # Server3 io_thread is down

--echo #
--echo # sleep 6s wait master send heartbeat 
--sleep 6

--echo # shutdown slave : server_1
--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
--send insert into t1 values(1)
--sleep 1
--exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
--exec $MYSQLADMIN -uroot -S $MYSQL_SOCKET -P $MYSQL_PORT shutdown 2>&1
--echo # Server1 is down

--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
--replace_regex /end_pos: [0-9]*\) /end_pos: <pos>) /  
--error 2013
--reap

--echo
--echo # restart server 1.
--let $rpl_server_number=1
--source include/rpl_start_server.inc
--let $rpl_connection_name= server_1
--source include/rpl_connection.inc

--echo # restart server 2 io_thread.
--let $rpl_connection_name= server_2
--source include/rpl_connection.inc
--source include/start_slave_io.inc

--echo # restart server 3 io_thread.
--let $rpl_connection_name= server_3
--source include/rpl_connection.inc
--source include/start_slave_io.inc

--echo #
--echo # Cleanup
--echo #
--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
DROP TABLE t1;
--source include/rpl_sync.inc

--let $rpl_connection_name= server_2
--source include/rpl_connection.inc
--source include/uninstall_semisync_slave.inc

--let $rpl_connection_name= server_3
--source include/rpl_connection.inc
--source include/uninstall_semisync_slave.inc

--let $rpl_connection_name= server_1
--source include/rpl_connection.inc
--source include/uninstall_semisync_master.inc
--source include/rpl_end.inc

”

The reproduce step is as following:

1. Start a master with two slave with opening semi sync. set global slave_net_timeout=8 in slaves.
2. Execute "SET GLOBAL rpl_semi_sync_master_timeout= 20000000; SET GLOBAL rpl_semi_sync_master_wait_no_slave= ON; SET GLOBAL rpl_semi_sync_master_wait_for_slave_count = 2; SET GLOBAL rpl_semi_sync_master_trace_level = 255; “ in master.
3. Execute "CREATE TABLE t1(c1 INT);” in master to check the semi sync is work as expect.
4. Execute "stop slave io_thread; “  in both slaves;
5. Execute “insert into t1 values(1)” in master;
6. Execute “shutdown” ;
7. After 20 second, we will see the master process is still alive.

Suggested fix:
The bugfix patch is as follow:
[13 Apr 2022 11:46] Xiaodong Huang
bugfix patch

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: semisync_hang_bugfix.patch (application/octet-stream, text), 822 bytes.

[13 Apr 2022 11:56] MySQL Verification Team
Hello Xiaodong,

Thank you for the report and contribution.

regards,
Umesh
[5 May 2022 7:11] Xiaodong Huang
Fix typos and add title prefix "Contribution by Tencent"