Description:
Hi,
when the semi-sync acks from replicas are lost (this can happen in a network partition "at the wrong moment"), a transaction will block on the primary in the state "Waiting for semi-sync ACK from slave" (which is expected). When the replicas reconnect, the transaction stays blocked (which is not expected). If we are "lucky", this will self-resolve; but combined with Bug#104012 (FLUSH BINARY LOGS blocks when a client is waiting for semi-sync ACK), this might cause a pile-up. See "How to repeat" for details.
My guess for why things stay blocked even after the reconnection is that the replicas do not ack the blocked transaction because they already received it before the partition, leaving the transaction blocked on the primary. More details in "How to repeat".
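A way to observe this (an optional check, not part of the original repro): on the primary, the semi-sync status counters show how many sessions are stuck waiting for an ack:
./m <<< "show global status like 'Rpl_semi_sync_master_wait_sessions'"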
Obviously, having a network partition at exactly the wrong moment is unlikely for a single replica-set. But when running hundreds of replica-sets spread over "network zones", a network partition will likely hit "the wrong moment" for some of the replica-sets. This is what happened to me recently, and some (more than one, fewer than ten) of my replica-sets ended up in a stalled situation (which I suspect is caused by this bug).
The way to unblock things when no FLUSH BINARY LOGS has blocked everything is to run a new transaction on the primary: the replica ack for this new transaction unblocks both the previously blocked and the new transactions.
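For example, any small write transaction should do; this is the exact statement used in "How to repeat":
./m <<< "insert into test_jfg.t(v) values(0)"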
The way to unblock things when a FLUSH BINARY LOGS has blocked everything is to disable semi-sync (like for Bug#104012).
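In "How to repeat", this is done as follows:
./m <<< "set global rpl_semi_sync_master_enabled = off"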
By itself, this bug would be S3 (Non-critical), but combined with Bug#104012, it becomes more problematic on non-busy systems where a FLUSH BINARY LOGS is run before another transaction, hence I am setting this as S2 (more than one of my systems ended up in this state after a network problem).
Many thanks for looking into this,
Jean-François Gagné
How to repeat:
Note that the instructions below are for 8.0.25, but the same happens with 5.7.34.
# Create a replication sandbox.
dbdeployer deploy replication mysql_8.0.25 --gtid --semi-sync
# Make the semi-sync timeout very large.
./m <<< "set global rpl_semi_sync_master_timeout = $((24*60*60*100))"
# Create a test schema and table.
./m <<< "create database test_jfg; create table test_jfg.t(id int not null auto_increment primary key, v int)"
# Take a GTID snapshot of the three nodes.
sql="select @@global.gtid_executed"; ./m -N <<< "$sql"; ./s1 -N <<< "$sql"; ./s2 -N <<< "$sql"
00021526-1111-1111-1111-111111111111:1-35
00021526-1111-1111-1111-111111111111:1-35
00021526-1111-1111-1111-111111111111:1-35
# Now we will generate a situation where we lose the semi-sync acks of a transaction that reached the replicas. This is a little complicated.
# Without disabling heartbeats, I am not able to reproduce this. I suspect heartbeats exercise a different
# code path, but I still think this setup is close to having a network partition "at the wrong moment".
sql="stop slave; change master to MASTER_HEARTBEAT_PERIOD = 0; start slave"; ./s1 <<< "$sql"; ./s2 <<< "$sql"
# Now prevent network packets from the replicas from reaching the primary while still allowing packets from the primary to reach the replicas.
iptables -A INPUT -p tcp --dport $(./m -N <<< "select @@port") -j DROP
# Run a transaction that will block (done in the background to not block the shell).
./m <<< "insert into test_jfg.t(v) values(0)" &
# Show that the transaction is blocked.
./m <<< "show processlist"
Id User Host db Command Time State Info
5 event_scheduler localhost NULL Daemon 249 Waiting on empty queue NULL
18 rsandbox localhost:38764 NULL Binlog Dump GTID 65 Master has sent all binlog to slave; waiting for more updates NULL
19 rsandbox localhost:38766 NULL Binlog Dump GTID 65 Master has sent all binlog to slave; waiting for more updates NULL
21 msandbox localhost NULL Query 38 Waiting for semi-sync ACK from slave insert into test_jfg.t(v) values(0)
23 msandbox localhost NULL Query 0 init show processlist
# Take a new GTID snapshot of the three nodes, confirming that the transaction reached the replicas but that it did not fully commit on the primary.
sql="select @@global.gtid_executed"; ./m -N <<< "$sql"; ./s1 -N <<< "$sql"; ./s2 -N <<< "$sql"
00021526-1111-1111-1111-111111111111:1-35
00021526-1111-1111-1111-111111111111:1-36
00021526-1111-1111-1111-111111111111:1-36
# Wait for both replicas to start trying to reconnect (after receiving nothing for slave_net_timeout seconds, they will reconnect).
while true; do ./s1 <<< "show slave status\G" | grep -q "Slave_IO_Running: Connecting" && break; sleep 1; done
while true; do ./s2 <<< "show slave status\G" | grep -q "Slave_IO_Running: Connecting" && break; sleep 1; done
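# For reference (optional check, not in the original repro): the reconnection above is driven by
# slave_net_timeout, which defaults to 60 seconds in both 5.7 and 8.0.
sql="select @@global.slave_net_timeout"; ./s1 -N <<< "$sql"; ./s2 -N <<< "$sql"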
# Allow network traffic from the replicas back to the primary. Without a sleep, the stall does not happen.
# I suspect the delay is needed to make sure the TCP buffers are fully drained.
sleep 120; iptables -F INPUT
# Wait for both replicas to reconnect.
while true; do ./s1 <<< "show slave status\G" | grep -q "Slave_IO_Running: Yes" && break; sleep 1; done
while true; do ./s2 <<< "show slave status\G" | grep -q "Slave_IO_Running: Yes" && break; sleep 1; done
# Even though both replicas have reconnected, the transaction is still waiting for a semi-sync ack.
./m <<< "show processlist"
Id User Host db Command Time State Info
5 event_scheduler localhost NULL Daemon 675 Waiting on empty queue NULL
21 msandbox localhost NULL Query 464 Waiting for semi-sync ACK from slave insert into test_jfg.t(v) values(0)
27 rsandbox localhost:38832 NULL Binlog Dump GTID 11 Master has sent all binlog to slave; waiting for more updates NULL
28 rsandbox localhost:38834 NULL Binlog Dump GTID 11 Master has sent all binlog to slave; waiting for more updates NULL
29 msandbox localhost NULL Query 0 init show processlist
# Here, we can unblock everything by running another transaction.
./m <<< "insert into test_jfg.t(v) values(0)"
# But if, instead of running the transaction above, we run a FLUSH BINARY LOGS (Bug#104012), it blocks (done in the background to not block the shell).
./m <<< "flush binary logs" &
# And we can confirm that things are blocked by listing sessions.
./m <<< "show processlist"
Id User Host db Command Time State Info
5 event_scheduler localhost NULL Daemon 467 Waiting on empty queue NULL
21 msandbox localhost NULL Query 230 Waiting for semi-sync ACK from slave insert into test_jfg.t(v) values(0)
24 rsandbox localhost:38880 NULL Binlog Dump GTID 25 Master has sent all binlog to slave; waiting for more updates NULL
25 rsandbox localhost:38882 NULL Binlog Dump GTID 24 Master has sent all binlog to slave; waiting for more updates NULL
27 msandbox localhost NULL Query 4 starting flush binary logs
28 msandbox localhost NULL Query 0 init show processlist
# And now, running a transaction will not unblock things; it even adds to the pile-up (done in the background to not block the shell).
./m <<< "insert into test_jfg.t(v) values(0)" &
# And we can confirm that things are blocked by listing sessions.
./m <<< "show processlist"
Id User Host db Command Time State Info
5 event_scheduler localhost NULL Daemon 487 Waiting on empty queue NULL
21 msandbox localhost NULL Query 250 Waiting for semi-sync ACK from slave insert into test_jfg.t(v) values(0)
24 rsandbox localhost:38880 NULL Binlog Dump GTID 45 Master has sent all binlog to slave; waiting for more updates NULL
25 rsandbox localhost:38882 NULL Binlog Dump GTID 44 Master has sent all binlog to slave; waiting for more updates NULL
27 msandbox localhost NULL Query 24 starting flush binary logs
29 msandbox localhost NULL Query 7 waiting for handler commit insert into test_jfg.t(v) values(0)
30 msandbox localhost NULL Query 0 init show processlist
# A way to unblock things is to disable semi-sync:
./m <<< "set global rpl_semi_sync_master_enabled = off"; sleep 5; ./m <<< "show processlist"
[1] Done ./m <<< "insert into test_jfg.t(v) values(0)"
[2]- Done ./m <<< "flush binary logs"
[3]+ Done ./m <<< "insert into test_jfg.t(v) values(0)"
Id User Host db Command Time State Info
5 event_scheduler localhost NULL Daemon 592 Waiting on empty queue NULL
33 rsandbox localhost:38892 NULL Binlog Dump GTID 29 Master has sent all binlog to slave; waiting for more updates NULL
34 rsandbox localhost:38894 NULL Binlog Dump GTID 29 Master has sent all binlog to slave; waiting for more updates NULL
36 msandbox localhost NULL Query 0 init show processlist
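# Once things are unblocked, semi-sync can be re-enabled (not shown in the original output above):
./m <<< "set global rpl_semi_sync_master_enabled = on"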
Suggested fix:
A reconnecting replica should implicitly ack everything up to the position at which it reconnects.