MySQL Bugs: #104015: Loosing semi-sync acks could stall the primary.

Bug #104015	Loosing semi-sync acks could stall the primary.
Submitted:	15 Jun 2021 0:31	Modified:	21 Jun 2021 8:08
Reporter:	Jean-François Gagné	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	8.0.25, 5.7.34	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

when the semi-sync acks from replicas are lost (this could happen in a network partition "at the wrong moment"), transactions will block on the primary in the state "Waiting for semi-sync ACK from slave" (which is expected).  When the replicas reconnect, the transaction will stay blocked (which is not expected).  If we are "lucky", this will self-resolve; but combined with Bug#104012 (FLUSH BINARY LOGS blocks when a client is waiting for semi-sync ACK), this might cause a pile-up.  See "How to repeat" for details.

My guess for explaining why things stay blocked even after the reconnection is that the replicas are not acking the blocked transaction because they already had it from before, leaving the transaction blocked on the primary.  More details in "How to repeat".

Obviously, having a network partition exactly at the wrong moment is unlikely for a single replica-set.  But when running hundreds of replica-set spread over "network zones", a network partition will likely be at the wrong moment for some of the replica-sets.  This is what happened to me recently, and some (more than one, less than ten) of my replica-sets ended-up in a stalled situation (which I suspect is caused by this bug).

The way to unblock things when no FLUSH BINARY LOGS has blocked everything is to run a transaction on the primary, and the replica ack will unblock both previously blocked and new transactions.

The way to unblock things when a FLUSH BINARY LOGS has blocked everything is to disable semi-sync (like for Bug#104012).

By itself, this bug would be S3 (Non-critical), but combined with Bug#104012, it becomes more problematic on non busy systems where a FLUSH BINARY LOGS is run before another transaction, hence setting this as S2 (more than one of my systems ended-up in this state after a network problem).

Many thanks for looking into this,

Jean-François Gagné

How to repeat:
Note that the instructions below are for 8.0.25, but the same will happen for 5.7.34.

# Create a replication sandbox.
dbdeployer deploy replication mysql_8.0.25 --gtid --semi-sync

# Make the semi-sync timeout very large.
./m <<< "set global rpl_semi_sync_master_timeout = $((24*60*60*100))"

# Create a test schema and table.
./m <<< "create database test_jfg; create table test_jfg.t(id int not null auto_increment primary key, v int)"

# Take a GTID snapshot of the three nodes.
sql="select @@global.gtid_executed"; ./m -N <<< "$sql"; ./s1 -N <<< "$sql"; ./s2 -N <<< "$sql"
00021526-1111-1111-1111-111111111111:1-35
00021526-1111-1111-1111-111111111111:1-35
00021526-1111-1111-1111-111111111111:1-35

# Now we will generate a situation where we loose the semi-sync acks of a transaction reaching the replica.  This is a little complicated.

# Without disabling heartbeats, I am not able to reproduce.  I suspect they generate a different
#   code path, but I still think this is close to having a network partition "at the wrong moment".
sql="stop slave; change master to MASTER_HEARTBEAT_PERIOD = 0; start slave"; ./s1 <<< "$sql"; ./s2 <<< "$sql"

# Now prevent network packets from the replica to reaching the primary with still allowing packets from the primary to reaching the replicas.
iptables -A INPUT -p tcp --dport $(./m -N <<< "select @@port") -j DROP

# Run a transaction that will block (done in the background to not block the shell).
./m <<< "insert into test_jfg.t(v) values(0)" &

# Show that the transaction is blocked.
./m <<< "show processlist"
Id      User    Host    db      Command Time    State   Info
5       event_scheduler localhost       NULL    Daemon  249     Waiting on empty queue  NULL
18      rsandbox        localhost:38764 NULL    Binlog Dump GTID        65      Master has sent all binlog to slave; waiting for more updates   NULL
19      rsandbox        localhost:38766 NULL    Binlog Dump GTID        65      Master has sent all binlog to slave; waiting for more updates   NULL
21      msandbox        localhost       NULL    Query   38      Waiting for semi-sync ACK from slave    insert into test_jfg.t(v) values(0)
23      msandbox        localhost       NULL    Query   0       init    show processlist

# Take a new GTID snapshot of the three nodes, confirming that the transaction reached the replicas but that it did not fully commit on the primary.
sql="select @@global.gtid_executed"; ./m -N <<< "$sql"; ./s1 -N <<< "$sql"; ./s2 -N <<< "$sql"
00021526-1111-1111-1111-111111111111:1-35
00021526-1111-1111-1111-111111111111:1-36
00021526-1111-1111-1111-111111111111:1-36

# Waiting for both replicas to try reconnecting (because not receiving anything for slave_net_timeout, they will reconnect).
while true; do ./s1 <<< "show slave status\G" | grep -q "Slave_IO_Running: Connecting" && break; sleep 1; done
while true; do ./s2 <<< "show slave status\G" | grep -q "Slave_IO_Running: Connecting" && break; sleep 1; done

# Allow back network traffic from replicas to the primary.  Without a sleep, the stall will not happen.
#   I suspect this is needed to make sure TCP buffer gets fully flushed.
sleep 120; iptables -F INPUT

# Waiting for both replicas to reconnect.
while true; do ./s1 <<< "show slave status\G" | grep -q "Slave_IO_Running: Yes" && break; sleep 1; done
while true; do ./s2 <<< "show slave status\G" | grep -q "Slave_IO_Running: Yes" && break; sleep 1; done

# Even if both replicas reconnected, the transaction is still waiting for a semi-sync ack.
./m <<< "show processlist"
Id      User    Host    db      Command Time    State   Info
5       event_scheduler localhost       NULL    Daemon  675     Waiting on empty queue  NULL
21      msandbox        localhost       NULL    Query   464     Waiting for semi-sync ACK from slave    insert into test_jfg.t(v) values(0)
27      rsandbox        localhost:38832 NULL    Binlog Dump GTID        11      Master has sent all binlog to slave; waiting for more updates   NULL
28      rsandbox        localhost:38834 NULL    Binlog Dump GTID        11      Master has sent all binlog to slave; waiting for more updates   NULL
29      msandbox        localhost       NULL    Query   0       init    show processlist

# Here, we can unblock everything by running another transaction.
./m <<< "insert into test_jfg.t(v) values(0)"

# But if we do not run the transaction above, and run a FLUSH BINARY LOGS instead (Bug#104012), it will block (done in the background to not block the shell).
./m <<< "flush binary logs" &

# And we can confirm that things are blocked by listing sessions.
./m <<< "show processlist"
Id      User    Host    db      Command Time    State   Info
5       event_scheduler localhost       NULL    Daemon  467     Waiting on empty queue  NULL
21      msandbox        localhost       NULL    Query   230     Waiting for semi-sync ACK from slave    insert into test_jfg.t(v) values(0)
24      rsandbox        localhost:38880 NULL    Binlog Dump GTID        25      Master has sent all binlog to slave; waiting for more updates   NULL
25      rsandbox        localhost:38882 NULL    Binlog Dump GTID        24      Master has sent all binlog to slave; waiting for more updates   NULL
27      msandbox        localhost       NULL    Query   4       starting        flush binary logs
28      msandbox        localhost       NULL    Query   0       init    show processlist

# And now running a transaction will not unblock things and it will even increase the pile-up (done in the background to not block the shell).
./m <<< "insert into test_jfg.t(v) values(0)" &

# And we can confirm that things are blocked by listing sessions.
./m <<< "show processlist"
Id      User    Host    db      Command Time    State   Info
5       event_scheduler localhost       NULL    Daemon  487     Waiting on empty queue  NULL
21      msandbox        localhost       NULL    Query   250     Waiting for semi-sync ACK from slave    insert into test_jfg.t(v) values(0)
24      rsandbox        localhost:38880 NULL    Binlog Dump GTID        45      Master has sent all binlog to slave; waiting for more updates   NULL
25      rsandbox        localhost:38882 NULL    Binlog Dump GTID        44      Master has sent all binlog to slave; waiting for more updates   NULL
27      msandbox        localhost       NULL    Query   24      starting        flush binary logs
29      msandbox        localhost       NULL    Query   7       waiting for handler commit      insert into test_jfg.t(v) values(0)
30      msandbox        localhost       NULL    Query   0       init    show processlist

# A way to unblock things is to disable semi-sync:
./m <<< "set global rpl_semi_sync_master_enabled = off"; sleep 5; ./m <<< "show processlist"
[1]   Done                    ./m <<< "insert into test_jfg.t(v) values(0)"
[2]-  Done                    ./m <<< "flush binary logs"
[3]+  Done                    ./m <<< "insert into test_jfg.t(v) values(0)"
Id      User    Host    db      Command Time    State   Info
5       event_scheduler localhost       NULL    Daemon  592     Waiting on empty queue  NULL
33      rsandbox        localhost:38892 NULL    Binlog Dump GTID        29      Master has sent all binlog to slave; waiting for more updates   NULL
34      rsandbox        localhost:38894 NULL    Binlog Dump GTID        29      Master has sent all binlog to slave; waiting for more updates   NULL
36      msandbox        localhost       NULL    Query   0       init    show processlist

Suggested fix:
A reconnecting replica should implicitly ack everything before the position at which it is connecting.

Hello Jean-François,

Thank you for the report and test case.
Verified as described with 8.0.25 build.

regards,
Umesh

We've also run into this with Vitess when using semi-sync.

I *think* the general problem is that the source has a TCP socket open with the replica. It writes the ACK request message to the socket, then polls waiting the replica to write its response back to the socket. The replica, however, never read the ACK request message because it was restarting or had otherwise closed its end of the TCP socket. So when the replica re-connects it doesn't have the ACK request message to respond to, and the source is forever waiting for it. 

If that's all correct, then a potential solution is to have the semi-sync plugin on the source side write any sort of heartbeat/keep-alive packet on the TCP socket every N seconds that it's waiting for the ACK. It would then see that the other side of the socket is gone and could react accordingly.

Likely the same issue reported here: https://bugs.mysql.com/bug.php?id=101056