MySQL Bugs: #103189: A cleanly stopped replica still shows on the primary in SHOW SLAVE HOSTS.

Bug #103189	A cleanly stopped replica still shows on the primary in SHOW SLAVE HOSTS.
Submitted:	1 Apr 2021 20:27	Modified:	8 Apr 2021 15:01
Reporter:	Jean-François Gagné
Status:	Verified
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	8.0.23, 5.7.33	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

After cleanly stopping replication, the stopped replica still shows on the primary in SHOW SLAVE HOSTS (and in SHOW PROCESSLIST).  I could understand a zombie replica staying in SHOW SLAVE HOSTS if it is stopped in an unclean way (kill -9), but I expect a cleanly stopped replica to disappear from SHOW SLAVE HOSTS immediately (as shown in How to repeat, it only disappears more than 30 seconds later).

Moreover, if the replica is restarted before it disappears from SHOW SLAVE HOSTS, MySQL 5.7 writes the message below to the error log (it does not appear in 8.0, and I guess this is due to a change in default logging).

2021-04-01T19:57:33.928265Z 51 [Note] While initializing dump thread for slave with UUID <00020035-2222-2222-2222-222222222222>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(50).

This only happens if there is no activity on the primary.  If transactions are being committed on the primary, the stopped replica will disappear from SHOW SLAVE HOSTS (I guess this is because the primary then handles a broken pipe).

I flagged this bug as S2 / Serious.  This is not me being capricious on April 1st: this behavior is the source of real problems.  As an example, tools that use SHOW SLAVE HOSTS to discover replicas will think, for a few seconds after a replica is decommissioned, that there is a failed replica in the replication tree.  Combined with a Kubernetes environment where IP addresses are aggressively reused, such a tool will "find" a replica that is inconsistent, which causes noise and problems.

(I understand that a solution for that could be to not re-use IPs aggressively, and that tooling and automation should be resilient to this condition in the case of a replica being stopped in an unclean way, but it would be good to have this bug fixed for the 99% of the cases where noise could be avoided.)
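
One way discovery tooling could tolerate a stale entry is to confirm each listed host before trusting it, by comparing the advertised Slave_UUID with the server_uuid that the host at that address actually reports. The following is only a sketch under assumptions: it expects the mysql client's table output (-t) as used in How to repeat, the function names parse_slave_hosts and verified_replicas are hypothetical, and the command that fetches the remote UUID is supplied by the caller.

```shell
# Hypothetical helper: turn `mysql -t` output of SHOW SLAVE HOSTS (on stdin)
# into one "server_id host port uuid" line per listed replica.
parse_slave_hosts() {
  awk -F'|' '
    # Data rows start with "|" and carry a numeric Server_id in field 2;
    # this skips the +---+ border lines and the header row.
    /^\|/ && $2 ~ /^ *[0-9]+ *$/ {
      for (i = 2; i <= 6; i++) gsub(/^ +| +$/, "", $i)
      print $2, $3, $4, $6
    }'
}

# Hypothetical check: keep only the entries whose advertised UUID matches
# what the host at that address reports. "$1" names a caller-supplied
# command that prints the server_uuid for a given host and port.
verified_replicas() {
  lookup=$1
  while read -r id host port uuid; do
    # Redirect stdin so the lookup command cannot swallow the remaining rows.
    actual=$("$lookup" "$host" "$port" </dev/null 2>/dev/null)
    if [ "$actual" = "$uuid" ]; then
      echo "$id $host $port $uuid"
    fi
  done
}
```

A caller would pipe the SHOW SLAVE HOSTS table through parse_slave_hosts and then through verified_replicas with its own lookup command, for example a wrapper around the mysql client that selects @@server_uuid on the given host and port. Entries that fail the comparison are dropped silently here; a real tool would more likely log them.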

Many thanks for looking into this,

Jean-François Gagné

How to repeat:
$ dbdeployer deploy replication mysql_8.0.23

$ date; ./m -t <<< "show slave hosts"; ./s1 <<< "stop slave"; sleep 1; ./m -t <<< "show slave hosts"
Thu Apr  1 19:48:34 UTC 2021
+-----------+--------+-------+-----------+--------------------------------------+
| Server_id | Host   | Port  | Master_id | Slave_UUID                           |
+-----------+--------+-------+-----------+--------------------------------------+
|       200 | node-2 | 21325 |       100 | 00021325-2222-2222-2222-222222222222 |
|       300 | node-3 | 21326 |       100 | 00021326-3333-3333-3333-333333333333 |
+-----------+--------+-------+-----------+--------------------------------------+
+-----------+--------+-------+-----------+--------------------------------------+
| Server_id | Host   | Port  | Master_id | Slave_UUID                           |
+-----------+--------+-------+-----------+--------------------------------------+
|       200 | node-2 | 21325 |       100 | 00021325-2222-2222-2222-222222222222 |
|       300 | node-3 | 21326 |       100 | 00021326-3333-3333-3333-333333333333 |
+-----------+--------+-------+-----------+--------------------------------------+

$ while test $(./m -t <<< "show slave hosts" | wc -l) -eq 6; do sleep 1; done; date
Thu Apr  1 19:49:24 UTC 2021
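
The polling loop above relies on the table having exactly six lines, which only holds with two registered replicas. A variant keyed on the replica's UUID, with a timeout, might look like the sketch below. The function name wait_until_gone is hypothetical, and the query command is passed in by the caller rather than hard-coded.

```shell
# Hypothetical helper: poll once per second until the given UUID no longer
# appears in the output of the query command, or until the timeout (in
# seconds) is reached. Prints the number of seconds waited on success and
# returns 1 on timeout.
wait_until_gone() {
  uuid=$1
  timeout=$2
  shift 2            # the remaining arguments are the query command
  waited=0
  while "$@" | grep -qF -- "$uuid"; do
    if [ "$waited" -ge "$timeout" ]; then
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  echo "$waited"
}
```

With the sandbox above, a call could be `wait_until_gone 00021325-2222-2222-2222-222222222222 120 ./m -t -e "show slave hosts"`, assuming the ./m wrapper forwards its arguments to the mysql client.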

---

$ dbdeployer deploy replication mysql_5.7.33

$ date; ./m -t <<< "show slave hosts"; ./s1 <<< "stop slave"; sleep 1; ./m -t <<< "show slave hosts"
Thu Apr  1 19:54:55 UTC 2021
+-----------+--------+-------+-----------+--------------------------------------+
| Server_id | Host   | Port  | Master_id | Slave_UUID                           |
+-----------+--------+-------+-----------+--------------------------------------+
|       300 | node-3 | 20036 |       100 | 00020036-3333-3333-3333-333333333333 |
|       200 | node-2 | 20035 |       100 | 00020035-2222-2222-2222-222222222222 |
+-----------+--------+-------+-----------+--------------------------------------+
+-----------+--------+-------+-----------+--------------------------------------+
| Server_id | Host   | Port  | Master_id | Slave_UUID                           |
+-----------+--------+-------+-----------+--------------------------------------+
|       300 | node-3 | 20036 |       100 | 00020036-3333-3333-3333-333333333333 |
|       200 | node-2 | 20035 |       100 | 00020035-2222-2222-2222-222222222222 |
+-----------+--------+-------+-----------+--------------------------------------+

$ while test $(./m -t <<< "show slave hosts" | wc -l) -eq 6; do sleep 1; done; date
Thu Apr  1 19:55:39 UTC 2021
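
The two date outputs in each run bracket how long the stale entry stayed visible. The arithmetic can be sketched as follows, assuming both timestamps fall on the same day; hms_to_secs and elapsed are hypothetical helper names. The result has one-second granularity because of the polling interval, and the start time is taken just before STOP SLAVE is issued.

```shell
# Hypothetical helper: convert HH:MM:SS to seconds since midnight.
# awk parses each field as a decimal number, so "08" is not read as octal.
hms_to_secs() {
  echo "$1" | awk -F: '{ print $1 * 3600 + $2 * 60 + $3 }'
}

# Hypothetical helper: seconds between two same-day timestamps.
elapsed() {
  echo $(( $(hms_to_secs "$2") - $(hms_to_secs "$1") ))
}

elapsed 19:48:34 19:49:24   # 8.0.23 run: prints 50
elapsed 19:54:55 19:55:39   # 5.7.33 run: prints 44
```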

$ date; ./s1 <<< "start slave; stop slave"; sleep 1; ./s1 <<< "start slave"
Thu Apr  1 19:57:32 UTC 2021

We have this in the logs:

2021-04-01T19:57:33.928265Z 51 [Note] While initializing dump thread for slave with UUID <00020035-2222-2222-2222-222222222222>, found a zombie dump thread with the same UUID. Master is killing the zombie dump thread(50).
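
To spot this condition in a 5.7 error log, the replica UUID and the id of the killed dump thread can be pulled out of the message. This is a sketch that assumes the exact message wording shown above; zombie_kills is a hypothetical function name, and other versions may word the message differently.

```shell
# Hypothetical helper: read error-log lines on stdin and print
# "replica_uuid killed_thread_id" for each zombie dump thread message.
zombie_kills() {
  sed -n 's/.*slave with UUID <\([^>]*\)>.*killing the zombie dump thread(\([0-9]*\)).*/\1 \2/p'
}
```

Lines that do not match the pattern produce no output, so the function can be fed the whole error log.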

Hi,

I reproduced the problem and I will verify it, but I am forwarding this to the replication team for the decision on how to move forward. I would not say this is a "bug", and especially not S2, but I do understand your POV, so let's see what the replication team says about it.

Thanks for the report
Bogdan