Bug #96033 Semisynchronuous replication over SSL and broken heartbeat.
Submitted: 29 Jun 2019 23:53 Modified: 7 Nov 2021 18:44
Reporter: Reio Remma (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.26 OS:CentOS
Assigned to: CPU Architecture:x86

[29 Jun 2019 23:53] Reio Remma
Description:
Master heartbeat doesn't work properly with semisynchronuous replication _with_ SSL and master to slave connections time out because of that.

It took me a week to narrow it down and eliminate network problems etc.

Observed with MySQL Community Server 5.7.26 from official MySQL repo on CentOS 7.

Error on the slave:

2019-06-29T23:18:55.225957Z 9606 [ERROR] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013)

Error on the master:

2019-06-29T23:18:55.225547Z 18428 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:18:55.225607Z 18428 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:18:55.225666Z 18428 [Note] Aborted connection 18428 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:18:55.570221Z 18434 [Note] Start binlog_dump to master_thread_id(18434) slave_server(2), pos(, 4)
2019-06-29T23:18:55.570279Z 18434 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)

As I understand the master is closing the connection when net_read_timeout is reached with no updates to the binary log in that time. MASTER_HEARTBEAT_PERIOD that is shorter than master net_read_timeout has no effect.

I observed heartbeats reaching the slave, but the connection is still closed when no binary log updates are transferred.

CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 10;

SELECT LAST_HEARTBEAT_TIMESTAMP FROM performance_schema.replication_connection_status;

2019-06-30 00:32:06
2019-06-30 00:32:16
2019-06-30 00:32:26

How to repeat:
Turn on semisynchronuous replication with SSL.

SET GLOBAL rpl_semi_sync_slave_enabled = ON;

CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 1;

CHANGE MASTER TO
       MASTER_SSL = 1,
       MASTER_SSL_CA = "/var/lib/mysql/ca.pem",
       MASTER_SSL_CERT = "/var/lib/mysql/client-cert.pem",
       MASTER_SSL_KEY = "/var/lib/mysql/client-key.pem",
       MASTER_SSL_VERIFY_SERVER_CERT = 1;

Set master net_read_timeout = 10 and a looping script that runs an update and sleeps for 10 seconds.

2019-06-29T23:06:35.532708Z 18168 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:06:35.532760Z 18168 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:06:35.532817Z 18168 [Note] Aborted connection 18168 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:06:35.837848Z 18173 [Note] Start binlog_dump to master_thread_id(18173) slave_server(2), pos(, 4)
2019-06-29T23:06:35.837940Z 18173 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:06:45.986086Z 18173 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:06:45.986144Z 18173 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:06:45.986201Z 18173 [Note] Aborted connection 18173 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:06:46.476858Z 18175 [Note] Start binlog_dump to master_thread_id(18175) slave_server(2), pos(, 4)
2019-06-29T23:06:46.477229Z 18175 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:06:56.668502Z 18175 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:06:56.668569Z 18175 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:06:56.668625Z 18175 [Note] Aborted connection 18175 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:06:56.979290Z 18178 [Note] Start binlog_dump to master_thread_id(18178) slave_server(2), pos(, 4)
2019-06-29T23:06:56.979352Z 18178 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)

---

If I add another slave to the mix with identical parameters, then they start to take turns - as if the timeout was multiplied by the number of slaves and/or updated on the wrong connection:

2019-06-29T23:13:33.271856Z 18339 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:13:33.271910Z 18339 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:13:33.271965Z 18339 [Note] Aborted connection 18339 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error)
2019-06-29T23:13:33.694214Z 18343 [Note] Start binlog_dump to master_thread_id(18343) slave_server(3), pos(, 4)
2019-06-29T23:13:33.694267Z 18343 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)
2019-06-29T23:13:43.295823Z 18341 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:13:43.295880Z 18341 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:13:43.295944Z 18341 [Note] Aborted connection 18341 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:13:43.625631Z 18345 [Note] Start binlog_dump to master_thread_id(18345) slave_server(2), pos(, 4)
2019-06-29T23:13:43.625695Z 18345 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:13:53.325681Z 18343 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:13:53.325736Z 18343 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:13:53.325794Z 18343 [Note] Aborted connection 18343 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error)
2019-06-29T23:13:53.721311Z 18349 [Note] Start binlog_dump to master_thread_id(18349) slave_server(3), pos(, 4)
2019-06-29T23:13:53.721370Z 18349 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)
2019-06-29T23:14:03.345837Z 18345 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:14:03.345895Z 18345 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:14:03.345951Z 18345 [Note] Aborted connection 18345 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:14:03.678669Z 18353 [Note] Start binlog_dump to master_thread_id(18353) slave_server(2), pos(, 4)
2019-06-29T23:14:03.678726Z 18353 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)

---

Another experiment. If I change net_read_timeout to a longer period for one slave connection, neither will time out.

2019-06-29T23:16:03.648443Z 18382 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:03.648503Z 18382 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:16:03.648564Z 18382 [Note] Aborted connection 18382 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:16:03.968342Z 18389 [Note] Start binlog_dump to master_thread_id(18389) slave_server(2), pos(, 4)
2019-06-29T23:16:03.968400Z 18389 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:16:13.677212Z 18384 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:13.677270Z 18384 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:16:13.677338Z 18384 [Note] Aborted connection 18384 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error)
2019-06-29T23:16:14.097320Z 18391 [Note] Start binlog_dump to master_thread_id(18391) slave_server(3), pos(, 4)
2019-06-29T23:16:14.097378Z 18391 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)

[SET GLOBAL net_read_timeout = 30;]

2019-06-29T23:16:23.698088Z 18389 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:23.698145Z 18389 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:16:23.698204Z 18389 [Note] Aborted connection 18389 to db: 'unconnected' user: 'replicant' host: 'backup.mrstuudio.ee' (Found net error)
2019-06-29T23:16:24.027270Z 18394 [Note] Start binlog_dump to master_thread_id(18394) slave_server(2), pos(, 4)
2019-06-29T23:16:24.027326Z 18394 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)

[SET GLOBAL net_read_timeout = 10;]

2019-06-29T23:16:33.728043Z 18391 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:33.728099Z 18391 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:16:33.728157Z 18391 [Note] Aborted connection 18391 to db: 'unconnected' user: 'replicant' host: 'aws.mrstuudio.ee' (Found net error)
2019-06-29T23:16:34.373232Z 18397 [Note] Start binlog_dump to master_thread_id(18397) slave_server(3), pos(, 4)
2019-06-29T23:16:34.373284Z 18397 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)

Now the errors stop for both connections.

Suggested fix:
Master heartbeat sent to slaves should restart the net_read_timeout counter on slave connections with SSL.
[30 Jun 2019 12:13] MySQL Verification Team
Hello Reio Remma,

Thank you for the report!

regards,
Umesh
[29 Oct 2019 11:33] Reio Remma
Hello! It seems that in v5.7.28 the issue is gone. I've been running over 24h with semisync without a single dropped connection.

I suspect the switch to OpenSSL in v5.7.28 has improved the situation.
[7 Nov 2021 18:44] Reio Remma
Fixed since v5.7.28