Description:
Master heartbeat doesn't work properly with semisynchronuous replication _with_ SSL and master to slave connections time out because of that.
It took me a week to narrow it down and eliminate network problems etc.
Observed with MySQL Community Server 5.7.26 from official MySQL repo on CentOS 7.
Error on the slave:
2019-06-29T23:18:55.225957Z 9606 [ERROR] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013)
Error on the master:
2019-06-29T23:18:55.225547Z 18428 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:18:55.225607Z 18428 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:18:55.225666Z 18428 [Note] Aborted connection 18428 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:18:55.570221Z 18434 [Note] Start binlog_dump to master_thread_id(18434) slave_server(2), pos(, 4)
2019-06-29T23:18:55.570279Z 18434 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
As I understand the master is closing the connection when net_read_timeout is reached with no updates to the binary log in that time. MASTER_HEARTBEAT_PERIOD that is shorter than master net_read_timeout has no effect.
I observed heartbeats reaching the slave, but the connection is still closed when no binary log updates are transferred.
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 10;
SELECT LAST_HEARTBEAT_TIMESTAMP FROM performance_schema.replication_connection_status;
2019-06-30 00:32:06
2019-06-30 00:32:16
2019-06-30 00:32:26
How to repeat:
Turn on semisynchronuous replication with SSL.
SET GLOBAL rpl_semi_sync_slave_enabled = ON;
CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 1;
CHANGE MASTER TO
MASTER_SSL = 1,
MASTER_SSL_CA = "/var/lib/mysql/ca.pem",
MASTER_SSL_CERT = "/var/lib/mysql/client-cert.pem",
MASTER_SSL_KEY = "/var/lib/mysql/client-key.pem",
MASTER_SSL_VERIFY_SERVER_CERT = 1;
Set master net_read_timeout = 10 and a looping script that runs an update and sleeps for 10 seconds.
2019-06-29T23:06:35.532708Z 18168 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:06:35.532760Z 18168 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:06:35.532817Z 18168 [Note] Aborted connection 18168 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:06:35.837848Z 18173 [Note] Start binlog_dump to master_thread_id(18173) slave_server(2), pos(, 4)
2019-06-29T23:06:35.837940Z 18173 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:06:45.986086Z 18173 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:06:45.986144Z 18173 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:06:45.986201Z 18173 [Note] Aborted connection 18173 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:06:46.476858Z 18175 [Note] Start binlog_dump to master_thread_id(18175) slave_server(2), pos(, 4)
2019-06-29T23:06:46.477229Z 18175 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:06:56.668502Z 18175 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:06:56.668569Z 18175 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:06:56.668625Z 18175 [Note] Aborted connection 18175 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:06:56.979290Z 18178 [Note] Start binlog_dump to master_thread_id(18178) slave_server(2), pos(, 4)
2019-06-29T23:06:56.979352Z 18178 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
---
If I add another slave to the mix with identical parameters, then they start to take turns - as if the timeout was multiplied by the number of slaves and/or updated on the wrong connection:
2019-06-29T23:13:33.271856Z 18339 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:13:33.271910Z 18339 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:13:33.271965Z 18339 [Note] Aborted connection 18339 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error)
2019-06-29T23:13:33.694214Z 18343 [Note] Start binlog_dump to master_thread_id(18343) slave_server(3), pos(, 4)
2019-06-29T23:13:33.694267Z 18343 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)
2019-06-29T23:13:43.295823Z 18341 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:13:43.295880Z 18341 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:13:43.295944Z 18341 [Note] Aborted connection 18341 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:13:43.625631Z 18345 [Note] Start binlog_dump to master_thread_id(18345) slave_server(2), pos(, 4)
2019-06-29T23:13:43.625695Z 18345 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:13:53.325681Z 18343 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:13:53.325736Z 18343 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:13:53.325794Z 18343 [Note] Aborted connection 18343 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error)
2019-06-29T23:13:53.721311Z 18349 [Note] Start binlog_dump to master_thread_id(18349) slave_server(3), pos(, 4)
2019-06-29T23:13:53.721370Z 18349 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)
2019-06-29T23:14:03.345837Z 18345 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:14:03.345895Z 18345 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:14:03.345951Z 18345 [Note] Aborted connection 18345 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:14:03.678669Z 18353 [Note] Start binlog_dump to master_thread_id(18353) slave_server(2), pos(, 4)
2019-06-29T23:14:03.678726Z 18353 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
---
Another experiment. If I change net_read_timeout to a longer period for one slave connection, neither will time out.
2019-06-29T23:16:03.648443Z 18382 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:03.648503Z 18382 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:16:03.648564Z 18382 [Note] Aborted connection 18382 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error)
2019-06-29T23:16:03.968342Z 18389 [Note] Start binlog_dump to master_thread_id(18389) slave_server(2), pos(, 4)
2019-06-29T23:16:03.968400Z 18389 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
2019-06-29T23:16:13.677212Z 18384 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:13.677270Z 18384 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:16:13.677338Z 18384 [Note] Aborted connection 18384 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error)
2019-06-29T23:16:14.097320Z 18391 [Note] Start binlog_dump to master_thread_id(18391) slave_server(3), pos(, 4)
2019-06-29T23:16:14.097378Z 18391 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)
[SET GLOBAL net_read_timeout = 30;]
2019-06-29T23:16:23.698088Z 18389 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:23.698145Z 18389 [Note] Stop semi-sync binlog_dump to slave (server_id: 2)
2019-06-29T23:16:23.698204Z 18389 [Note] Aborted connection 18389 to db: 'unconnected' user: 'replicant' host: 'backup.mrstuudio.ee' (Found net error)
2019-06-29T23:16:24.027270Z 18394 [Note] Start binlog_dump to master_thread_id(18394) slave_server(2), pos(, 4)
2019-06-29T23:16:24.027326Z 18394 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4)
[SET GLOBAL net_read_timeout = 10;]
2019-06-29T23:16:33.728043Z 18391 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply
2019-06-29T23:16:33.728099Z 18391 [Note] Stop semi-sync binlog_dump to slave (server_id: 3)
2019-06-29T23:16:33.728157Z 18391 [Note] Aborted connection 18391 to db: 'unconnected' user: 'replicant' host: 'aws.mrstuudio.ee' (Found net error)
2019-06-29T23:16:34.373232Z 18397 [Note] Start binlog_dump to master_thread_id(18397) slave_server(3), pos(, 4)
2019-06-29T23:16:34.373284Z 18397 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4)
Now the errors stop for both connections.
Suggested fix:
Master heartbeat sent to slaves should restart the net_read_timeout counter on slave connections with SSL.
Description: Master heartbeat doesn't work properly with semisynchronuous replication _with_ SSL and master to slave connections time out because of that. It took me a week to narrow it down and eliminate network problems etc. Observed with MySQL Community Server 5.7.26 from official MySQL repo on CentOS 7. Error on the slave: 2019-06-29T23:18:55.225957Z 9606 [ERROR] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013) Error on the master: 2019-06-29T23:18:55.225547Z 18428 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:18:55.225607Z 18428 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:18:55.225666Z 18428 [Note] Aborted connection 18428 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:18:55.570221Z 18434 [Note] Start binlog_dump to master_thread_id(18434) slave_server(2), pos(, 4) 2019-06-29T23:18:55.570279Z 18434 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) As I understand the master is closing the connection when net_read_timeout is reached with no updates to the binary log in that time. MASTER_HEARTBEAT_PERIOD that is shorter than master net_read_timeout has no effect. I observed heartbeats reaching the slave, but the connection is still closed when no binary log updates are transferred. CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 10; SELECT LAST_HEARTBEAT_TIMESTAMP FROM performance_schema.replication_connection_status; 2019-06-30 00:32:06 2019-06-30 00:32:16 2019-06-30 00:32:26 How to repeat: Turn on semisynchronuous replication with SSL. SET GLOBAL rpl_semi_sync_slave_enabled = ON; CHANGE MASTER TO MASTER_HEARTBEAT_PERIOD = 1; CHANGE MASTER TO MASTER_SSL = 1, MASTER_SSL_CA = "/var/lib/mysql/ca.pem", MASTER_SSL_CERT = "/var/lib/mysql/client-cert.pem", MASTER_SSL_KEY = "/var/lib/mysql/client-key.pem", MASTER_SSL_VERIFY_SERVER_CERT = 1; Set master net_read_timeout = 10 and a looping script that runs an update and sleeps for 10 seconds. 2019-06-29T23:06:35.532708Z 18168 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:06:35.532760Z 18168 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:06:35.532817Z 18168 [Note] Aborted connection 18168 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:06:35.837848Z 18173 [Note] Start binlog_dump to master_thread_id(18173) slave_server(2), pos(, 4) 2019-06-29T23:06:35.837940Z 18173 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) 2019-06-29T23:06:45.986086Z 18173 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:06:45.986144Z 18173 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:06:45.986201Z 18173 [Note] Aborted connection 18173 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:06:46.476858Z 18175 [Note] Start binlog_dump to master_thread_id(18175) slave_server(2), pos(, 4) 2019-06-29T23:06:46.477229Z 18175 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) 2019-06-29T23:06:56.668502Z 18175 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:06:56.668569Z 18175 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:06:56.668625Z 18175 [Note] Aborted connection 18175 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:06:56.979290Z 18178 [Note] Start binlog_dump to master_thread_id(18178) slave_server(2), pos(, 4) 2019-06-29T23:06:56.979352Z 18178 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) --- If I add another slave to the mix with identical parameters, then they start to take turns - as if the timeout was multiplied by the number of slaves and/or updated on the wrong connection: 2019-06-29T23:13:33.271856Z 18339 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:13:33.271910Z 18339 [Note] Stop semi-sync binlog_dump to slave (server_id: 3) 2019-06-29T23:13:33.271965Z 18339 [Note] Aborted connection 18339 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error) 2019-06-29T23:13:33.694214Z 18343 [Note] Start binlog_dump to master_thread_id(18343) slave_server(3), pos(, 4) 2019-06-29T23:13:33.694267Z 18343 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4) 2019-06-29T23:13:43.295823Z 18341 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:13:43.295880Z 18341 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:13:43.295944Z 18341 [Note] Aborted connection 18341 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:13:43.625631Z 18345 [Note] Start binlog_dump to master_thread_id(18345) slave_server(2), pos(, 4) 2019-06-29T23:13:43.625695Z 18345 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) 2019-06-29T23:13:53.325681Z 18343 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:13:53.325736Z 18343 [Note] Stop semi-sync binlog_dump to slave (server_id: 3) 2019-06-29T23:13:53.325794Z 18343 [Note] Aborted connection 18343 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error) 2019-06-29T23:13:53.721311Z 18349 [Note] Start binlog_dump to master_thread_id(18349) slave_server(3), pos(, 4) 2019-06-29T23:13:53.721370Z 18349 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4) 2019-06-29T23:14:03.345837Z 18345 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:14:03.345895Z 18345 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:14:03.345951Z 18345 [Note] Aborted connection 18345 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:14:03.678669Z 18353 [Note] Start binlog_dump to master_thread_id(18353) slave_server(2), pos(, 4) 2019-06-29T23:14:03.678726Z 18353 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) --- Another experiment. If I change net_read_timeout to a longer period for one slave connection, neither will time out. 2019-06-29T23:16:03.648443Z 18382 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:16:03.648503Z 18382 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:16:03.648564Z 18382 [Note] Aborted connection 18382 to db: 'unconnected' user: 'replicant' host: 'backup' (Found net error) 2019-06-29T23:16:03.968342Z 18389 [Note] Start binlog_dump to master_thread_id(18389) slave_server(2), pos(, 4) 2019-06-29T23:16:03.968400Z 18389 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) 2019-06-29T23:16:13.677212Z 18384 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:16:13.677270Z 18384 [Note] Stop semi-sync binlog_dump to slave (server_id: 3) 2019-06-29T23:16:13.677338Z 18384 [Note] Aborted connection 18384 to db: 'unconnected' user: 'replicant' host: 'aws' (Found net error) 2019-06-29T23:16:14.097320Z 18391 [Note] Start binlog_dump to master_thread_id(18391) slave_server(3), pos(, 4) 2019-06-29T23:16:14.097378Z 18391 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4) [SET GLOBAL net_read_timeout = 30;] 2019-06-29T23:16:23.698088Z 18389 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:16:23.698145Z 18389 [Note] Stop semi-sync binlog_dump to slave (server_id: 2) 2019-06-29T23:16:23.698204Z 18389 [Note] Aborted connection 18389 to db: 'unconnected' user: 'replicant' host: 'backup.mrstuudio.ee' (Found net error) 2019-06-29T23:16:24.027270Z 18394 [Note] Start binlog_dump to master_thread_id(18394) slave_server(2), pos(, 4) 2019-06-29T23:16:24.027326Z 18394 [Note] Start semi-sync binlog_dump to slave (server_id: 2), pos(, 4) [SET GLOBAL net_read_timeout = 10;] 2019-06-29T23:16:33.728043Z 18391 [ERROR] Semi-sync master failed on net_flush() before waiting for slave reply 2019-06-29T23:16:33.728099Z 18391 [Note] Stop semi-sync binlog_dump to slave (server_id: 3) 2019-06-29T23:16:33.728157Z 18391 [Note] Aborted connection 18391 to db: 'unconnected' user: 'replicant' host: 'aws.mrstuudio.ee' (Found net error) 2019-06-29T23:16:34.373232Z 18397 [Note] Start binlog_dump to master_thread_id(18397) slave_server(3), pos(, 4) 2019-06-29T23:16:34.373284Z 18397 [Note] Start semi-sync binlog_dump to slave (server_id: 3), pos(, 4) Now the errors stop for both connections. Suggested fix: Master heartbeat sent to slaves should restart the net_read_timeout counter on slave connections with SSL.