Description:
Hi,
while preparing my talk for the MinervaDB Athena 2020 Conference [1], I was reviewing old bugs and the way they were fixed. I should have done that before, sorry for the delay.
[1]: https://minervadbathena.com/
So Bug#92882 ended-up updating [2] and [3] (Handling an Unexpected Halt of a Replica), and in both of these, I do not see sync_binlog as being important for replication crash safety. From what I see, Bug#70659 and Bug#92109 are not closed, so GTID replication with log-slave-updates enabled should not be replication crash-safe with sync_binlog different than 1. And effectively, testing with 5.7.32 and 8.0.16 shows me that replication breaks after an OS crash when using GTID and sync_binlog = 0. See How to repeat for details.
[2]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html
[3]: https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-replica-halt.html
For 8.0, I was able to make replication break with 8.0.16, but not with 8.0.17 and 8.0.22. I suspect that WL#9211 [4] is making GTID replication crash-safe with sync_binlog = 0, but not reproducing a replication breakage does not mean that replication is crash safe. Did adding the GTID position in the InnoDB logs allows to replication crash-safety for GITD and sync_binlog != 1 ? If yes, should Bug#70659 and Bug#92109 be closed ?
[4]: https://dev.mysql.com/worklog/task/?id=9211
Also note that in how to repeat, I try to reproduce a replication breakage without success with sync_binlog = 1 and without sync_relay_log = 1. In [2] and [3], it is written that sync_relay_log = 1 is needed for replication crash safety, but I do not think it is always the case and my test are confirming this. I will open another bug about this and post the link in the comments.
Many thanks for looking in to this,
Jean-François Gagné
How to repeat:
# Create a replicated dbdeployer instance with 5.7.32. Note that I use log-slave-updates as without it, the GTID position of the slave is persisted in a table, and this is crash-safe.
jgagne@jfg-dbdeployer-1:~/$ dbdeployer deploy replication mysql_5.7.32 --gtid -c log-slave-updates
Installing and starting master
. sandbox server started
Installing and starting slave1
. sandbox server started
Installing and starting slave2
. sandbox server started
$HOME/sandboxes/rsandbox_mysql_5_7_32/initialize_slaves
initializing slave 1
initializing slave 2
Replication directory installed in $HOME/sandboxes/rsandbox_mysql_5_7_32
run 'dbdeployer usage multiple' for basic instructions'
# Stop both slaves.
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node1/stop & ./node2/stop & wait
[1] 1814
[2] 1815
stop /home/jgagne/sandboxes/rsandbox_mysql_5_7_32/node1
stop /home/jgagne/sandboxes/rsandbox_mysql_5_7_32/node2
[1]- Done ./node1/stop
# Initialize the master, shutdown MySQL, and reboot to make sure everything is on disk for the master.
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/use <<< "
create database test_jfg;
create table test_jfg.t(id bigint not null primary key);
set global sync_binlog = 0;
set global innodb_flush_log_at_trx_commit = 0"
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ seq -f "insert into t value(%.0f);" 1 500000 | ./master/use test_jfg
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/stop; sudo reboot
stop /home/jgagne/sandboxes/rsandbox_mysql_5_7_32/master
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].
# After the reboot, start master and slave, but with replication stopped. Before staring replication and crashing the vm, show that all crash-safe parameters are set as specified in [1]. But set sync_binlog to 0 (not mentioned in [1]).
[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/start & ./node1/start --skip-slave-start & wait
[1] 627
[2] 628
.... sandbox server started
sandbox server started
[1]- Done ./master/start
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ for v in innodb_flush_log_at_trx_commit relay_log_{info_repository,recovery}; do ./node1/use -N <<< "show global variables like '$v'"; done
innodb_flush_log_at_trx_commit 1
relay_log_info_repository TABLE
relay_log_recovery ON
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node1/use <<< "set global sync_relay_log = 1; set global sync_binlog = 0; start slave; do sleep(2); show slave status\G" && sudo bash -c "echo c > /proc/sysrq-trigger"
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19933
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 56915
Relay_Log_File: mysql-relay.000004
Relay_Log_Pos: 1307
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 4982
Relay_Log_Space: 53443
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 144
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 19933
Master_UUID: 00019933-1111-1111-1111-111111111111
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Reading event from the relay log
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:17-215
Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-20
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].
# After the vm is back up, start master and slave, and if replication was crash-safe, it should resume without problems. As shown below, it did not. So replication is not crash-safe with all the parameters specified in [1] and with sync_binlog = 0.
[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/start; ./node1/start; ./node1/use <<< "show slave status\G"
. sandbox server started
.. sandbox server started
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19933
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 70057646
Relay_Log_File: mysql-relay.000006
Relay_Log_Pos: 414
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: No
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 1007
Last_Error: Error 'Can't create database 'test_jfg'; database exists' on query. Default database: 'test_jfg'. Query: 'create database test_jfg'
Skip_Counter: 0
Exec_Master_Log_Pos: 4089
Relay_Log_Space: 70054174
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 1007
Last_SQL_Error: Error 'Can't create database 'test_jfg'; database exists' on query. Default database: 'test_jfg'. Query: 'create database test_jfg'
Replicate_Ignore_Server_Ids:
Master_Server_Id: 19933
Master_UUID: 00019933-1111-1111-1111-111111111111
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State:
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp: 201204 13:29:33
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:17-265368
Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-16
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
# Note that after showing a replication breakage with the 1st slave, I tried showing that replication is more resilient with sync_binlog = 1 (not necessarily crash-safe as it is not possible to prove crash-safety this way). Note that in below, I do not set sync_relay_log to 1 as specified in [1].
[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node2/start --skip-slave-start
.. sandbox server started
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ for v in sync_binlog sync_relay_log innodb_flush_log_at_trx_commit relay_log_{info_repository,recovery}; do ./node2/use -N <<< "show global variables like '$v'"; done
sync_binlog 1
sync_relay_log 10000
innodb_flush_log_at_trx_commit 1
relay_log_info_repository TABLE
relay_log_recovery ON
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node2/use <<< "start slave; do sleep(2); show slave status\G" && sudo bash -c "echo c > /proc/sysrq-trigger"
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19933
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 23222915
Relay_Log_File: mysql-relay.000004
Relay_Log_Pos: 101099
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 104774
Relay_Log_Space: 23219443
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 325
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 19933
Master_UUID: 00019933-1111-1111-1111-111111111111
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: System lock
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:17-87964
Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-398
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].
# I did below 10 times, and it was always able to resume replication. So GTID replication with sync_binlog = 1 is more resilient than with sync_binlog = 0.
And this did not ned sync_relay_log to 1. I will open another bug about this and post a link in the comments.
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/start & ./node2/start --skip-slave-start & wait
[1] 639
[2] 640
.... sandbox server started
sandbox server started
[1]- Done ./master/start
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node2/use <<< "start slave; do sleep(2); show slave status\G"; sudo bash -c "echo c > /proc/sysrq-trigger"
*************************** 1. row ***************************
Slave_IO_State: Queueing master event to the relay log
Master_Host: 127.0.0.1
Master_User: rsandbox
Master_Port: 19933
Connect_Retry: 60
Master_Log_File: mysql-bin.000001
Read_Master_Log_Pos: 24420758
Relay_Log_File: mysql-relay.000006
Relay_Log_Pos: 107334
Relay_Master_Log_File: mysql-bin.000001
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 234926
Relay_Log_Space: 24293369
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: 585
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error:
Last_SQL_Errno: 0
Last_SQL_Error:
Replicate_Ignore_Server_Ids:
Master_Server_Id: 19933
Master_UUID: 00019933-1111-1111-1111-111111111111
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: System lock
Master_Retry_Count: 86400
Master_Bind:
Last_IO_Error_Timestamp:
Last_SQL_Error_Timestamp:
Master_SSL_Crl:
Master_SSL_Crlpath:
Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:487-92500
Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-891
Auto_Position: 1
Replicate_Rewrite_DB:
Channel_Name:
Master_TLS_Version:
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].
# Also, note I am easily able to reproduce a replication breakage with 8.0.16, but not with 8.0.17 and 8.0.22. I suspect that something changed in 8.0.17 and I think WL#9211 [2] might have something to do with this. I will update Bug#70659 and Bug#92109 about this.
[2]: https://dev.mysql.com/worklog/task/?id=9211
Suggested fix:
At least in [1], and maybe in [2], mention that sync_binlog is needed for replication crash safety when GTID is used with log-slave-updates.
[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html
[2]: https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-replica-halt.html