MySQL Bugs: #101874: The manual is missing sync_binlog = 1 for GTID replication crash safety.

Bug #101874	The manual is missing sync_binlog = 1 for GTID replication crash safety.
Submitted:	5 Dec 2020 1:29	Modified:	22 Nov 2022 21:33
Reporter:	Jean-François Gagné	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S2 (Serious)
Version:	5.7.32, 8.0.16	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

while preparing my talk for the MinervaDB Athena 2020 Conference [1], I was reviewing old bugs and the way they were fixed.  I should have done that before, sorry for the delay.

[1]: https://minervadbathena.com/

So Bug#92882 ended-up updating [2] and [3] (Handling an Unexpected Halt of a Replica), and in both of these, I do not see sync_binlog as being important for replication crash safety.  From what I see, Bug#70659 and Bug#92109 are not closed, so GTID replication with log-slave-updates enabled should not be replication crash-safe with sync_binlog different than 1.  And effectively, testing with 5.7.32 and 8.0.16 shows me that replication breaks after an OS crash when using GTID and sync_binlog = 0.  See How to repeat for details.

[2]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html

[3]: https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-replica-halt.html

For 8.0, I was able to make replication break with 8.0.16, but not with 8.0.17 and 8.0.22.  I suspect that WL#9211 [4] is making GTID replication crash-safe with sync_binlog = 0, but not reproducing a replication breakage does not mean that replication is crash safe.  Did adding the GTID position in the InnoDB logs allows to replication crash-safety for GITD and sync_binlog != 1 ?  If yes, should Bug#70659 and Bug#92109 be closed ?

[4]: https://dev.mysql.com/worklog/task/?id=9211

Also note that in how to repeat, I try to reproduce a replication breakage without success with sync_binlog = 1 and without sync_relay_log = 1.  In [2] and [3], it is written that sync_relay_log = 1 is needed for replication crash safety, but I do not think it is always the case and my test are confirming this.  I will open another bug about this and post the link in the comments.

Many thanks for looking in to this,

Jean-François Gagné

How to repeat:
# Create a replicated dbdeployer instance with 5.7.32.  Note that I use log-slave-updates as without it, the GTID position of the slave is persisted in a table, and this is crash-safe.

jgagne@jfg-dbdeployer-1:~/$ dbdeployer deploy replication mysql_5.7.32 --gtid -c log-slave-updates
Installing and starting master
. sandbox server started
Installing and starting slave1
. sandbox server started
Installing and starting slave2
. sandbox server started
$HOME/sandboxes/rsandbox_mysql_5_7_32/initialize_slaves
initializing slave 1
initializing slave 2
Replication directory installed in $HOME/sandboxes/rsandbox_mysql_5_7_32
run 'dbdeployer usage multiple' for basic instructions'

# Stop both slaves.

jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node1/stop & ./node2/stop & wait
[1] 1814
[2] 1815
stop /home/jgagne/sandboxes/rsandbox_mysql_5_7_32/node1
stop /home/jgagne/sandboxes/rsandbox_mysql_5_7_32/node2
[1]-  Done                    ./node1/stop

# Initialize the master, shutdown MySQL, and reboot to make sure everything is on disk for the master.

jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/use <<< "
  create database test_jfg;
  create table test_jfg.t(id bigint not null primary key);
  set global sync_binlog = 0;
  set global innodb_flush_log_at_trx_commit = 0"
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ seq -f "insert into t value(%.0f);" 1 500000 | ./master/use test_jfg
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/stop; sudo reboot
stop /home/jgagne/sandboxes/rsandbox_mysql_5_7_32/master
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].

# After the reboot, start master and slave, but with replication stopped.  Before staring replication and crashing the vm, show that all crash-safe parameters are set as specified in [1].  But set sync_binlog to 0 (not mentioned in [1]).

[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html

jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/start & ./node1/start --skip-slave-start & wait
[1] 627
[2] 628
.... sandbox server started
 sandbox server started
[1]-  Done                    ./master/start
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ for v in innodb_flush_log_at_trx_commit relay_log_{info_repository,recovery}; do ./node1/use -N <<< "show global variables like '$v'"; done
innodb_flush_log_at_trx_commit  1
relay_log_info_repository       TABLE
relay_log_recovery      ON
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node1/use <<< "set global sync_relay_log = 1; set global sync_binlog = 0; start slave; do sleep(2); show slave status\G" && sudo bash -c "echo c > /proc/sysrq-trigger"
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 19933
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 56915
               Relay_Log_File: mysql-relay.000004
                Relay_Log_Pos: 1307
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4982
              Relay_Log_Space: 53443
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 144
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 19933
                  Master_UUID: 00019933-1111-1111-1111-111111111111
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: Reading event from the relay log
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:17-215
            Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-20
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].

# After the vm is back up, start master and slave, and if replication was crash-safe, it should resume without problems.  As shown below, it did not.  So replication is not crash-safe with all the parameters specified in [1] and with sync_binlog = 0.

[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html

jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/start; ./node1/start; ./node1/use <<< "show slave status\G"
. sandbox server started
.. sandbox server started
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 19933
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 70057646
               Relay_Log_File: mysql-relay.000006
                Relay_Log_Pos: 414
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1007
                   Last_Error: Error 'Can't create database 'test_jfg'; database exists' on query. Default database: 'test_jfg'. Query: 'create database test_jfg'
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 4089
              Relay_Log_Space: 70054174
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1007
               Last_SQL_Error: Error 'Can't create database 'test_jfg'; database exists' on query. Default database: 'test_jfg'. Query: 'create database test_jfg'
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 19933
                  Master_UUID: 00019933-1111-1111-1111-111111111111
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 201204 13:29:33
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:17-265368
            Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-16
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:

# Note that after showing a replication breakage with the 1st slave, I tried showing that replication is more resilient with sync_binlog = 1 (not necessarily crash-safe as it is not possible to prove crash-safety this way).  Note that in below, I do not set sync_relay_log to 1 as specified in [1].

[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html

jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node2/start --skip-slave-start
.. sandbox server started
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ for v in sync_binlog sync_relay_log innodb_flush_log_at_trx_commit relay_log_{info_repository,recovery}; do ./node2/use -N <<< "show global variables like '$v'"; done
sync_binlog	1
sync_relay_log	10000
innodb_flush_log_at_trx_commit	1
relay_log_info_repository	TABLE
relay_log_recovery	ON
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node2/use <<< "start slave; do sleep(2); show slave status\G" && sudo bash -c "echo c > /proc/sysrq-trigger"
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 19933
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 23222915
               Relay_Log_File: mysql-relay.000004
                Relay_Log_Pos: 101099
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 104774
              Relay_Log_Space: 23219443
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 325
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 19933
                  Master_UUID: 00019933-1111-1111-1111-111111111111
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: System lock
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:17-87964
            Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-398
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].

# I did below 10 times, and it was always able to resume replication.  So GTID replication with sync_binlog = 1 is more resilient than with sync_binlog = 0. 
 And this did not ned sync_relay_log to 1.  I will open another bug about this and post a link in the comments.

jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./master/start & ./node2/start --skip-slave-start & wait
[1] 639
[2] 640
.... sandbox server started
 sandbox server started
[1]-  Done                    ./master/start
jgagne@jfg-dbdeployer-1:~/sandboxes/rsandbox_mysql_5_7_32$ ./node2/use <<< "start slave; do sleep(2); show slave status\G"; sudo bash -c "echo c > /proc/sysrq-trigger"
*************************** 1. row ***************************
               Slave_IO_State: Queueing master event to the relay log
                  Master_Host: 127.0.0.1
                  Master_User: rsandbox
                  Master_Port: 19933
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 24420758
               Relay_Log_File: mysql-relay.000006
                Relay_Log_Pos: 107334
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 234926
              Relay_Log_Space: 24293369
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 585
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 19933
                  Master_UUID: 00019933-1111-1111-1111-111111111111
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: System lock
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp:
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set: 00019933-1111-1111-1111-111111111111:487-92500
            Executed_Gtid_Set: 00019933-1111-1111-1111-111111111111:1-891
                Auto_Position: 1
         Replicate_Rewrite_DB:
                 Channel_Name:
           Master_TLS_Version:
client_loop: send disconnect: Broken pipe
ERROR: (gcloud.beta.compute.ssh) [/usr/bin/ssh] exited with return code [255].

# Also, note I am easily able to reproduce a replication breakage with 8.0.16, but not with 8.0.17 and 8.0.22.  I suspect that something changed in 8.0.17 and I think WL#9211 [2] might have something to do with this.  I will update Bug#70659 and Bug#92109 about this.

[2]: https://dev.mysql.com/worklog/task/?id=9211

Suggested fix:
At least in [1], and maybe in [2], mention that sync_binlog is needed for replication crash safety when GTID is used with log-slave-updates.

[1]: https://dev.mysql.com/doc/refman/5.7/en/replication-solutions-unexpected-replica-halt.html

[2]: https://dev.mysql.com/doc/refman/8.0/en/replication-solutions-unexpected-replica-halt.html

Hello Jean-François,

Thank you for the report!

regards,
Umesh

Related: Bug#101876 -  The manual is overly conservative in parameters for replication crash safety.

This appears to have been fixed already by changes made 20201031.

Closed.