Bug #98448 Please make running MySQL with sync_binlog != 1 safer.
Submitted: 31 Jan 2020 11:42 Modified: 2 Mar 2020 15:01
Reporter: Jean-François Gagné
Status: Verified
Category: MySQL Server: Replication Severity: S4 (Feature request)
Version: 5.7, 8.0 OS: Any
Assigned to: CPU Architecture: Any

[31 Jan 2020 11:42] Jean-François Gagné
Description:
Hi,

This feature request is made in the context of my FOSDEM talk "The consequences of sync_binlog != 1" ([1]).

[1]: https://fosdem.org/2020/schedule/event/sync_binlog/

Obviously, running MySQL with sync_binlog != 1 (and with innodb_flush_log_at_trx_commit != 1) is not "safe" from a transaction durability point of view.  However, combined with replication, running the "MySQL Replicated Distributed System" in a safe way is the everyday challenge of most DBAs.

In my FOSDEM talk, I point out that with sync_binlog != 1, the binary logs cannot be trusted after an OS crash (but they can be trusted after a mysqld crash).  Telling an OS crash apart from a mysqld crash, and reacting accordingly, is a major challenge for DBAs, and the current implementation of MySQL makes that reaction complicated.

In my FOSDEM talk, I suggest putting "offline_mode = ON" in the MySQL configuration to prevent slaves and clients from reconnecting to a master after a crash.  This is needed after an OS crash, but not after a mysqld crash.  One way MySQL could be easier to run/operate is that after an OS crash, and when combined with sync_binlog != 1, the server would restart as offline, letting the DBA decide how to safely move forward from there (IMHO the best way forward is failing over to a slave).
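
As an illustration only (this is not something MySQL does today), the OS-crash versus mysqld-crash distinction could be sketched as below. The helper name `classify_restart` and its three inputs are hypothetical: the sketch assumes the caller can obtain the OS boot time, the last time mysqld was known to be running, and whether the previous shutdown was clean.

```python
from enum import Enum


class CrashKind(Enum):
    NONE = "clean shutdown"
    MYSQLD = "mysqld crash"
    OS = "OS crash"


def classify_restart(clean_shutdown: bool,
                     os_boot_time: float,
                     last_alive_time: float) -> CrashKind:
    """Classify how mysqld last stopped (hypothetical helper).

    clean_shutdown:  True if an orderly shutdown was recorded.
    os_boot_time:    epoch seconds at which the OS last booted.
    last_alive_time: epoch seconds at which mysqld was last known to run.
    """
    if clean_shutdown:
        return CrashKind.NONE
    # The OS booted after mysqld was last seen alive, so the machine itself
    # went down and unsynced binary log writes may have been lost.
    if os_boot_time > last_alive_time:
        return CrashKind.OS
    # Same boot as before: only mysqld died, so data already written to the
    # OS page cache is still there.
    return CrashKind.MYSQLD
```

The check is deliberately conservative: a mysqld crash followed by an ordinary reboot is also reported as an OS crash, which errs on the side of not trusting the binary logs.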

Also, because GTID Replication is not crash safe with sync_binlog != 1 (Bug#70659 and Bug#92109), it is not "safe" to have replication automatically start after an OS crash combined with sync_binlog != 1.  What I suggest in my FOSDEM talk is to set "skip-slave-start" in the MySQL configuration and to do some voodoo operations to restart replication (this avoids restoring from a backup).  One way MySQL could be easier to run/operate is that after an OS crash, and when combined with sync_binlog != 1, replication would not automatically start, and maybe those voodoo operations would be executed by the server itself.

My FOSDEM slides should be online at [1] soon, with more details (I will also add a comment to this bug with a direct link to the slides).

Many thanks for looking into that, JFG

How to repeat:
N/A to this feature request.

Suggested fix:
1. After an OS crash combined with sync_binlog != 1, the server should automatically restart with offline_mode = ON.

2. Until replication is crash safe with GTID and sync_binlog != 1, and after an OS crash combined with sync_binlog != 1, replication should not start automatically (as if skip-slave-start were set).
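
The behaviour requested in this report can be summarised as a small decision function. This is a sketch of the proposal under stated assumptions, not existing server behaviour: `StartupPolicy` and `startup_policy` are hypothetical names, and the `os_crashed` flag is assumed to be supplied by some crash-detection step that MySQL does not currently provide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StartupPolicy:
    offline_mode: bool       # restart as if offline_mode = ON
    skip_slave_start: bool   # restart as if skip-slave-start were set


def startup_policy(os_crashed: bool, sync_binlog: int) -> StartupPolicy:
    """Return the restart behaviour proposed in this feature request."""
    # With sync_binlog = 1 the binary log is synced at every commit, so the
    # report only asks for protection when the value is anything else.
    risky = os_crashed and sync_binlog != 1
    return StartupPolicy(offline_mode=risky, skip_slave_start=risky)
```

For example, `startup_policy(os_crashed=True, sync_binlog=0)` asks for both protections, while a mysqld-only crash (`os_crashed=False`) or `sync_binlog=1` leaves the server to start normally.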
[2 Feb 2020 7:13] MySQL Verification Team
Hello Jean-François,

Thank you for the feature request!

regards,
Umesh
[2 Mar 2020 15:01] Jean-François Gagné
The way to salvage a GTID slave running with sync_binlog != 1 is described in [1], and it involves starting with skip-slave-start.

[1]: https://www.slideshare.net/JeanFranoisGagn/the-consequences-of-syncbinlog-1/25