Description:
Hi,
This feature request is made in the context of my FOSDEM talk "The consequences of sync_binlog != 1" ([1]).
[1]: https://fosdem.org/2020/schedule/event/sync_binlog/
Obviously, running MySQL with sync_binlog != 1 (and with innodb_flush_log_at_trx_commit != 1) is not "safe" from a transaction durability point of view. However, combined with replication, running the "MySQL Replicated Distributed System" in a safe way is the everyday challenge of most DBAs.
In my FOSDEM talk, I point out that with sync_binlog != 1, the binary logs cannot be trusted after an OS crash (but they can be trusted after a mysqld crash). Telling an OS crash from a mysqld crash, and reacting accordingly, is a major challenge for DBAs, and the current implementation of MySQL makes that reaction complicated.
In my FOSDEM talk, I suggest putting "offline_mode = ON" in the MySQL configuration to prevent slaves and clients from reconnecting to a master after a crash. This is needed after an OS crash, but not after a mysqld crash. MySQL would be easier to run/operate if, after an OS crash with sync_binlog != 1, the server restarted offline, letting the DBA decide how to safely move forward from there (IMHO the best way forward is failing over to a slave).
Also, because GTID replication is not crash safe with sync_binlog != 1 (Bug#70659 and Bug#92109), it is not "safe" to have replication automatically start after an OS crash combined with sync_binlog != 1. What I suggest in my FOSDEM talk is to set "skip-slave-start" in the MySQL configuration and to do some voodoo operations to restart replication (this avoids restoring from a backup). MySQL would be easier to run/operate if, after an OS crash with sync_binlog != 1, replication did not automatically start, and maybe those voodoo operations could be executed by the server itself.
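For reference, the workaround described above could look like the my.cnf fragment below. This is only a minimal sketch: the sync_binlog and innodb_flush_log_at_trx_commit values are assumed examples of "!= 1" settings, not a recommendation, and the rest of the configuration is omitted.

```ini
[mysqld]
# Assumed example values: any setting other than 1 trades durability for speed.
sync_binlog = 0
innodb_flush_log_at_trx_commit = 2

# Come back up in offline mode after any restart, so that clients and slaves
# do not reconnect before a DBA has checked what kind of crash happened.
offline_mode = ON

# Do not start the replication threads automatically on startup.
skip_slave_start
```

With this in place, every restart needs a manual step. After a plain mysqld crash, the DBA can re-open the server with SET GLOBAL offline_mode = OFF and resume replication with START SLAVE; after an OS crash, the server stays offline until a decision is made (for example, failing over to a slave).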
My FOSDEM slides should be online in [1] soon, more details in there (I will also add a comment in the bug with a direct link to the slides).
Many thanks for looking into that, JFG
How to repeat:
N/A to this feature request.
Suggested fix:
1. After an OS crash combined with sync_binlog != 1, the server should automatically restart with offline_mode = ON.
2. Until replication is crash safe with GTID and sync_binlog != 1, after an OS crash combined with sync_binlog != 1, the server should automatically restart without starting replication (as if skip-slave-start were set).