MySQL Bugs: #80103: MTS LOGICAL_CLOCK with slave_preserve_commit

Bug #80103	MTS LOGICAL_CLOCK with slave_preserve_commit_order=1 not replication crash safe.
Submitted:	21 Jan 2016 20:45	Modified:	17 Apr 2017 11:43
Reporter:	Jean-François Gagné
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.10, 5.7.12	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi,

Bug #77496 describes that MTS in 5.6.28 and 5.7.10 is not replication crash safe (in all of slave-parallel-type=DATABASE, slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=0, and slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1).  This bug is specifically about slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1.

The manual recovery procedure of Bug #77496 is as follows:

1. restart MySQL with relay-log-recovery=0 and skip-slave-start,
2. run START SLAVE UNTIL SQL_AFTER_MTS_GAPS,
3. once the gap is gone, restart MySQL with relay-log-recovery=1.
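
For reference, the three steps above can be written down as data. This is only a sketch: it prints the options and the statement quoted in the procedure and does not connect to or restart any server. The names RECOVERY_STEPS and describe are hypothetical, introduced here for illustration.

```python
# Sketch only: encodes the manual recovery procedure of Bug #77496 as data.
# Nothing here talks to a MySQL server; it just formats the steps.
RECOVERY_STEPS = [
    {"action": "restart",
     "mysqld_options": ["--relay-log-recovery=0", "--skip-slave-start"]},
    {"action": "sql",
     "statement": "START SLAVE UNTIL SQL_AFTER_MTS_GAPS"},
    {"action": "restart",
     "mysqld_options": ["--relay-log-recovery=1"]},
]


def describe(steps):
    """Return one human-readable line per recovery step."""
    lines = []
    for number, step in enumerate(steps, start=1):
        if step["action"] == "restart":
            options = " ".join(step["mysqld_options"])
            lines.append(f"{number}. restart mysqld with {options}")
        else:
            lines.append(f"{number}. run: {step['statement']};")
    return lines


if __name__ == "__main__":
    print("\n".join(describe(RECOVERY_STEPS)))
```

Step 3 must wait until the statement in step 2 has finished filling the gaps; how to check that is left out of this sketch.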

However, in the case of slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1, there are no gaps in transaction execution, so starting the slave UNTIL SQL_AFTER_MTS_GAPS does not make much sense.

To me, it looks like slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1 is a special case of Bug #77496, and making MySQL replication crash safe in this case might be easier than in the general case.  Hopefully, both cases can be solved at the same time, but this special case may be easier to handle, so a fix for it could come earlier.

Thanks,

JFG

How to repeat:
Crash a slave running in MTS mode, and restart MySQL with relay-log-recovery=1.

Hello Jean,

Thank you for the report and feedback!

Thanks,
Umesh

test results

Attachment: 80103_5.7.12.results (application/octet-stream, text), 16.93 KiB.

Related - Bug #77496

The solution for the above issue has been implemented as part of the fix for https://bugs.mysql.com/bug.php?id=77496.
The fix is available in MySQL version 5.7.13.
Hence closing this bug as fixed.

I am not sure this is fully fixed, see Bug#81840.

We are hitting this issue in 5.7.18.
I do not think there are any gaps in the transactions that got executed.
We had to bring down the server for maintenance, and upon restarting MySQL we are seeing the following:
2018-01-02T05:43:30.010481Z 0 [ERROR] Error reading slave worker configuration
2018-01-02T05:43:30.010499Z 0 [ERROR] Error creating relay log info: Failed to initialize the worker info structure.
2018-01-02T05:43:30.021464Z 0 [ERROR] Failed to initialize the master info structure
2018-01-02T05:43:30.021483Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-01-02T05:43:30.021489Z 0 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2018-01-02T05:43:30.036395Z 0 [Note] Event Scheduler: Loaded 0 events
2018-01-02T05:43:30.036539Z 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.16-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)
2018-01-02T05:43:30.401534Z 0 [Note] InnoDB: Buffer pool(s) load completed at 180102  5:43:30
2018-01-02T05:43:33.906619Z 5 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='172.19.0.6', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='172.19.0.6', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2018-01-02T05:43:37.030644Z 7 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2018-01-02T05:43:40.158675Z 9 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
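
When comparing a log like the one above against other reports, it can help to pull out just the [ERROR] lines and any Error_code value. The following is a minimal sketch that assumes the line layout seen in this log (timestamp, thread id, bracketed level, message); parse_errors and the two regular expressions are hypothetical helpers, not part of MySQL.

```python
import re

# Assumed layout, taken from the log lines above:
#   <timestamp> <thread id> [<level>] <message>
LINE_RE = re.compile(
    r"^(?P<ts>\S+)\s+(?P<thread>\d+)\s+\[(?P<level>\w+)\]\s+(?P<msg>.*)$"
)
CODE_RE = re.compile(r"Error_code:\s*(?P<code>\d+)")


def parse_errors(lines):
    """Return (timestamp, error_code or None, message) for each [ERROR] line."""
    found = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match or match.group("level") != "ERROR":
            continue  # skip notes and lines that do not fit the layout
        code = CODE_RE.search(match.group("msg"))
        found.append((
            match.group("ts"),
            int(code.group("code")) if code else None,
            match.group("msg"),
        ))
    return found


SAMPLE = [
    "2018-01-02T05:43:30.010481Z 0 [ERROR] Error reading slave worker configuration",
    "2018-01-02T05:43:30.036395Z 0 [Note] Event Scheduler: Loaded 0 events",
    "2018-01-02T05:43:37.030644Z 7 [ERROR] Slave SQL for channel '': Slave failed "
    "to initialize relay log info structure from the repository, Error_code: 1872",
]

if __name__ == "__main__":
    for ts, code, msg in parse_errors(SAMPLE):
        print(ts, code, msg)
```

Lines that do not start with a timestamp and thread id, such as the "Version:" banner, are simply skipped.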