Bug #80103 MTS LOGICAL_CLOCK with slave_preserve_commit_order=1 not replication crash safe.
Submitted: 21 Jan 2016 20:45
Modified: 17 Apr 2017 11:43
Reporter: Jean-François Gagné
Status: Closed
Category: MySQL Server: Replication
Severity: S2 (Serious)
Version: 5.7.10, 5.7.12
OS: Any
Assigned to:
CPU Architecture: Any

[21 Jan 2016 20:45] Jean-François Gagné
Description:
Hi,

Bug #77496 describes that MTS in 5.6.28 and 5.7.10 is not replication crash safe (in all of slave-parallel-type=DATABASE, slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=0, and slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1).  This bug is specifically about slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1.

The manual recovery procedure of Bug #77496 is as follows:

1. restart MySQL with relay-log-recovery=0 and skip-slave-start,
2. run START SLAVE UNTIL SQL_AFTER_MTS_GAPS,
3. once the gap is gone, restart MySQL with relay-log-recovery=1.
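
For reference, the three steps above can be sketched as follows. This is only an illustration of the procedure as listed: the restarts in steps 1 and 3 happen outside SQL and are shown as comments, and checking Slave_SQL_Running in SHOW SLAVE STATUS is an assumed way of telling that the gap is gone, not something prescribed by Bug #77496.

```sql
-- Step 1 (outside SQL): restart mysqld with
--   --relay-log-recovery=0 --skip-slave-start
-- so that replication does not start and the relay logs are kept.

-- Step 2: let the applier run only until the MTS gaps are filled.
START SLAVE UNTIL SQL_AFTER_MTS_GAPS;

-- Assumed check: wait until Slave_SQL_Running is 'No' with no error
-- reported, which suggests the UNTIL condition was reached.
SHOW SLAVE STATUS;

-- Step 3 (outside SQL): restart mysqld with
--   --relay-log-recovery=1
```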

However, in the case of slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1, there are no gaps in transaction execution, so starting the slave UNTIL SQL_AFTER_MTS_GAPS does not make much sense.

To me, it looks like the case where slave-parallel-type=LOGICAL_CLOCK with slave_preserve_commit_order=1 is a special case of Bug #77496, and making MySQL replication crash safe in this case might be easier than the general case.  Hopefully, both cases could be solved at the same time, but maybe this special case could be easier to handle and a fix could come earlier.

Thanks,

JFG

How to repeat:
Crash a slave running in MTS mode, and restart MySQL with relay-log-recovery=1.
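
A possible way to put a 5.7 slave into the MTS mode this bug is about, before crashing it, is sketched below. The worker count of 4 is an arbitrary assumption, and the sketch assumes log_bin and log_slave_updates are already enabled on the slave, which slave_preserve_commit_order=1 requires in 5.7. The crash itself and the restart with relay-log-recovery=1 happen outside SQL.

```sql
-- The applier must be stopped before changing the parallel settings.
STOP SLAVE SQL_THREAD;

SET GLOBAL slave_parallel_type = 'LOGICAL_CLOCK';
SET GLOBAL slave_parallel_workers = 4;        -- assumed value, any value > 0 enables MTS
SET GLOBAL slave_preserve_commit_order = 1;   -- needs log_bin and log_slave_updates

START SLAVE SQL_THREAD;

-- Then (outside SQL): crash mysqld while it is applying transactions,
-- and restart it with --relay-log-recovery=1.
```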
[3 May 2016 8:01] MySQL Verification Team
Hello Jean,

Thank you for the report and feedback!

Thanks,
Umesh
[3 May 2016 8:02] MySQL Verification Team
test results

Attachment: 80103_5.7.12.results (application/octet-stream, text), 16.93 KiB.

[3 May 2016 8:04] MySQL Verification Team
Related - Bug #77496
[6 Jun 2016 7:05] Sujatha Sivakumar
A solution for the above issue has been implemented as part of the fix for https://bugs.mysql.com/bug.php?id=77496.
The fix is available in MySQL version 5.7.13.
Hence, closing this bug as fixed.
[17 Apr 2017 11:43] Jean-François Gagné
I am not sure this is fully fixed, see Bug#81840.
[4 Jan 2018 8:51] Prasad N
We are hitting this issue in 5.7.18.
I do not think there are any gaps in the transactions that got executed.
We had to bring down the server for maintenance, and upon restart of MySQL
we are seeing the following:
2018-01-02T05:43:30.010481Z 0 [ERROR] Error reading slave worker configuration
2018-01-02T05:43:30.010499Z 0 [ERROR] Error creating relay log info: Failed to initialize the worker info structure.
2018-01-02T05:43:30.021464Z 0 [ERROR] Failed to initialize the master info structure
2018-01-02T05:43:30.021483Z 0 [ERROR] Failed to create or recover replication info repositories.
2018-01-02T05:43:30.021489Z 0 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2018-01-02T05:43:30.036395Z 0 [Note] Event Scheduler: Loaded 0 events
2018-01-02T05:43:30.036539Z 0 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.7.16-log'  socket: '/var/lib/mysql/mysql.sock'  port: 3306  MySQL Community Server (GPL)
2018-01-02T05:43:30.401534Z 0 [Note] InnoDB: Buffer pool(s) load completed at 180102  5:43:30
2018-01-02T05:43:33.906619Z 5 [Note] 'CHANGE MASTER TO FOR CHANNEL '' executed'. Previous state master_host='172.19.0.6', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''. New state master_host='172.19.0.6', master_port= 3306, master_log_file='', master_log_pos= 4, master_bind=''.
2018-01-02T05:43:37.030644Z 7 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872
2018-01-02T05:43:40.158675Z 9 [ERROR] Slave SQL for channel '': Slave failed to initialize relay log info structure from the repository, Error_code: 1872