MySQL Bugs: #72635: data inconsistencies when master has truncated binary log with GTID after crash

Bug #72635	data inconsistencies when master has truncated binary log with GTID after crash
Submitted:	13 May 2014 18:24	Modified:	8 Dec 2014 15:34
Reporter:	Santosh Praneeth Banda
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6.16, 5.6.17	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The master is running with GTIDs enabled, sync_binlog=1, and innodb_flush_log_at_trx_commit=1.
After a crash, the master's binary log may be truncated because of a hardware error (for example, a RAID cache failure).

Without GTIDs, slaves fail with the error "Error reading packet from server: Client requested master to start replication from position > file size".

With GTIDs, slaves silently skip transactions because the master re-uses GTIDs
that the slaves have already executed.

This causes data inconsistencies on the slaves, which may later fail with duplicate key errors.
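
The silent skip described above can be illustrated with a toy model. This is only a sketch under simplifying assumptions, not MySQL's actual implementation: GTIDs are modeled as (uuid, sequence number) pairs, `MASTER_UUID` is a placeholder, and `events_to_send` is a hypothetical helper standing in for the rule that transactions whose GTIDs the slave already has are not sent again.

```python
# Toy model of the failure mode; all names here are illustrative.
MASTER_UUID = "master-uuid"  # placeholder, not a real server UUID


def events_to_send(master_binlog, slave_gtids):
    """Return the binlog events whose GTID the slave has not seen yet."""
    return [(gtid, row) for gtid, row in master_binlog if gtid not in slave_gtids]


# Before the crash: the master committed transactions 1..5 and the
# slave replicated all five of them.
slave_data = {(MASTER_UUID, n): f"row-{n}" for n in range(1, 6)}

# The crash truncates the master's binary log back to 1..3. The master
# then commits two NEW transactions, which re-use sequence numbers 4 and 5.
master_binlog = [((MASTER_UUID, n), f"row-{n}") for n in range(1, 4)]
master_binlog += [((MASTER_UUID, 4), "new-row-a"), ((MASTER_UUID, 5), "new-row-b")]

# The slave already "has" GTIDs 4 and 5, so nothing is sent and no error
# is raised, although the contents behind those GTIDs now differ.
sent = events_to_send(master_binlog, set(slave_data))
diverged = [
    gtid for gtid, row in master_binlog if slave_data.get(gtid) != row
]
print(sent)
print(diverged)
```

In this model the slave keeps its old rows for GTIDs 4 and 5 while the master holds different ones, which matches the inconsistency reported here.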

How to repeat:
see description

Suggested fix:
Avoid slaves silently skipping transactions.

Updating severity level

Hello Santosh,

Thank you for the bug report.
Verified as described.

Thanks,
Umesh

// Master/Slave with MySQL version 5.6.17

With GTID enabled: no error is reported (the slave stays up and even syncs new data), but data inconsistencies are observed (the events that were truncated during the crash are lost).

Without GTID enabled: the slave's I/O thread stopped with:

 Last_IO_Error: Got fatal error 1236 from master when reading data from binary log: 'Client requested master to start replication from position > file size; the first event 'master-bin.000003' at 3382, the last event read from './master-bin.000003' at 4, the last byte read from './master-bin.000003' at 4.'

Thanks for your feedback. The following was added to the 5.6.23 and 5.7.6 changelog with commit 4747:
In normal usage, it is not possible for a slave to have more GTIDs than the master. But in certain situations, such as after a hardware failure or incorrectly cleared gtid_purged, the master's binary log could be truncated. This fix ensures that in such a situation, the master now detects that the slave has transactions with GTIDs which are not on the master. An error is now generated on the slave and the I/O thread is stopped with an error. The master's dump thread is also stopped. This prevents data inconsistencies during replication.

$ git show -s 6e6add6
commit 6e6add6bb5649b6f75579c86f5a4a51e95c54fb6
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date:   Tue Nov 18 09:54:31 2014 +0530

    Bug #18789758  DATA INCONSISTENCIES WHEN MASTER HAS TRUNCATED
          BINARY LOG WITH GTID AFTER CRASH
          Problem:
           Master's dump thread does not detect the case where Slave's
           gtid executed set has more gtids than Master's gtid
           executed set with respect to Master's UUID.
    
          Analysis & Fix:
           In normal scenarios, it is not possible for Slave to
           contain more gtids than Master with respect to Master's UUID.
           But it is possible if Master's binary log is
           truncated (due to raid failure) or Master's binary log is
           deleted but GTID_PURGED was not set properly. That scenario
           needs to be validated, i.e., it should *always* be the case that
           Slave's gtid executed set (+retrieved set) is a subset of
           Master's gtid executed set with respect to Master's UUID.
           If it is not, Master's dump thread will be stopped and
           Slave will be informed of this situation during the handshake (thus
           Slave's I/O thread will also be stopped with an error,
           ER_MASTER_FATAL_ERROR_READING_BINLOG). Otherwise, it can lead
           to data inconsistency between Master and Slave.
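
The validation described in the commit message (the slave's GTID set must be a subset of the master's, restricted to the master's UUID) can be sketched outside the server as follows. This is an illustration, not the code from the fix: `parse_gtid_set` and `slave_is_ahead` are hypothetical helpers, and the parser assumes the textual `uuid:start-end:start-end,...` form in which `gtid_executed` is displayed.

```python
def parse_gtid_set(text):
    """Parse 'uuid:1-5:7,uuid2:3' into {uuid: set of sequence numbers}.

    Intervals are expanded into plain sets, which is fine for a sketch
    but wasteful for very large GTID sets.
    """
    result = {}
    for part in text.replace("\n", "").split(","):
        part = part.strip()
        if not part:
            continue
        uuid, *intervals = part.split(":")
        numbers = result.setdefault(uuid.lower(), set())
        for interval in intervals:
            start, _, end = interval.partition("-")
            numbers.update(range(int(start), int(end or start) + 1))
    return result


def slave_is_ahead(master_executed, slave_executed, master_uuid):
    """True if the slave holds GTIDs for master_uuid that the master lacks."""
    key = master_uuid.lower()
    master = parse_gtid_set(master_executed).get(key, set())
    slave = parse_gtid_set(slave_executed).get(key, set())
    return not slave <= master


# Example UUID used purely for illustration.
UUID = "3e11fa47-71ca-11e1-9e33-c80aa9429562"
print(slave_is_ahead(f"{UUID}:1-3", f"{UUID}:1-5", UUID))  # truncated master
print(slave_is_ahead(f"{UUID}:1-5", f"{UUID}:1-3", UUID))  # normal lag
```

Only the master's own UUID is compared, so GTIDs that the slave received from other servers do not trigger the check. On a live server, the built-in SQL function GTID_SUBSET() can compare two complete GTID sets, although it does not restrict the comparison to a single UUID.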