Bug #78061 GTID replication halts with Error_code: 1032; handler error HA_ERR_END_OF_FILE
Submitted: 13 Aug 2015 21:53 Modified: 17 Apr 20:27
Reporter: Steven Douglas Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Row Based Replication ( RBR ) Severity:S1 (Critical)
Version:5.6.25 OS:FreeBSD (10.1)
Assigned to: Bogdan Kecman CPU Architecture:Any

[13 Aug 2015 21:53] Steven Douglas
Description:
We are running MySQL Server 5.6.25 on FreeBSD 10.1. My colleagues and I think we have stumbled upon a bug in MySQL GTID replication. We run GTID with many clients, but have this issue with replication halting with a new client. 

We have replication set up between two servers, DB1 and DB2. User database traffic is always sent to DB1 unless there is a catastrophic failure, in which case DB2 would take over. The replication is row based (RBR) because the client's triggers are nondeterministic. This client has a lot of DB traffic, but isn't very heavy on writes. 

Over the last 2-3 weeks we have struggled to keep replication running on DB2. Every few hours, replication halts with one of these errors on the slave:

Could not execute Update_rows event on table customer1.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 344884

Could not execute Update_rows event on table customer1.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 9685352 

Could not execute Update_rows event on table customer2.table1; Can't find record in 'table1', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 28653123

Could not execute Delete_rows event on table customer3.table2; Can't find record in 'table2', Error_code: 1032; handler error HA_ERR_END_OF_FILE; the event's master log db1-bin.000001, end_log_pos 179801477

We can get replication running again by doing a dump on DB1 and a restore on DB2. Before starting replication, we use pt-table-checksum to verify the databases were identical. Replication starts. We continuously run pt-table-checksum, and the tables are identical right up until it eventually halts again (anywhere from 1 hour to 24 hours).

We have also tried hooking up a third server, DB3 into the replication setup. DB2 and DB3's replication halted at the same time with the same error.

We also firewalled DB2, to ensure it was not getting any requests from the application, but it did not help.

Here is a configuration snippet from DB2 (DB1 is identical in relevant places):
[mysqld]
gtid_mode=ON
enforce-gtid-consistency = 1
log-bin
log-slave-updates
log-bin=db2-bin 
relay-log=db2-relay-bin
relay-log-recovery = 1
relay-log-info-repository = TABLE
relay_log_purge = 1
server-id = 2
replicate-same-server-id = 0
auto-increment-increment = 2
auto-increment-offset = 2
expire_logs_days = 14
back_log = 1000
expire_logs_days = 14
innodb_file_per_table
log-bin-trust-function-creators = 1
binlog_format = row

key_buffer = 384M
max_allowed_packet = 32M
sort_buffer_size = 1M
read_buffer_size = 1M

innodb_log_file_size = 100M
innodb_flush_log_at_trx_commit = 2

We are stuck, and we think this may be a bug in MySQL. GTID has run flawlessly for us for years with other clients but that was using mixed/statement replication.

How to repeat:
Setup RBR and do continuous updates and deletes.
[2 Sep 2015 7:48] Steven Douglas
After fighting with this for a few months, we think we have a workaround.

To be clear, we believe that this is still a bug in the MySQL GTID row based replication. What we have is a workaround. Because of the nature of the bug, we can see why this bug would be hard to find, but believe it is a valid use case.

We started to notice a trend in the tables that were having errors. None of them have unique (PRIMARY) indexes. This seems to break replication, depending on traffic, in 1-24 hours. We are still unable to determine what exact operation actually breaks replication.

After applying an auto incremented primary id to the table, all the issues disappear. This is a pain, as our client has a high volume of affected tables, but we are relieved to have a workaround.
[23 Sep 2015 0:56] Steven Douglas
Our company wrote up a post on our website about this issue: https://www.ateamsystems.com/tech-blog/mysql-rbr-row-based-replication-w-gtid-halts-with-e...
[14 Feb 2016 15:18] Vitaly Karasik
is there any updates for this bug?
[13 Jun 2017 5:00] Marycole Pernitiz
We are having similar problem. 

We had it last year but it works on its own without us knowing why it work. 

Now, this error occurs again but this time it comes back  after a few hours of doing db restoration, very similar problem to the above.

Our MySQL version is 5.7.13
[19 Jul 2017 16:01] Bobby Iliev
We've got the same problem here. After running SQL_SLAVE_SKIP_COUNTER a couple of times and restating the slave server, the occasionally replication starts again. But this is not ideal as we need to constantly monitor the status.

Do you think that switching from RBR to Mixed, would be a good idea?

Also has anyone test that with MariaDB?
[17 Apr 20:27] Bogdan Kecman
Hi,

There is more then one instance on the Internet explaining why tables without PK are not going to properly work with RBR. I cannot reproduce your issue with any of the modern versions of MySQL but in any case, you will never have decent rbr with tables that are without PK.

best regards
Bogdan