Bug #73074 Upgrade from 5.6.20 -> 5.6.21; Replication; 1236 Found old binary log w/o GTID
Submitted: 22 Jun 2014 10:48
Modified: 10 Dec 2016 18:34
Reporter: Van Stokes
Status: No Feedback
Category: MySQL Server: Replication
Severity: S1 (Critical)
Version: 5.6.21 x64, 5.6.22
OS: Any (Windows, Linux)
Assigned to:
CPU Architecture: Any
Tags: 1236, GTID, replication, upgrade

[22 Jun 2014 10:48] Van Stokes
Description:
Configuration: Four masters in circular (looped) replication.
Binlog format: mixed
OS: Ubuntu 12.04 LTS, 14.04 LTS, and Windows Server 2008
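For context, the GTID-related settings involved can be checked on every server like this (a sketch only; the exact values come from each server's my.cnf):

SHOW GLOBAL VARIABLES
 WHERE Variable_name IN ('gtid_mode', 'enforce_gtid_consistency',
                         'log_slave_updates', 'log_bin',
                         'binlog_format', 'server_id');
-- expected on this topology: gtid_mode=ON, enforce_gtid_consistency=ON,
-- log_slave_updates=ON, binlog_format=MIXED, unique server_id per server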

Upgraded all MySQL servers from 5.6.17 to 5.6.19. The upgrade broke replication, which was working fine prior to the upgrade. From the error log:

2014-06-21 18:05:43 32319 [Note] Slave I/O thread: connected to master 'xxxxxxx@yyy-mysql02.mydomain.com:3306',replication started in log 'master-bin.000336' at position 72188042
2014-06-21 18:05:44 32319 [ERROR] Error reading packet from server: Found old binary log without GTIDs while looking for the oldest binary log that contains any GTID that is not in the given gtid set ( server_errno=1236)
2014-06-21 18:05:44 32319 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Found old binary log without GTIDs while looking for the oldest binary log that contains any GTID that is not in the given gtid set', Error_code: 1236
2014-06-21 18:05:44 32319 [Note] Slave I/O thread exiting, read up to log 'master-bin.000336', position 72188042

All MySQL servers exhibit this error, including end-point slaves.

How to repeat:
Configure GTID replication using 5.6.17 and then upgrade to 5.6.19.
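Roughly (host and user names below are hypothetical; the real topology is the four-master circle described above):

CHANGE MASTER TO
    MASTER_HOST = 'upstream-master.example.com',  -- hypothetical
    MASTER_USER = 'repl',                         -- hypothetical
    MASTER_PASSWORD = '...',
    MASTER_AUTO_POSITION = 1;                     -- GTID-based positioning
START SLAVE;
-- then: STOP SLAVE, shut down mysqld, upgrade the packages, restart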

Suggested fix:
Not sure. Still investigating how to recover without performing a restore.
[22 Jun 2014 11:35] Van Stokes
I believe the problem is in sql/binlog.cc in read_gtids_from_binlog().

Here is our GTID_EXECUTED:

69cf02cd-1731-11e3-9a19-002590854928:1-55306969,
708bb615-d393-11e3-a682-003048c3ab22:1-13491133,
819c985c-d384-11e3-a621-00259002979a:1-1162440,
9204e764-d379-11e3-a5d9-0013726268ea:1-2431

9204e764-d379-11e3-a5d9-0013726268ea is the local MySQL server.
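For reference, a set like the one above can be read on each server with:

SELECT @@GLOBAL.gtid_executed;
SHOW MASTER STATUS;   -- Executed_Gtid_Set column
SHOW SLAVE STATUS\G   -- Retrieved_Gtid_Set / Executed_Gtid_Set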

I may be mistaken, as I don't have the source installed in a DEV environment to step through it, but it appears to me that the logic is attempting to resolve ALL of the GTID sets within the same binlog file. Therefore, if a file does not contain a GTID for any of the sets (four in this case), it fails; it never searches the other, earlier binlog files.

In our case, it is very possible that a binlog will NOT contain transactions (i.e. GTIDs) for some or all of the sets. For example, this server (9204e764-d379-11e3-a5d9-0013726268ea) is located at our DR site and does not execute transactions unless the site is made active. However, it remains in the replication loop to stay current.
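If someone wants to check whether a given binlog on the master actually carries GTID events, something like this should show it (file name taken from the error log above):

SHOW BINARY LOGS;
SHOW BINLOG EVENTS IN 'master-bin.000336' LIMIT 5;
-- the first events should include a Previous_gtids entry; a file written
-- before GTID mode was enabled would lack Gtid events entirely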
[11 Aug 2014 11:37] Van Stokes
This problem still persists. We upgraded a slave server from 5.6.19 to 5.6.20 and this error happened again. The slave was working fine and was completely synced with the masters prior to the upgrade. After the upgrade we got error 1236 again, and it was non-recoverable. We have attempted several suggestions found on the web and none of them have worked. It appears the only solution is to dump and reload from the master.
[30 Sep 2014 15:06] Van Stokes
And the same thing happened when upgrading from 5.6.20 to 5.6.21.
READ-ONLY Slave server is failing with this error:

2014-09-30 11:02:12 12018 [Note] Slave SQL thread initialized, starting replication in log 'FIRST' at position 0, relay log './slave-relay-bin.000001' position: 4
2014-09-30 11:02:12 12018 [Note] Slave I/O thread: connected to master 'rs_2001@atl-mysql02.econocaribe.com:3306',replication started in log 'FIRST' at position 4
2014-09-30 11:02:12 12018 [ERROR] Error reading packet from server: Found old binary log without GTIDs while looking for the oldest binary log that contains any GTID that is not in the given gtid set ( server_errno=1236)
2014-09-30 11:02:12 12018 [ERROR] Slave I/O: Got fatal error 1236 from master when reading data from binary log: 'Found old binary log without GTIDs while looking for the oldest binary log that contains any GTID that is not in the given gtid set', Error_code: 1236
2014-09-30 11:02:12 12018 [Note] Slave I/O thread exiting, read up to log 'FIRST', position 4

Here is the Executed GTID Set:

69cf02cd-1731-11e3-9a19-002590854928:1-68880629,
708bb615-d393-11e3-a682-003048c3ab22:1-17851697,
78ae4d94-d37a-11e3-a5df-005056a25fd0:1-25,
819c985c-d384-11e3-a621-00259002979a:1-7183187,
9204e764-d379-11e3-a5d9-0013726268ea:1-24

I have tried STOP SLAVE -> RESET SLAVE -> START SLAVE
and it will not start.
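Spelled out, that was:

STOP SLAVE;
RESET SLAVE;
START SLAVE;
SHOW SLAVE STATUS\G   -- Last_IO_Errno: 1236 comes right back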

MASTER server (also upgraded to 5.6.21) is running fine.
[6 Nov 2014 6:25] MySQL Verification Team
Hello Van,

Thank you for the report.
I could not reproduce this issue at my end.
Could you please help us reproduce this issue further and provide the master/slave config files (please make them private if you prefer) along with exact repeatable steps?

Thanks,
Umesh
[7 Dec 2014 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[9 Dec 2014 21:36] фыв йцуйцу
I have the same issue with a slave server after upgrading 5.6.20 -> 5.6.21. The slave has SQL_Delay set to 129600 seconds.
[17 Jan 2015 22:39] Van Stokes
Just had the same error AGAIN after upgrading from 5.6.21 to 5.6.22.

Got fatal error 1236 from master when reading data from binary log: 'Found old binary log without GTIDs while looking for the oldest binary log that contains any GTID that is not in the given gtid set'

All slaves failed.
[19 Jan 2015 13:23] Van Stokes
Master my.cnf configuration file.

Attachment: master.my.cnf (application/octet-stream, text), 8.45 KiB.

[19 Jan 2015 13:23] Van Stokes
Slave (and master) my.cnf configuration file

Attachment: slave.my.cnf (application/octet-stream, text), 8.47 KiB.

[22 Jan 2015 21:14] Sveta Smirnova
Thank you for the report.

Have you purged or manually deleted binary logs? Have you ever switched from GTID mode to "regular" replication after setting it up?
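For example (sketch), the answers can be checked on each server with:

SHOW GLOBAL VARIABLES LIKE 'gtid_mode';
SELECT @@GLOBAL.gtid_purged;   -- non-empty if binary logs holding GTIDs were purged
SHOW BINARY LOGS;              -- compare with the files on disk to spot manual deletion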
[23 Jan 2015 12:52] Van Stokes
No. None of the above.

All we did to perform the upgrade was:
1) stop replication (STOP SLAVE)
2) shutdown MySQL server (service mysql stop)
3) perform the upgrade (apt-get update ...)
4) start MySQL server (service mysql start)

and the error occurred on all servers.

We did try just a RESET SLAVE on all servers, but that didn't work. We then recorded all of the Executed GTIDs (per server) and performed a RESET SLAVE ALL followed by a CHANGE MASTER and re-setting the Executed GTIDs, but that did not work either.

In order to "fix" the problem, we had to perform a MASTER RESET and a SLAVE RESET ALL on all servers. I shouldn't have to tell you what a catastrophic action this was.

You should be aware that we have FOUR (4) MASTER servers in circular replication (A->B->C->D->A), with each having one or more READ-ONLY slaves. All servers have the same my.cnf settings except for those settings that are server specific.

I have a sneaking suspicion it has something to do with a GTID being consumed during the MySQL shutdown process and not being (properly?) recorded in the binary log of the MySQL server. See this bug report:

"Server consumes a GTID on shutdown - slaves show missing executed GTID"
http://bugs.mysql.com/bug.php?id=74687

I think what happened is that the MySQL server consumed a GTID that wasn't (properly?) recorded in its binary log. At startup, the slave I/O thread looks for a GTID that doesn't exist in the (first? most recent?) master's binary log and then gives up - or something to that effect. But this could be a red herring too, so I defer to your expertise.
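One way to check that suspicion (sketch; the GTID set strings are placeholders):

-- on master and slave, before and after the restart:
SELECT @@GLOBAL.gtid_executed, @@GLOBAL.gtid_purged;
-- compare the two sets in both directions:
SELECT GTID_SUBTRACT('<master gtid_executed>', '<slave gtid_executed>') AS needed_by_slave;
SELECT GTID_SUBTRACT('<slave gtid_executed>', '<master gtid_executed>') AS unknown_to_master;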

The error message is far too ambiguous for us to troubleshoot. If possible, the error message should be modified to make this issue clearer and easier to troubleshoot. If applicable, it should include the server id and the GTID(s) that are causing the issue.
[10 Nov 2016 18:34] MySQL Verification Team
Please check if you are getting the same issue upgrading to latest release 5.6.34. Thanks.
[11 Dec 2016 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".