MySQL Bugs: #95395: Improve logging of Error 1236 (missing GTIDs on master)

Bug #95395	Improve logging of Error 1236 (missing GTIDs on master)
Submitted:	16 May 2019 18:58	Modified:	3 Sep 2019 4:40
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S4 (Feature request)
Version:	5.7 but I think 8.0 does this too	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
What starting a slave I get the following error message:

2019-05-16T18:46:20.411527Z 22024 [ERROR] Slave I/O for channel '': Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Error_code: 1236

I bump into this from time to time and to validate which entries are missing takes a bit more time. Another slave or intermediate master may have these values so it's easier to search gtid values on other servers to see if I can find an alternative server for initial "catch up".

How to repeat:
Purge some binlogs on a master with a slave that's "too far behind" and issue start slave.
See the warning message.
To get the information you need you need to go onto the slave to execute a few commands, do the same on the the master, and then you can figure out which missing GTIDs are involved.

Doing this manually takes time.
Providing the information in the slave's log means there is more auditing information which is convenient and it helps speed up diagnosis of errors and issues.

Suggested fix:

It would be most helpful to provide in the error message more complete reporting:
* the upstream master hostname / server_id / server_uuid
* the GTID_EXECUTED values for master and slave
* the GTID_PURGED value of the master
* this can be implied but it saves people doing "calculations": the explicit list of missing GTIDs that the slave needs and the master does not have.

Hello Simon,

Thank you for the feature request.

regards,
Umesh

[25 July 2019 3:09] Sven Sandberg 

Thanks for this bug report, Simon. I agree that this message is not optimal.

It was partially improved in 8.0, so the message in the *master* log is now:

Cannot replicate to server with server_uuid='%.36s' because the present
server has purged required binary logs. The connecting server needs to
replicate the missing transactions from elsewhere, or be replaced by a new
server created from a more recent backup. To prevent this error in the
future, consider increasing the binary log expiration period on the present
server. The missing transactions are '%s'.

... whereas on the *slave* you will see the following in SHOW SLAVE STATUS
and performance_schema.replication_connection_status:

Cannot replicate because the master purged required binary logs. Replicate
the missing transactions from elsewhere, or provision a new slave from
backup. Consider increasing the master's binary log expiration period. To
find the missing transactions, see the master's error log or the manual for
GTID_SUBTRACT.

I understand both that adding more information would be useful, and that
having it on the slave would be useful.

How much would a backport of the message above help? (i.e. a bit more
information, but not all that you request, and only in the master-side log)

Is your request strictly for 5.7, or would a solution in 8.0 help?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Sorry for delay in my response.

I still think that the following information is not provided and think it would be good to provide it.

Doing this (backporting) to 5.7 is NOT necessary, if changes are needed it is fine to go into the current GA release (currently 8.0).

So I see the master's logging is better: that's good.
* I would still like to see the ip address / port and server_id of the slave as that simplifies forensics.

For the slave it looks like this information is missing and I'd like to see it.
* Reasoning: it's the slave that has a problem replicating not the master, so the more complete information should be available on the slave to help diagnosis and resolution. That's not happening now. The helpful text about what to do does not give explicit information about GTID differences on master and slave, and as stated before it's also good to reference host / ip and port , plus uuid , server_id of the master too.
* Context: if you have tiered replication with multiple intermediate masters then the actual master being reached may not be completely obvious or may change a lot over time, so making this information visible ensures that you don't need to try to imply it from other information. The auditing side (the information being in the logs) is also invaluable.