MySQL Bugs: #35785: Replication stops after upgrading to 5.1.23

Bug #35785	Replication stops after upgrading to 5.1.23
Submitted:	3 Apr 2008 7:32	Modified:	9 May 2008 10:19
Reporter:	Stefan Hinz	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S3 (Non-critical)
Version:	5.1.23	OS:	Any
Assigned to:	Jon Stephens	CPU Architecture:	Any

Description:
After upgrading from 5.1.22 to 5.1.23, replication stopped for all my slave servers. This looks like a bug that could be a regression of Bug#28597 which has been fixed some time ago. SHOW SLAVE STATUS gives me this:

Slave_IO_State: Waiting to reconnect after a failed master event read
 Master_Host: themis
...
 Last_IO_Errno: 2013
 Last_IO_Error: error reconnecting to master 'replikator@themis:3306' - retry-time: 60  retries: 86400
 Last_SQL_Errno: 0
 Last_SQL_Error:

On the slave hosts, the error logs end like this:

...
080324  1:39:05 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'FIRST' at postion 4
080324  1:39:05 [ERROR] Error reading packet from server: File './themis-bin.000001' not found (Errcode: 2) ( server_errno=29)
080324  1:40:05 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'FIRST' at postion 4
080324  1:40:05 [ERROR] Error reading packet from server: File './themis-bin.000001' not found (Errcode: 2) ( server_errno=29)
080324  1:41:05 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'FIRST' at postion 4
080324  1:41:05 [ERROR] Error reading packet from server: File './themis-bin.000001' not found (Errcode: 2) ( server_errno=29)

On the master host (themis), the error log ends like this:

...
080324  1:37:05 [ERROR] Failed to open log (file './themis-bin.000001', errno 2)
080324  1:38:05 [ERROR] Failed to open log (file './themis-bin.000001', errno 2)
080324  1:39:05 [ERROR] Failed to open log (file './themis-bin.000001', errno 2)
080324  1:40:05 [ERROR] Failed to open log (file './themis-bin.000001', errno 2)

On the master host, the themis-bin.index file ends like this:

...
/var/lib/mysql/themis-bin.000118
/var/lib/mysql/themis-bin.000119
./themis-bin.000120
./themis-bin.000121
./themis-bin.000122
./themis-bin.000123
./themis-bin.000124

Looks like someone has changed the format from absolute to relative paths.

What puzzles me is that the error log entry indicates that the slaves are trying to read the master's *very first* binlog file; I've removed many old binlog files a long time ago with no issues so far. Is that an unrelated bug?

How to repeat:
Upgrade a master-slave replication setup from 5.1.22 to 5.1.23 and see replication slaves stop replicating.

Suggested fix:
Looks like the format of the <host>-bin.index file entries should be changed back from relative to absolute path names.

Thank you for the report.

This incompatibility is result of bug #28597 and fix of it.

Before bug #28597 was introduced binlog index file contained relative paths like

./themis-bin.000074

In time when bug #28597 existed binlog index file containde absolute paths like:

/var/lib/mysql/themis-bin.000076

Bug #28597 was fixed in version 5.1.23, so binlog index file contains relative paths again.

So behavior is correct.

But manual does not have an alert about upgrade between these versions could lead to replication broke. So I set category of the report to "Documentation"

Thanks for turning this into a docs problem! ;-) However, I can't see what the fix is supposed to be. Or should we just state in the docs that you shouldn't upgrade because this will break replication? I don't think so. So please indicate what the docs should say to avoid the problem.

I think docs should alert:

upgrade master from earlier versions to 5.1.18-5.1.22 can lead to creation of invalid binary log index file

upgrade master from versions 5.1.18-5.1.22 to 5.1.23 can lead to creation of invalid binary log index file

Same for 5.0 and 6.0 branches.

Maybe some workaround like "use log-bin=explicit_name" or "reset master" should be indicated.

Maybe mention a workaround?? Because if there's no workaround I wouldn't consider this a documentation bug. Think of the consequences if there's no workaround: Users with a replication setup effectively can't update MySQL!

I am sorry, missed your workaround from bug #28597.

(Workaround: After manually changing the last line recorded by the "old" (5.1.16) server
from:

./themis-bin.000075

to:

/var/lib/mysql/themis-bin.000075

and restarting the slave server, replication resumes with no issues.)

So for this bug one should reverse execute this action: rename /var/lib/mysql/themis-bin.000075 to themis-bin.000075

The workaround I had indicated for the old bug doesn't work. Indeed, that was the first thing I tried, but to no avail.

See also my original remark about another (unrelated?) bug. From looking at the error look, I can see that the slaves are trying to read the *very first* binlog file from the master (that's gone away a long time ago). I can't say whether or not that's the reason why the workaround doesn't work this time but I think it is.

I'm afraid we need a different workaround this time - please indicate what that should be.

Stefan,

Did you remove the references to the log files that don't actually exist any longer from the index file?

Also, I'm not sure whether or not this is applicable, but you might want to look at http://dev.mysql.com/doc/refman/5.1/en/purge-master-logs.html - especially the last paragraph, added just a couple of days ago, and the bugs referenced therein (Bug #18199, Bug #18453) .

I couldn't make replication start again while I was on MySQL 5.1.23. Upgrading to 5.1.24 magically solved the problem: Without having done anything (except upgrading using MySQL's RPM files) replication started again, and slaves caught up to the master's current position as if the problem had never existed. I'm stunned.