MySQL Bugs: #9726: Wrong position of replication

Bug #9726	Wrong position of replication
Submitted:	7 Apr 2005 16:40	Modified:	9 May 2006 17:33
Reporter:	Zbynek Helcl	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S2 (Serious)
Version:	4.1.9	OS:	WXP SP2
Assigned to:	Paul DuBois	CPU Architecture:	Any

Description:
I've got an error with replication - master went off because of power outage and SLAVE wrote himself wrong position of last reading. Correct possition was 248713, but after reconnecting slave tried to read possition 3771214.
This is first time of this strange behavior, but even so it may repeat any time again in the future.

This is part of SLAVE log-file.

050406 21:58:27 [Note] Slave SQL thread initialized, starting replication in log 'orion-bin.000020' at position 248713, relay log '.\martin-relay-bin.000060' position: 5661749
050406 21:58:31 [Note] Slave I/O thread: connected to master 'rapl@10.0.0.3:3306',  replication started in log 'orion-bin.000020' at position 248713
050407 16:49:24 [ERROR] Error reading packet from server: Lost connection to MySQL server during query (server_errno=2013)
050407 16:49:24 [ERROR] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'orion-bin.000020' position 3771214
050407 16:49:24 [Note] Slave: connected to master 'rapl@10.0.0.3:3306',replication resumed in log 'orion-bin.000020' at position 3771214
050407 16:49:24 [ERROR] Error reading packet from server: Client requested master to start replication from impossible position (server_errno=1236)
050407 16:49:24 [ERROR] Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log
050407 16:49:24 [ERROR] Slave I/O thread exiting, read up to log 'orion-bin.000020', position 3771214

How to repeat:
I'm not sure how to repeat, it was first time I got this error.

Looking on master server gave me some new information - orion-bin.000020's size is 3771120B, so mistake is NOT at SLAVE, but at MASTER's filesystem (NTFS).

Not enough information was provided for us to be able
to handle this bug. Please re-read the instructions at
http://bugs.mysql.com/how-to-report.php

If you can provide more information, feel free to add it
to this bug and change the status back to 'Open'.

Thank you for your interest in MySQL.

Hi...

I have the same problem, but I don't know how to repeat it, and it is not the first time.

my version of the MySQL is 4.1.10 run in GNU/Linux [slave] and MSW 2000 [Master].

the error:
--
050505 10:05:15 [Note] Slave SQL thread initialized, starting replication in log 'bin.000018' at position 292006661, relay log './relay-bin.000004' position: 4
050505 10:05:16 [Note] Slave I/O thread: connected to master 'user@xxx.xxx.xxx.xxx:3306',  replication started in log 'bin.000018' at position 292006661
050505 10:05:18 [ERROR] Error reading packet from server: Client requested master to start replication from impossible position (server_errno=1236)
050505 10:05:18 [ERROR] Got fatal error 1236: 'Client requested master to start replication from impossible position' from master when reading data from binary log
050505 10:05:18 [ERROR] Slave I/O thread exiting, read up to log 'bin.000018', position 292006661
050505 10:07:14 [ERROR] Error reading relay log event: slave SQL thread was killed
--

My file "my.cnf" in the SLAVE :
--
[mysqld]
server-id=2
datadir=/var/lib/mysql
socket=/var/lib/mysql/mysql.sock
master-host=xxx.xxx.xxx.xxx
master-user=user
master-password=pass
replicate-do-table=db.t1
replicate-do-table=db.t2
replicate-do-table=db.t3
replicate-do-table=db.t4
slave-skip-errors=all
master-connect-retry=1
slave_net_timeout=120
                                                                                                                                                           
[mysql.server]
user=mysql
basedir=/var/lib
                                                                                                                                                           
[safe_mysqld]
err-log=/var/log/mysqld.log
pid-file=/var/run/mysqld/mysqld.pid
--

I believe that this behaviour happen when the server change at other bin-log-file ! but I am not sure.

I update the last version of MySLQ and I will probed if this work fine.

---
thanks for the attention lent...

Does your error also occur afrer master server failure and reboot? What is the size of 'bin.000018' at master? What file system do you use?
Seems like bad filesize reported from filesystem...

HI...

I sorry very much for the long time for this reply BUT I can't do it before...

[quote]
Does your error also occur afrer master server failure and reboot? 
[/quote]

The master I think, is work fine and I don't need reboot...

[quote]
What is the size of 'bin.000018' at master? 
[/quote]

280 MB

[quote]
What file system do you use?
[/quote]

NTFS...

---
thanks for the attention lent...

I've run into this too.  My thought is that the binlog is getting sent over the network to the slave, but the the local disk write to the binlog on the master doesn't get flushed before the power goes down.  Then, when the power comes back up, the binlog on the master is missing a "block" of events, which the slave (whose power didn't go down and thus did flush the contents of the relay-log + master.info) has.  When the slave reconnects, it then uses the position it knows it read up until, but the master never saved it.

I tried reading up a bt on the internals to see if I could confirm this in the source cde, but I think I got a bit lost - on the slave side, I see there are 2 loops in the I/O thread - the first reads existing binary logs, and the second waits for individual events.  It's the second loop that I'm worried about, and I'm not sure whether the master sends these events from a memory buffer ("on-the-fly", cached fromthe IO_CACHE, etc - which would agree with my suspicion), or from a physical read from the binlog (which would prove that I'm barking up the wrong tree)

I think that it might be enough to send the slave a log_rotation event to fix it - since the binary logs on the master get rotated at startup (which we're guaranteed to have if it went down hard), we can safely ignore the rest of the old binlog file [best case scenario: we were at the end of the file anyway and could have rolled over; worst case scenario: there's a partially written entry at the end if there was a flush in process when the system went down, leaving a corrupted binary file, which we're now safely ignoring], and keep the master and slave in sync (assuming we didn't lose table data in the same crash - not sure if table data gets flushed with the binlog data or on its own).  The only catch is, we still have possibly corrupt binary logs left on the master side which will be unfit for recreating an up-to-date snapshot of the master data...

Thoughts?

See http://forums.mysql.com/read.php?26,43384,43384

Just a quick update - after looking through the documentation to find a way to flush the binlog buffer more often (and skimming over the solution many times without even realizing what I was seeing) I eventually found the sync_binlog option.  In general, I've always found it difficult to find specific options in the docs (although that's a bit beyond the scope of this bug entry) - in this specific case, however, it might have helped to mention this option somewhere in the chapter on replication - perhaps in the features and bugs section just before the point about innodb-safe-binlog.

I still think that some "opt-in" way to allow the slave to auto-recover as I wrote in my previous comment would be helpful.

changed to "documentation problem", sync_binlog should be made more obvious

Thank you for your bug report. This issue has been addressed in the
documentation. The updated documentation will appear on our website
shortly, and will be included in the next release of the relevant
product(s).

Additional info:

I've added a note to http://dev.mysql.com/doc/refman/5.0/en/replication-features.html
describing the role of setting sync_binlog=1 as a means of reducing the possibility
of getting a mismatch between master and slave due to a master crash.

I also had this problem!

I fixed it by syncing the time and date of the servers... There was a difference of 1 hour...