Bug #2886 IO-Thread stops working at illegal master log position
Submitted: 19 Feb 2004 2:29 Modified: 8 Nov 2004 10:34
Reporter: Anders Henke
Status: Closed
Category: MySQL Server: Replication  Severity: S2 (Serious)
Version: 4.0.17  OS: Linux (GNU/Linux)
Assigned to: Guilhem Bichot  CPU Architecture: Any

[19 Feb 2004 2:29] Anders Henke
Description:
This bug report replaces report #2884.

Upon restarting a crashed/killed (exact circumstances as yet unknown) replication slave, MySQL restarted the replication threads and, about half a minute later, stopped its I/O thread at a position which cannot be found in the master's binlog, producing a "strange" error message:

[...]
InnoDB: In a MySQL replication slave the last master binlog file
InnoDB: position 0 617772, file name binlog.2473
InnoDB: Last MySQL binlog file position 0 2967730, file name /db/logs/binlog.067
040219  3:40:43  InnoDB: Flushing modified pages from the buffer pool...
040219  3:40:58  InnoDB: Started
/usr/sbin/mysqld: ready for connections.
Version: '4.0.17-standard-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306
040219  3:40:58  Slave SQL thread initialized, starting replication in log 'binlog.2473' at position 617772, relay log './rdb88-relay-bin.065' position: 617772
040219  3:40:58  Slave I/O thread: connected to master 'repl@rdb84.schlund.de:3306',  replication started in log 'binlog.2473' at position 710852
040219  3:41:22  Error in Log_event::read_log_event(): 'read error', data_len: 158, event_type: 2
040219  3:41:22  Error reading relay log event: slave SQL thread aborted because of I/O error
040219  3:41:22  Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
040219  3:41:22  Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.2473' position 705697

When the master's binlog is inspected manually with mysqlbinlog, there is no such log position 705697 (but positions 705670 and 705773 do exist and are dated 3:40:07).
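For reference, one way to verify this is a minimal check like the following (a sketch; it assumes a local copy of the binlog and relies on the '# at <position>' marker lines that mysqlbinlog prints before each event):

  mysqlbinlog binlog.2473 | grep -c '^# at 705697$'
  # prints 0: no event starts at position 705697
  mysqlbinlog binlog.2473 | grep -c '^# at 705670$'
  # prints 1: this event boundary exists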

Every 10 minutes (so at 3:50), a cronjob checks whether replication is fine.
If either the I/O thread or the SQL thread is not running and the last error message looks like the one above, this 'repfix' cronjob
- issues STOP SLAVE,
- reads the last Exec_master_log_pos and Relay_Master_Log_File from
  'show slave status',
- performs a RESET SLAVE, and then
- issues a 'CHANGE master to' pointing to the same file and log position
  obtained earlier from 'show slave status' (see the sketch below).
This is done to automatically fix e.g. a crashed relay log or errors similar to the one described here.
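A minimal shell sketch of that sequence (hypothetical, not the actual repfix script; field names are as printed by SHOW SLAVE STATUS in 4.0, connection options and error handling are omitted):

  #!/bin/sh
  # Read the last executed master position from SHOW SLAVE STATUS.
  FILE=`mysql -e 'SHOW SLAVE STATUS\G' | awk '/Relay_Master_Log_File/ {print $2}'`
  POS=`mysql -e 'SHOW SLAVE STATUS\G' | awk '/Exec_master_log_pos/ {print $2}'`
  # Throw away the (possibly corrupted) relay logs,
  # then re-point the slave at the last executed master position.
  mysql -e "STOP SLAVE"
  mysql -e "RESET SLAVE"
  mysql -e "CHANGE MASTER TO MASTER_LOG_FILE='$FILE', MASTER_LOG_POS=$POS"
  mysql -e "START SLAVE"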

The MySQL error log from that time looks much stranger:

040219  3:50:01  Slave SQL thread initialized, starting replication in log 'binlog.2473' at position 705697, relay log './rdb88-relay-bin.065' position: 706231
040219  3:50:01  Error in Log_event::read_log_event(): 'read error', data_len: 158, event_type: 2
040219  3:50:01  Error reading relay log event: slave SQL thread aborted because of I/O error
040219  3:50:01  Slave: Could not parse relay log event entry. [...]
040219  3:50:01  Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.2473' position 705697
040219  3:50:01  Slave I/O thread killed while reading event
040219  3:50:01  Slave I/O thread exiting, read up to log 'binlog.2473', position 980512
040219  3:50:01  Slave SQL thread initialized, starting replication in log 'binlog.2473' at position 705697, relay log './rdb88-relay-bin.001' position: 4
040219  3:50:01  Slave I/O thread: connected to master 'repl@rdb84.schlund.de:3306',  replication started in log 'binlog.2473' at position 705697
040219  3:50:01  Error reading packet from server: log event entry exceeded max_allowed_packet; Increase max_allowed_packet on master (server_errno=1236)
040219  3:50:01  Got fatal error 1236:

The I/O thread says that it could read up to 980512, which doesn't match
the position 705697 that repfix read from 'show slave status' (as can be seen
in 'replication started in log ...') and that also appears in the error log from 3:41.
repfix tried to reset to 705697 for a few hours, obviously without success.

I manually set the slave back to position 617772 this morning and so replayed all transactions (the applications writing data to the database are designed to allow this); replication then successfully replayed the last few hours of master binlogs without any error at all.
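The manual reset boiled down to something like this (a sketch; connection options omitted, file and position taken from the 3:40 error log above):

  mysql -e "STOP SLAVE"
  mysql -e "CHANGE MASTER TO MASTER_LOG_FILE='binlog.2473', MASTER_LOG_POS=617772"
  mysql -e "START SLAVE"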

How to repeat:
Unsure.

Similar replication-breaking errors happen on one or more random slaves in a replication setup of 12 hosts at around 3:40 am; it doesn't happen every day, but at least three times a week at our site. Most cases can be fixed automatically (see the 'repfix' description above), but this one this morning was quite hard.

At the time of occurrence, a
"/usr/bin/mysqlcheck --repair --all-databases --medium-check --silent" was running:
this usually takes about 20 minutes and started at 3:30. 99% of the data in
the affected databases is stored in InnoDB tables, including a few tables in the range of 4-5 GB.
After the --repair, /usr/bin/mysqlcheck --optimize --all-databases --silent
is run. This takes just a few seconds, as OPTIMIZE shouldn't affect
InnoDB tables that much ...

The affected hosts run with 2 GB of ECC memory and still have swap space available; the problems have so far occurred only on the slave hosts, never on the master.

Suggested fix:
Unknown.
A copy of the full MySQL error log and a copy of the master's binlog.2473 have been taken and are available on request for MySQL AB developers. The stored data contains private information, so publicly posting binlog.2473 is not possible.
[19 Feb 2004 3:15] Guilhem Bichot
Hi,
Yes, we would appreciate a compressed archive (zip, tar.gz, tar.bz2) of the error log and binlog.2473. I intend to look into all those displayed positions.
Please, if possible, put it in ftp://support.mysql.com/pub/mysql/secret/
(this is an area readable only by the developers of MySQL AB) under a name such as bug2886.proper_extension.

Thank you!
Guilhem
[19 Feb 2004 3:19] Anders Henke
bug2886.tar.gz containing mysql.err.1 and binlog.2473 uploaded to ftp://support.mysql.com/pub/mysql/secret/
[19 Feb 2004 8:07] Guilhem Bichot
Thanks! I will inspect them.
[20 Feb 2004 6:02] David Newcomb
During this problem, starting the slave with "START SLAVE" does
not throw any errors. Even if you explicitly try to start the
slave I/O thread, it still reports success.

mysql> show slave status\G
     Slave_IO_Running: No
    Slave_SQL_Running: No
mysql> start slave;
Query OK, 0 rows affected (0.01 sec)
mysql> show slave status\G
     Slave_IO_Running: No
    Slave_SQL_Running: Yes
mysql> start slave io_thread;
Query OK, 0 rows affected (0.02 sec)
mysql> show slave status\G
     Slave_IO_Running: No
[20 Feb 2004 6:04] David Newcomb
Sorry - 4.0.16 / Win2000
[22 Feb 2004 2:38] Guilhem Bichot
I have updated BUG#1858 (which you had reported), which could be involved here (as the replication problems seem to appear roughly at the time of REPAIR/OPTIMIZE TABLE). However, there is probably a 2nd bug involved (because cleanly stopping the slave threads, which is what BUG#1858 was doing, is not a reason for relay logs to become corrupted) that I will study. But in your case the latest fix for BUG#1858 could keep the 2nd bug from showing up.
[28 Feb 2004 15:13] Guilhem Bichot
Hi,

> 040219  3:41:22  Error running query, slave SQL thread aborted. Fix the problem,
> and restart the slave SQL thread with "SLAVE START". We stopped at log
> 'binlog.2473' position 705697
> When manually viewing the master's binlog with mysqlbinlog, there is no such log
> position 705697 (but 705670 and 705773 do exist and are dated 3:40:07).

Yes, this is the first symptom of the problem, which explains all the failing attempts your automated script made. Exec_master_log_pos=705697 is a wrong position, so the script could not succeed using this value.
I have looked at your binlog but have not finished. A couple of questions:
- did this start happening (or frequency increased) after an upgrade?
- does it still occur?
[29 Feb 2004 14:49] Guilhem Bichot
Here are the final conclusions on this issue:
- The slave crashed, which means that the end of the relay log, which was in memory, not on disk, was lost. master.info said that the slave I/O thread had read up to position 710852 in the binlog, but in fact the relay log ended at around 705500; the last statement in it was truncated (again because the end of this statement was in memory, not on disk; for speed, the relay log code uses some caching, which of course has a safety drawback).
- The slave restarted. The slave SQL thread restarted from position 617772 in the relay log (which was also 617772 in the binlog). It replayed statements until it found the truncated one around 705500: then it said "read error", "relay log corrupted" and stopped, which is all normal. Note that 705500 < 710852, which is one more hint that the statement was truncated before the server's crash.
- Unfortunately (and that's the only unexpected thing), the error message contained position 705697, which does not exist in the binlog, thus preventing 'repfix' from resolving the problem.
So why this weird position... It corresponds to the position of the BEGIN (705517) plus the length (180) of the INSERT statement at position 706051: 705517 + 180 = 705697. The only reasons I could come up with are:
- was the slave's server id changed around the time of crash, or the time of restart? Could it have been changed to 84?
- was the SQL_SLAVE_SKIP_COUNTER variable used around these moments?
- was the slave (even briefly) started with --skip-innodb at some moment?
I also slightly changed a bit of code in 4.0.19; you may want to give it a try on your slave when it is released (probably not for several weeks).
And the unexpected stops of replication caused by OPTIMIZE TABLE / REPAIR TABLE are fixed in 4.0.19 as well.
Good luck!
[29 Feb 2004 15:40] Guilhem Bichot
By the way, I was surprised by this line:
040219  3:40:58  Slave SQL thread initialized, starting replication in log 'binlog.2473' at position 617772, relay log './rdb88-relay-bin.065' position: 617772
Usually the binlog position and relay log position do not match that exactly, because the relay log usually contains a few more events than the binlog (for example, one additional Rotate event every time the slave I/O thread connects to the master). Of course, it can happen that the two positions are equal at some point; it's just not seen that often.
[29 Feb 2004 23:41] Anders Henke
I've manually installed 4.0.18 (binaries taken from the mysql.com RPMs) on one slave, but that didn't solve the issue (as the changelogs don't mention any issue close to this one, I would have been quite surprised if 4.0.18 had fixed this case ...). Hopefully 4.0.19 resolves both this bug and #1858.

Ok, back to the questions from 28/29th:
The problem doesn't occur more often after an upgrade; it occurs every
night on at least one (different) slave, sometimes on up to three at the
same time.
At the moment it doesn't occur, but I've also commented out the daily runs of
OPTIMIZE/REPAIR on the slave hosts, so the problem seems to lie with those
administrative commands.

Repfix doesn't (yet) play around with SQL_SLAVE_SKIP_COUNTER.
As the problems appear in the middle of the night and I'm one of the two
local gurus both investigating the problem and managing those hosts,
I also doubt that someone is manually changing the server id or other server settings.
As the "valuable" databases on the slaves are close to 100% InnoDB, nobody
here has even come close to suggesting the use of skip-innodb,
so that should also be out of the question :)
[29 Feb 2004 23:45] Anders Henke
I've also experienced the "START SLAVE doesn't give errors" behaviour reported by David Newcomb on Feb 20th when manually getting the servers to replicate again.
I thought of 'START SLAVE' as a command which only schedules the start of the two slave threads and returns "ok" when that scheduling has succeeded, so I didn't think it worth mentioning ...
[1 Mar 2004 1:34] Guilhem Bichot
Hi,
Let's hope 4.0.19 resolves this bug. For future troubleshooting, it is a good idea to add this to repfix: before RESET SLAVE, save the relay logs somewhere (they are invaluable for debugging what happened on the slave).
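A possible sketch of such a step (hypothetical; the datadir and destination paths are assumptions, your layout may differ):

  # Archive the relay logs (and relay-log.info) before RESET SLAVE discards them.
  DEST=/var/tmp/relaylogs.`date +%Y%m%d-%H%M`
  mkdir -p $DEST
  cp /var/lib/mysql/rdb88-relay-bin.* /var/lib/mysql/relay-log.info $DEST/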
About "start slave doesn't give errors" reported by David Newcomb on Feb 20th: you are just right, 'start slave' is a command which only schedules the start of the two slave threads and returns "ok" when that scheduling has succeeded. It would report if threads failed to start, but in your case the slave SQL thread started, opened the relay log, tried to read an event from the relay log and then stopped. So it is not a bug, just a behaviour which surprises people sometimes (and changing this behaviour would probably break some existing scripts of users, so can't be done now).
[1 Mar 2004 1:37] Guilhem Bichot
When you say "The problem doesn't occur more often after an upgrade, it does occur every night on at least one (different) slave, sometimes on up to three at the same time", which problem exactly? Do you mean that the position is wrong and 'repfix' can't restart replication, or just that replication stops and can be successfully restarted (which is BUG#1858)?
[1 Mar 2004 2:56] Anders Henke
"problem doesn't occur more often after an upgrade" was somehow misunderstanding; it referse to Bug 2886 and means that the problem doesn't appear more often after an upgrade than usually and the problem doesn't seem to be related to an upgrade.
[1 Mar 2004 4:16] Guilhem Bichot
This:
"the Exec_Master_Log_Pos position is wrong and so 'repfix' FAILS TO restart replication":
does it occur every night or so, on several slaves, or much more rarely?
[1 Mar 2004 5:41] Anders Henke
It doesn't happen every night on all slaves, but it shows up almost every night on at least one of the slaves, so still with a probability of about 10-20% (given our setup).
[19 May 2004 13:59] Michael Schmidt
I had the same problem and after updating the slave to 4.0.20 it disappeared.
[8 Nov 2004 10:34] Guilhem Bichot
Hi Anders,
I'm closing this bug as it hasn't had any feedback for 6 months.
If you have done more tests with the latest versions (4.0, or even 4.1, which is now production), feel free to mention them here and reopen the bug.
Thanks!