Bug #2886 | IO-Thread stops working at illegal master log position | |
---|---|---|---
Submitted: | 19 Feb 2004 2:29 | Modified: | 8 Nov 2004 10:34 |
Reporter: | Anders Henke | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 4.0.17 | OS: | Linux (GNU/Linux) |
Assigned to: | Guilhem Bichot | CPU Architecture: | Any |
[19 Feb 2004 2:29]
Anders Henke
[19 Feb 2004 3:15]
Guilhem Bichot
Hi, yes, we would appreciate a compressed archive (zip, tar.gz, tar.bz2) of the error log and binlog.2473. I intend to look into all of the displayed positions. If possible, please put it in ftp://support.mysql.com/pub/mysql/secret/ (this is an area readable only by the developers of MySQL AB) under a name such as bug2886.proper_extension. Thank you! Guilhem
[19 Feb 2004 3:19]
Anders Henke
bug2886.tar.gz containing mysql.err.1 and binlog.2473 uploaded to ftp://support.mysql.com/pub/mysql/secret/
[19 Feb 2004 8:07]
Guilhem Bichot
Thanks! I will inspect them.
[20 Feb 2004 6:02]
David Newcomb
During this problem, starting the slave with "START SLAVE" does not throw any errors. Even if you explicitly try to start the slave IO thread it still reports success.

mysql> show slave status\G
Slave_IO_Running: No
Slave_SQL_Running: No

mysql> start slave;
Query OK, 0 rows affected (0.01 sec)

mysql> show slave status\G
Slave_IO_Running: No
Slave_SQL_Running: Yes

mysql> start slave io_thread;
Query OK, 0 rows affected (0.02 sec)

mysql> show slave status\G
Slave_IO_Running: No
[20 Feb 2004 6:04]
David Newcomb
Sorry - 4.0.16 / Win2000
[22 Feb 2004 2:38]
Guilhem Bichot
I have updated BUG#1858 (which you had reported), which could be involved here, as the replication problems seem to appear at roughly the time of REPAIR/OPTIMIZE TABLE. However, there is probably a 2nd bug involved (cleanly stopping the slave threads, which is what BUG#1858 was doing, is not a reason for corrupting relay logs), which I will study. In your case, though, the latest fix for 1858 may keep the 2nd bug from appearing.
[28 Feb 2004 15:13]
Guilhem Bichot
Hi,

> 040219 3:41:22 Error running query, slave SQL thread aborted. Fix the problem,
> and restart the slave SQL thread with "SLAVE START". We stopped at log
> 'binlog.2473' position 705697
> When manually viewing the master's binlog with mysqlbinlog, there is no such log
> position 705697 (but 705670 and 705773 do exist and are dated 3:40:07).

Yes, this is the first symptom of the problem, and it explains all the failing attempts your automated script made: Exec_master_log_pos=705697 is a wrong position, so the script could not succeed by using this wrong value. I have looked at your binlog but have not finished. A couple of questions:
- did this start happening (or increase in frequency) after an upgrade?
- does it still occur?
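A minimal sketch of the kind of check described above, assuming the master's binlog.2473 is available locally. mysqlbinlog prints a "# at <offset>" line before each event it decodes; the file name and positions below are the ones quoted in this report, and -j (--position) is the 4.0-era option name (later versions call it --start-position).

    # List the event offsets mysqlbinlog reports near the suspect position.
    # If 705697 never appears in the "# at" lines, the position in the
    # slave's error log does not correspond to any event boundary in the binlog.
    mysqlbinlog -j 705670 binlog.2473 | grep '^# at' | head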
[29 Feb 2004 14:49]
Guilhem Bichot
Here are the final conclusions on this issue:

- The slave crashed, which means that the end of the relay log, which was in memory and not yet on disk, was lost. In master.info it said that the slave I/O thread had read up to position 710852 in the binlog, but in fact the relay log ended at around 705500; the last statement in it was truncated (again because the end of this statement was in memory, not on disk; for speed the relay log code uses some caching, which of course has this safety drawback).
- The slave restarted. The slave SQL thread restarted from position 617772 in the relay log (which was also 617772 in the binlog). It replayed statements until it found the truncated one around 705500: then it said "read error", "relay log corrupted" and stopped, which is all normal. Note that 705500 < 710852, which is one more hint that the statement was truncated by the crash.
- Unfortunately (and that's the only unexpected thing) the error message contained position 705697, which does not exist in the binlog, thus preventing 'repfix' from resolving the problem.

So why this weird position? It corresponds to the position of the BEGIN (705517) plus the length of the INSERT statement at position 706051 (180). The only reasons I could come up with are:

- was the slave's server id changed around the time of the crash, or the time of the restart? Could it have been changed to 84?
- was the SQL_SLAVE_SKIP_COUNTER variable used around these moments?
- was the slave (even briefly) started with --skip-innodb at some moment?

I also slightly changed a bit of code in 4.0.19; you may want to give it a try on your slave when it is released (probably not before several weeks). And the unexpected stops of replication caused by OPTIMIZE TABLE / REPAIR TABLE are fixed in 4.0.19 too, in any case. Good luck!
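A hedged illustration of how the truncation itself could be confirmed, assuming the relay log file named in the error log (./rdb88-relay-bin.065) is still present: relay logs use the same on-disk format as binlogs, so mysqlbinlog can read them and will stop with a read error at the truncated last event. The comparison with the position in master.info is only approximate, since relay log offsets and master binlog positions usually differ slightly.

    # The last "# at" offset printed before the error shows where the relay
    # log really ends on disk; compare it with the read position recorded
    # in master.info / SHOW SLAVE STATUS (710852 in this case).
    mysqlbinlog rdb88-relay-bin.065 | grep '^# at' | tail
    mysqlbinlog rdb88-relay-bin.065 > /dev/null   # prints the read error for the truncated event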
[29 Feb 2004 15:40]
Guilhem Bichot
By the way, I was surprised by this line:

040219 3:40:58 Slave SQL thread initialized, starting replication in log 'binlog.2473' at position 617772, relay log './rdb88-relay-bin.065' position: 617772

Usually the binlog position and relay log position do not match that exactly, because the relay log usually contains a few more events than the binlog (for example, one additional Rotate event every time the slave I/O thread connects to the master). Of course, it can happen that the 2 positions are equal at some point; it's just not seen that often.
[29 Feb 2004 23:41]
Anders Henke
I've manually installed 4.0.18 on one slave (binaries taken from the mysql.com RPMs), but that didn't solve the issue (as the changelogs don't mention anything close to this one, I would have been quite surprised if 4.0.18 had fixed this case ...). Hopefully 4.0.19 resolves both this bug and #1858. Ok, back to the questions from the 28th/29th: the problem doesn't occur more often after an upgrade; it occurs every night on at least one (different) slave, sometimes on up to three at the same time. At the moment it doesn't occur, but I've also commented out the daily runs of OPTIMIZE/REPAIR for the slave hosts, so the problem seems to be related to those administrative commands. Repfix doesn't (yet) play around with SQL_SLAVE_SKIP_COUNTER. As the problems appear in the middle of the night and I'm one of the two local gurus both investigating the problem and managing those hosts, I also doubt that someone is manually changing the server id or other server settings. As the "valuable" databases on the slaves are close to 100% InnoDB, no one here has even come close to suggesting skip-innodb, so that should also be out of the question :)
[29 Feb 2004 23:45]
Anders Henke
I've also experienced the "start slave doesn't give errors" behaviour reported by David Newcomb on Feb 20th when manually getting the servers to replicate again. I thought of 'start slave' as a command which only schedules the start of the two slave threads and returns "ok" when that scheduling has succeeded, so I didn't think of mentioning it ...
[1 Mar 2004 1:34]
Guilhem Bichot
Hi, let's hope that 4.0.19 resolves this bug. It is a good idea to add this to repfix for future troubleshooting: before RESET SLAVE, save the relay logs somewhere (they are invaluable for debugging what happened on the slave). About the "start slave doesn't give errors" behaviour reported by David Newcomb on Feb 20th: you are quite right, 'start slave' is a command which only schedules the start of the two slave threads and returns "ok" when that scheduling has succeeded. It would report if the threads failed to start, but in your case the slave SQL thread did start, opened the relay log, tried to read an event from the relay log and then stopped. So it is not a bug, just a behaviour which sometimes surprises people (and changing this behaviour would probably break some existing user scripts, so it can't be done now).
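A minimal sketch of what such a repfix step might look like, assuming the relay logs live in the MySQL data directory under the rdb88-relay-bin.* names seen in the error log; the archive path and the CHANGE MASTER coordinates are placeholders for illustration, not values taken from the actual script.

    # 1. Archive the relay logs before RESET SLAVE discards them.
    mkdir -p /var/tmp/relaylogs.$(date +%Y%m%d)
    cp /var/lib/mysql/rdb88-relay-bin.* /var/tmp/relaylogs.$(date +%Y%m%d)/

    # 2. Re-point the slave at a known-good master position and restart.
    mysql -e "STOP SLAVE; RESET SLAVE;"
    mysql -e "CHANGE MASTER TO MASTER_LOG_FILE='binlog.2473', MASTER_LOG_POS=705517; START SLAVE;"

    # 3. START SLAVE only schedules the two threads, so confirm that both
    #    are actually running before declaring success.
    sleep 5
    mysql -e "SHOW SLAVE STATUS\G" | grep -E 'Slave_(IO|SQL)_Running'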
[1 Mar 2004 1:37]
Guilhem Bichot
When you say "The problem doesn't occur more often after an upgrade, it does occur every night on at least one (different) slave, sometimes on up to three at the same time", which problem exactly? Do you mean that the position is wrong and 'repfix' can't restart replication, or just that replication stops and can be successfully restarted (which is BUG#1858)?
[1 Mar 2004 2:56]
Anders Henke
"problem doesn't occur more often after an upgrade" was somehow misunderstanding; it referse to Bug 2886 and means that the problem doesn't appear more often after an upgrade than usually and the problem doesn't seem to be related to an upgrade.
[1 Mar 2004 4:16]
Guilhem Bichot
This: "the Exec_Master_Log_Pos position is wrong and so 'repfix' FAILS TO restart replication": does it occur every night or so, on several slaves, or much more rarely ?
[1 Mar 2004 5:41]
Anders Henke
It doesn't happen every night on all slaves, but it shows up roughly every night on at least one of the slaves, so still with a probability of about 10-20% (given our setup).
[19 May 2004 13:59]
Michael Schmidt
I had the same problem and after updating the slave to 4.0.20 it disappeared.
[8 Nov 2004 10:34]
Guilhem Bichot
Hi Anders, I'm closing this bug as it hasn't had any feedback for 6 months. If you have done more tests with the latest versions (4.0, or even 4.1, which is now production), feel free to mention them here and reopen the bug. Thanks!