Bug #41400 slave fails to reconnect on errors
Submitted: 11 Dec 2008 15:53 Modified: 31 May 2009 6:07
Reporter: Mark Callaghan
Status: Duplicate
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.0.67 OS:Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: reconnect, replication, slave

[11 Dec 2008 15:53] Mark Callaghan
Description:
Read handle_slave_io. Any place where it has 'goto err' stops the IO thread and doesn't attempt to reconnect on possibly transient errors. 

Potential problems are:
* call to get_master_version_and_clock. This runs several queries on the master (see calls to mysql_real_query). If any of them fails, the IO thread stops.

* call to register_slave_on_master. This runs a command on the master. If that fails, then the IO thread stops. Yet this call has a very funny incorrect comment -- 'If fails, this is not fatal - we just print the error message and go on with life.'

I prefer not to report the same problem multiple times.

Related bugs for this are:
http://bugs.mysql.com/bug.php?id=21132
http://bugs.mysql.com/bug.php?id=30814
http://bugs.mysql.com/bug.php?id=19175
http://bugs.mysql.com/bug.php?id=11923

How to repeat:
Read the code, add query failures in debug mode to a few more places in slave.cc to show that the code doesn't reconnect when it should.

Suggested fix:
Add code to reconnect to the master.
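
A minimal sketch of the kind of retry logic this suggests, not the actual slave.cc code. Every name here (StepResult, IoThreadOutcome, run_with_reconnect, the step and reconnect callbacks) is hypothetical, and the retry limit and delay are assumed parameters. It only illustrates treating a transient failure differently from a fatal one instead of stopping the IO thread in both cases.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Hypothetical classification of one failed step; not taken from slave.cc.
enum class StepResult { Ok, TransientError, FatalError };

// Hypothetical outcome of the whole sequence.
enum class IoThreadOutcome { Running, Stopped };

// Runs `step` (a stand-in for something like registering on the master).
// On a transient error it sleeps, calls `reconnect`, and tries again, up to
// `max_attempts` attempts in total. A fatal error stops immediately.
IoThreadOutcome run_with_reconnect(
    const std::function<StepResult()>& step,
    const std::function<void()>& reconnect,
    int max_attempts,
    std::chrono::milliseconds retry_delay)
{
    assert(max_attempts > 0);
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        switch (step()) {
        case StepResult::Ok:
            return IoThreadOutcome::Running;
        case StepResult::FatalError:
            return IoThreadOutcome::Stopped;
        case StepResult::TransientError:
            // Only back off and reconnect if another attempt will follow.
            if (attempt < max_attempts) {
                std::this_thread::sleep_for(retry_delay);
                reconnect();
            }
            break;
        }
    }
    return IoThreadOutcome::Stopped;  // retries exhausted
}
```

With a step that fails twice with a transient error and then succeeds, this returns Running after two reconnects. A step that reports a fatal error returns Stopped without any reconnect.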
[11 Dec 2008 16:54] Sveta Smirnova
Thank you for the report.
[12 Dec 2008 21:27] Andrei Elkin
After talking to Sinisa we came to a consensus.
Sinisa said: errors could occur due to network problems
... how about trying a restart after a sleep().

Indeed, the second of the mentioned functions, register_slave_on_master(), can return
with a transient error, which allows an automatic restart after a timeout.
Regarding the other two sub-issues of the description:
1. The errors of get_master_version_and_clock() are all critical, so
   the slave cannot restart.
2. The misleading comment has been removed in 5.1.
[12 Dec 2008 21:42] Mark Callaghan
Some errors from get_master_version_and_clock are transient and may be caused by a flaky network. That function runs several queries on the master. Any of them can fail because of a flaky network. Reconnect must be retried in that case.
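
One way to make that distinction, sketched under assumptions: look at the client error number after a failed query and retry only when it looks like a network problem. The numeric values below mirror the CR_* codes in the MySQL client library's errmsg.h and are redefined locally so the sketch compiles on its own. The helper names and the choice of which codes count as transient are assumptions, not what slave.cc does.

```cpp
#include <cassert>

// Values mirror the MySQL client library's errmsg.h; redefined here only so
// the sketch is self-contained.
constexpr unsigned int kCrConnectionError = 2002;  // CR_CONNECTION_ERROR
constexpr unsigned int kCrConnHostError   = 2003;  // CR_CONN_HOST_ERROR
constexpr unsigned int kCrServerGoneError = 2006;  // CR_SERVER_GONE_ERROR
constexpr unsigned int kCrServerLost      = 2013;  // CR_SERVER_LOST

// Hypothetical helper: true if the client error number suggests the
// connection itself failed rather than the query being rejected.
bool looks_like_network_error(unsigned int client_errno)
{
    return client_errno == kCrConnectionError ||
           client_errno == kCrConnHostError ||
           client_errno == kCrServerGoneError ||
           client_errno == kCrServerLost;
}

// Hypothetical decision for a failed query on the master.
enum class QueryFailure { Retry, Stop };

QueryFailure classify_query_failure(unsigned int client_errno)
{
    assert(client_errno != 0);  // call this only after a query has failed
    return looks_like_network_error(client_errno) ? QueryFailure::Retry
                                                  : QueryFailure::Stop;
}
```

In real code the number would come from mysql_errno() on the connection after mysql_real_query() reports a failure.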
[22 May 2009 11:03] Zhenxing He
Hi,

I think this problem does not exist in 5.1+.

1) get_master_version_and_clock() only returns 1 when the queries are successful but the results of the queries are not as expected. So it should not suffer from any flaky network problems.

2) the problem with register_slave_on_master() has already been fixed by BUG#29976.

If there is no objection, I'd like to mark this bug as a dup of BUG#29976.

Please provide feedback if you disagree, thanks!
[25 May 2009 9:33] Zhenxing He
Dup of BUG#29976
[25 May 2009 14:36] Mark Callaghan
I am not fond of the fix for get_master_version_and_clock.

get_master_version_and_clock doesn't report an error when queries on the master fail. Instead it makes up values to use or doesn't do the error checks. If that is OK to do, then why not get rid of this code?
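
To illustrate the alternative, here is a sketch in which a failed query is reported as its own status instead of being replaced by a made-up value or a skipped check. The types, the function, and the example check (comparing server ids) are hypothetical and are not the code in slave.cc.

```cpp
#include <cassert>
#include <string>

// Hypothetical statuses: the query failed, or it succeeded with a bad value.
enum class CheckStatus { Ok, QueryFailed, UnexpectedResult };

// Hypothetical stand-in for the outcome of one query on the master.
struct QueryOutcome {
    bool succeeded;     // false if the query itself failed (e.g. network)
    std::string value;  // meaningful only when succeeded is true
};

// Example check: the master must not report the same server id as the slave.
CheckStatus check_server_id(const QueryOutcome& master_id,
                            const std::string& slave_id)
{
    assert(!slave_id.empty());
    if (!master_id.succeeded) {
        // Surface the failure; do not invent a value or skip the check.
        return CheckStatus::QueryFailed;
    }
    if (master_id.value == slave_id) {
        return CheckStatus::UnexpectedResult;  // query worked, result is bad
    }
    return CheckStatus::Ok;
}
```

A caller could then retry on QueryFailed and stop on UnexpectedResult, which keeps the check meaningful.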
[26 May 2009 3:21] Zhenxing He
Hi Mark,

I agree that get_master_version_and_clock should not ignore errors of queries. But since this issue is different from what this bug report originally reported, I'd like to open a new bug to handle this problem. Is that OK?
[31 May 2009 6:07] Zhenxing He
The get_master_version_and_clock problem is handled by Bug#45214; closing this bug as a dup.