MySQL Bugs: #41400: slave fails to reconnect on errors

Bug #41400	slave fails to reconnect on errors
Submitted:	11 Dec 2008 15:53	Modified:	31 May 2009 6:07
Reporter:	Mark Callaghan
Status:	Duplicate
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.67	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	reconnect, replication, slave

Description:
Read handle_slave_io. Every place where it does 'goto err' stops the IO thread without attempting to reconnect, even when the error may be transient.

Potential problems are:
* call to get_master_version_and_clock. This runs several queries on the master (see calls to mysql_real_query). If any of them fails, the IO thread stops.

* call to register_slave_on_master. This runs a command on the master. If that fails, the IO thread stops. Yet this call carries an amusingly incorrect comment: 'If fails, this is not fatal - we just print the error message and go on with life.'
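
To make the complaint concrete, here is a minimal sketch of the control flow being asked for. It is not the code in slave.cc: StepResult, IoThreadExit, and run_startup are hypothetical names, and each step stands in for a call such as get_master_version_and_clock or register_slave_on_master. A transient failure restarts the sequence from the first step (the reconnect) instead of ending the thread.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical outcome of one step in the IO thread's startup sequence.
enum class StepResult { Ok, TransientError, FatalError };

// Hypothetical reason why the startup loop returned.
enum class IoThreadExit { Running, StoppedOnFatal, GaveUp };

// Runs the startup steps in order. A fatal error stops immediately, which is
// what an unconditional 'goto err' does for every error. A transient error
// restarts the whole sequence, up to max_attempts passes.
IoThreadExit run_startup(const std::vector<std::function<StepResult()>>& steps,
                         int max_attempts) {
  assert(max_attempts > 0);
  for (int attempt = 0; attempt < max_attempts; ++attempt) {
    bool restart = false;
    for (const auto& step : steps) {
      const StepResult r = step();
      if (r == StepResult::FatalError) return IoThreadExit::StoppedOnFatal;
      if (r == StepResult::TransientError) {
        restart = true;  // go back to the first step and reconnect
        break;
      }
    }
    if (!restart) return IoThreadExit::Running;
  }
  return IoThreadExit::GaveUp;
}
```

The attempt limit is an assumption of the sketch. A real server would more likely take its limit and delay from configuration.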

I prefer not to report the same problem multiple times.

Related bugs for this are:
http://bugs.mysql.com/bug.php?id=21132
http://bugs.mysql.com/bug.php?id=30814
http://bugs.mysql.com/bug.php?id=19175
http://bugs.mysql.com/bug.php?id=11923

How to repeat:
Read the code. Inject query failures in debug mode at a few more places in slave.cc to show that the code doesn't reconnect when it should.
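
One possible way to inject such failures is sketched below. It is a self-contained stand-in and does not use the server's own debug facilities. FaultPoints, query_master_clock, and the fault-point name "master_clock_query" are all made up for illustration.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical fault-point registry for debug builds. Each named point fails
// a configured number of times and then behaves normally again.
class FaultPoints {
 public:
  void arm(const std::string& name, int failures) {
    assert(failures >= 0);
    remaining_[name] = failures;
  }

  // Returns true if the caller should simulate a failure at this point.
  bool should_fail(const std::string& name) {
    auto it = remaining_.find(name);
    if (it == remaining_.end() || it->second <= 0) return false;
    --it->second;
    return true;
  }

 private:
  std::map<std::string, int> remaining_;
};

// Stand-in for one query sent to the master. Returns 0 on success and
// nonzero on error, so a test can check how the caller reacts.
int query_master_clock(FaultPoints& faults) {
  if (faults.should_fail("master_clock_query")) return 1;
  return 0;
}
```

Arming a point for two failures and then calling the query three times gives two errors followed by a success, which is enough to see whether the surrounding loop reconnects or stops.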

Suggested fix:
Add code to reconnect to the master

Thank you for the report.

After talking to Sinisa we reached a consensus.
Sinisa said: errors could occur due to network problems
... how about trying a restart after a sleep().

Indeed, the second of the mentioned functions, register_slave_on_master(), can return
with a transient error, which allows an automatic restart after a timeout.
Regarding the other two sub-issues of the description:
1. errors from get_master_version_and_clock() are all critical, so
   the slave cannot restart.
2. the misleading comment has been removed in 5.1.
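
The restart-after-sleep idea could look roughly like the helper below. This is only a sketch: retry_with_sleep is a hypothetical name, and the retry count and delay are parameters the caller would have to choose.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <thread>

// Calls `attempt` until it returns true or `max_retries` additional tries
// have been used up, sleeping `delay` between tries. Returns true if any
// try succeeded.
bool retry_with_sleep(const std::function<bool()>& attempt, int max_retries,
                      std::chrono::milliseconds delay) {
  assert(max_retries >= 0);
  for (int i = 0; i <= max_retries; ++i) {
    if (attempt()) return true;
    if (i < max_retries) std::this_thread::sleep_for(delay);
  }
  return false;
}
```

With max_retries set to 0 the helper degenerates to a single attempt, which matches the behaviour this report complains about.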

Some errors from get_master_version_and_clock are transient and may be caused by a flaky network. That function runs several queries on the master. Any of them can fail because of a flaky network. Reconnect must be retried in that case.
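
A sketch of how transient failures might be told apart from critical ones, using the error number the client library reports (for example via mysql_errno()). The numeric values are the client error codes from errmsg.h. The helper name looks_transient and the choice of which codes count as transient are assumptions made for illustration.

```cpp
#include <cassert>

// Numeric values of client-side error codes from the MySQL client header
// errmsg.h, copied here so the sketch compiles without that header.
constexpr unsigned int kCrConnectionError = 2002;  // CR_CONNECTION_ERROR
constexpr unsigned int kCrConnHostError = 2003;    // CR_CONN_HOST_ERROR
constexpr unsigned int kCrServerGoneError = 2006;  // CR_SERVER_GONE_ERROR
constexpr unsigned int kCrServerLost = 2013;       // CR_SERVER_LOST

// Hypothetical helper: true if the error number suggests a network problem
// that a reconnect might cure. Which codes belong here is an assumption.
bool looks_transient(unsigned int client_errno) {
  switch (client_errno) {
    case kCrConnectionError:
    case kCrConnHostError:
    case kCrServerGoneError:
    case kCrServerLost:
      return true;
    default:
      return false;
  }
}
```

A failed query in get_master_version_and_clock could then lead to a reconnect when looks_transient() is true and to a stop otherwise.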

Hi,

I think this problem does not exist in 5.1+.

1) get_master_version_and_clock() only returns 1 when the queries succeed but their results are not as expected, so it should not be affected by flaky network problems.

2) the problem with register_slave_on_master() has already been fixed by BUG#29976.

If no objection, I'd like to mark this bug as a dup of BUG#29976.

Please provide feedback if you disagree, thanks!

Dup of BUG#29976

I am not fond of the fix for get_master_version_and_clock.

get_master_version_and_clock doesn't report an error when queries on the master fail. Instead it makes up values to use or skips the error checks. If that is OK to do, then why not get rid of this code?

Hi Mark,

I agree that get_master_version_and_clock should not ignore query errors. But since this issue is different from what this bug report originally described, I'd like to open a new bug to handle it. Is that OK?

The get_master_version_and_clock problem is handled by Bug#45214. Closing this bug as a duplicate.