| Bug #56416 | errors in first connection for IO thread are not retried | ||
|---|---|---|---|
| Submitted: | 31 Aug 2010 16:51 | Modified: | 29 Nov 2010 22:45 |
| Reporter: | Mark Callaghan | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Replication | Severity: | S3 (Non-critical) |
| Version: | 5.1.47 | OS: | Any |
| Assigned to: | Daogang Qu | CPU Architecture: | Any |
| Tags: | connect, error, replication | ||
[31 Aug 2010 17:25]
Sveta Smirnova
Thank you for the report. Verified as described: problem is number of retries which is not 86400 for just few seconds.
[17 Sep 2010 10:29]
Daogang Qu
It is not a bug.
From the following code, we can see the mysql_real_connect(...) fun will be
try to execute 'mi->retry_count' times if the connection always failed.
And by default the mi->retry_count is set to 3600*24 (== 86400), which
verifys that it really try to connect for 86400 times, just the error
infomation will be reported in the first time.
while (!(slave_was_killed = io_slave_killed(thd,mi)) &&
(reconnect ? mysql_reconnect(mysql) != 0 :
mysql_real_connect(mysql, mi->host, mi->user, mi->password, 0,
mi->port, 0, client_flag) == 0))
{
/* Don't repeat last error */
if ((int)mysql_errno(mysql) != last_errno ||
DBUG_EVALUATE_IF("connect_to_master_always_report_error",
TRUE, FALSE))
{
/*
TODO: would be great that when issuing SHOW SLAVE STATUS
the number of retries would actually show err_count
instead of mi->retry_count.
We can achieve that if we remove the 'if' above and
replace mi->retry_count with err_count. However, we
would be adding an entry in the error log for each
connection attempt (ie, for each retry).
*/
last_errno=mysql_errno(mysql);
suppress_warnings= 0;
mi->report(ERROR_LEVEL, last_errno,
"error %s to master '%s@%s:%d'"
" - retry-time: %d retries: %lu",
(reconnect ? "reconnecting" : "connecting"),
mi->user, mi->host, mi->port,
mi->connect_retry,
DBUG_EVALUATE_IF("connect_to_master_always_report_error",
err_count+1,
mi->retry_count));
}
/*
By default we try forever. The reason is that failure will trigger
master election, so if the user did not set mi->retry_count we
do not want to have election triggered on the first failure to
connect
*/
if (++err_count == mi->retry_count)
{
slave_was_killed=1;
break;
}
safe_sleep(thd,mi->connect_retry,(CHECK_KILLED_FUNC)io_slave_killed,
(void*)mi);
}
{"master-retry-count", 0,
"The number of tries the slave will make to connect to the master before giving up.",
&master_retry_count, &master_retry_count, 0, GET_ULONG,
REQUIRED_ARG, 3600*24, 0, 0, 0, 0, 0},
[17 Sep 2010 16:50]
Mark Callaghan
Yes, it is not a bug but a feature request as the IO thread is stuck in the connection retry loop mostly sleeping. I think that SHOW SLAVE STATUS could do a better job of describing that.
*************************** 1. row ***************************
Slave_IO_State: Connecting to master
Master_Host: foobar
Master_User: root
Master_Port: 3306
Connect_Retry: 60
Master_Log_File:
Read_Master_Log_Pos: 4
Relay_Log_File: dev347-relay-bin.000001
Relay_Log_Pos: 4
Relay_Master_Log_File:
Slave_IO_Running: No
Slave_SQL_Running: Yes
Replicate_Do_DB:
Replicate_Ignore_DB:
Replicate_Do_Table:
Replicate_Ignore_Table:
Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
Last_Errno: 0
Last_Error:
Skip_Counter: 0
Exec_Master_Log_Pos: 0
Relay_Log_Space: 212
Until_Condition: None
Until_Log_File:
Until_Log_Pos: 0
Master_SSL_Allowed: No
Master_SSL_CA_File:
Master_SSL_CA_Path:
Master_SSL_Cert:
Master_SSL_Cipher:
Master_SSL_Key:
Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 2005
Last_IO_Error: error connecting to master 'root@foobar:3306' - retry-time: 60 retries: 86400
Last_SQL_Errno: 0
Last_SQL_Error:
1 row in set (0.00 sec)
[20 Sep 2010 3:12]
Zhenxing He
Hi Mark, Slave_IO_State shows status 'Connecting to master', and also for 5.5, the Slave_IO_Running will show 'Connecting' instead of 'No' when slave IO thread is trying to (re-)connecting the master (BUG#41613). So I think by checking these fields, you can know that the slave is connecting. I'm not sure if this can meet your needs.
[20 Sep 2010 6:37]
Mark Callaghan
I am not sure that this needs to be changed. However, I missed what was going on and the person who reported it to me missed what was going on. At this point, I am tainted. Maybe you can ask one or two other people whether they can diagnose the current that of the slave when this problem occurs.
[20 Sep 2010 6:39]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/118550 3273 Dao-Gang.Qu@sun.com 2010-09-20 Bug #56416 errors in first connection for IO thread are not retried SHOW SLAVE STATUS can not display the number of real retries for each connection attempt of I/O thread. To fix the problem to make SHOW SLAVE STATUS to display the number of retries which would be real retry counts instead of mi->retry_count(3600*24 by default) for each connection attempt of I/O thread. @ mysql-test/suite/rpl/r/rpl_change_master_dbug.result Updated for the patch of BUG #56416 @ mysql-test/suite/rpl/t/rpl_change_master_dbug.test Updated test case to also verifies that the 'Last_IO_Error' entry of SHOW SLAVE STATUS will display the number of real retries. @ sql/rpl_slave.cc Added code to make SHOW SLAVE STATUS to display the number of real retries for each connection attempt of I/O thread.
[26 Sep 2010 23:58]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/119130 3299 Dao-Gang.Qu@sun.com 2010-09-27 Bug #56416 errors in first connection for IO thread are not retried SHOW SLAVE STATUS can not display the number of real retries for each connection attempt of I/O thread by 'Last_IO_Error' entry. To fix the problem to make SHOW SLAVE STATUS to display the number of retries which would be real retry counts instead of mi->retry_count(3600*24 by default) for each connection attempt of I/O thread. @ mysql-test/suite/rpl/r/rpl_change_master_dbug.result Updated for the patch of BUG #56416 @ mysql-test/suite/rpl/t/rpl_change_master_dbug.test Updated test case to also verifies that the 'Last_IO_Error' entry of SHOW SLAVE STATUS will display the number of real retries. @ sql/rpl_slave.cc Added code to make SHOW SLAVE STATUS to display the number of real retries for each connection attempt of I/O thread.
[2 Oct 2010 18:15]
Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)
[13 Nov 2010 16:21]
Bugs System
Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)
[29 Nov 2010 22:45]
Jon Stephens
Documented bugfix in the 5.6.1 changelog as follows:
SHOW SLAVE STATUS now displays the actual number of retries for
each connection attempt made by the I/O thread.
Closed.

Description: The IO thread does not retry the first connection it makes to the master. This connection is used to setup replication. I realize that the retries done afterwards are different than a retry done here. However, I still think that some retries should be done in this case. Otherwise, a DBA must monitor restarts done when the network is flaky because of packet loss or other errors. Also, the error message printed in this case is misleading as no retries have been done. An example is below. This is the same error message that would be printed were there failures after 86400 retries. A better error message would state that the initial connection could not be made. 100831 9:17:07 [Note] Slave SQL thread initialized, starting replication in log 'binary-logs.004056' at position 319321951 and 319321951, relay log '/var/lib/mysql/mysqld-relay-bin.008793' position: 128818774 100831 9:18:07 [ERROR] Slave I/O thread: error connecting to master 'X@Y:3306': Error: 'Can't connect to MySQL server on 'X' (4)' errno: 2003 retry-time: 10 retries: 86400 How to repeat: Read the code. At the start of handle_slave_io there is a call to safe_connect. And safe_connect calls connect_to_master with reconnect=0. From handle_slave_io: // we can get killed during safe_connect if (!safe_connect(thd, mysql, mi)) { sql_print_information("Slave I/O thread: connected to master '%s@%s:%d'," "replication started in log '%s' at position %s", mi->user, mi->host, mi->port, IO_RPL_LOG_NAME, llstr(mi->master_log_pos,llbuff)); /* Adding MAX_LOG_EVENT_HEADER_LEN to the max_packet_size on the I/O thread, since a replication event can become this much larger than the corresponding packet (query) sent from client to master. */ mysql->net.max_packet_size= thd->net.max_packet_size+= MAX_LOG_EVENT_HEADER; } else { sql_print_information("Slave I/O thread killed while connecting to master"); goto err; } From safe_connect: static int safe_connect(THD* thd, MYSQL* mysql, Master_info* mi) { DBUG_ENTER("safe_connect"); DBUG_RETURN(connect_to_master(thd, mysql, mi, 0, 0)); } From connect_to_master: while (!(slave_was_killed = io_slave_killed(thd,mi)) && (reconnect ? mysql_reconnect(mysql) != 0 : mysql_real_connect(mysql, mi->host, mi->user, mi->password, 0, mi->port, 0, client_flag) == 0)) { /* Don't repeat last error */ if ((int)mysql_errno(mysql) != last_errno) { last_errno=mysql_errno(mysql); suppress_warnings= 0; mi->report(ERROR_LEVEL, last_errno, "error %s to master '%s@%s:%d'" " - retry-time: %d retries: %lu", (reconnect ? "reconnecting" : "connecting"), mi->user, mi->host, mi->port, mi->connect_retry, master_retry_count); } Suggested fix: Make the error message better. Indicate that no retries have been done. Consider supporting a few retries in this case.