MySQL Bugs: #56416: errors in first connection for IO thread are not retried

Bug #56416	errors in first connection for IO thread are not retried
Submitted:	31 Aug 2010 16:51	Modified:	29 Nov 2010 22:45
Reporter:	Mark Callaghan	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1.47	OS:	Any
Assigned to:	Daogang Qu	CPU Architecture:	Any
Tags:	connect, error, replication

Description:
The IO thread does not retry the first connection it makes to the master. This connection is used to setup replication. I realize that the retries done afterwards are different than a retry done here. However, I still think that some retries should be done in this case. Otherwise, a DBA must monitor restarts done when the network is flaky because of packet loss or other errors.

Also, the error message printed in this case is misleading as no retries have been done. An example is below. This is the same error message that would be printed were there failures after 86400 retries. A better error message would state that the initial connection could not be made.

100831  9:17:07 [Note] Slave SQL thread initialized, starting replication in log 'binary-logs.004056' at position 319321951 and 319321951, relay log '/var/lib/mysql/mysqld-relay-bin.008793' position: 128818774
100831  9:18:07 [ERROR] Slave I/O thread: error connecting to master 'X@Y:3306': Error: 'Can't connect to MySQL server on 'X' (4)'  errno: 2003  retry-time: 10  retries: 86400

How to repeat:
Read the code. At the start of handle_slave_io there is a call to safe_connect. And safe_connect calls connect_to_master with reconnect=0.

From handle_slave_io:

  // we can get killed during safe_connect
  if (!safe_connect(thd, mysql, mi))
  {
    sql_print_information("Slave I/O thread: connected to master '%s@%s:%d',"
                          "replication started in log '%s' at position %s",
                          mi->user, mi->host, mi->port,
                          IO_RPL_LOG_NAME,
                          llstr(mi->master_log_pos,llbuff));
  /*
    Adding MAX_LOG_EVENT_HEADER_LEN to the max_packet_size on the I/O
    thread, since a replication event can become this much larger than
    the corresponding packet (query) sent from client to master.
  */
    mysql->net.max_packet_size= thd->net.max_packet_size+= MAX_LOG_EVENT_HEADER;
  }
  else
  {
    sql_print_information("Slave I/O thread killed while connecting to master");
    goto err;
  }

From safe_connect:

static int safe_connect(THD* thd, MYSQL* mysql, Master_info* mi)
{
  DBUG_ENTER("safe_connect");

  DBUG_RETURN(connect_to_master(thd, mysql, mi, 0, 0));
}

From connect_to_master:

  while (!(slave_was_killed = io_slave_killed(thd,mi)) &&
         (reconnect ? mysql_reconnect(mysql) != 0 :
          mysql_real_connect(mysql, mi->host, mi->user, mi->password, 0,
                             mi->port, 0, client_flag) == 0))
  {
    /* Don't repeat last error */
    if ((int)mysql_errno(mysql) != last_errno)
    {
      last_errno=mysql_errno(mysql);
      suppress_warnings= 0;
      mi->report(ERROR_LEVEL, last_errno,
                 "error %s to master '%s@%s:%d'"
                 " - retry-time: %d  retries: %lu",
                 (reconnect ? "reconnecting" : "connecting"),
                 mi->user, mi->host, mi->port,
                 mi->connect_retry, master_retry_count);
    }

Suggested fix:
Make the error message better. Indicate that no retries have been done.

Consider supporting a few retries in this case.

Thank you for the report.

Verified as described: problem is number of retries which is not 86400 for just few seconds.

It is not a bug.

From the following code, we can see the mysql_real_connect(...) fun will be
try to execute 'mi->retry_count' times if the connection always failed.
And by default the mi->retry_count is set to 3600*24 (== 86400), which
verifys that it really try to connect for 86400 times, just the error
infomation will be reported in the first time.

while (!(slave_was_killed = io_slave_killed(thd,mi)) &&
         (reconnect ? mysql_reconnect(mysql) != 0 :
          mysql_real_connect(mysql, mi->host, mi->user, mi->password, 0,
                             mi->port, 0, client_flag) == 0))
  {
    /* Don't repeat last error */
    if ((int)mysql_errno(mysql) != last_errno || 
        DBUG_EVALUATE_IF("connect_to_master_always_report_error", 
                         TRUE, FALSE))
    {
      /*
        TODO: would be great that when issuing SHOW SLAVE STATUS
              the number of retries would actually show err_count
              instead of mi->retry_count.

              We can achieve that if we remove the 'if' above and 
              replace mi->retry_count with err_count. However, we
              would be adding an entry in the error log for each 
              connection attempt (ie, for each retry).
       */
      last_errno=mysql_errno(mysql);
      suppress_warnings= 0;
      mi->report(ERROR_LEVEL, last_errno,
                 "error %s to master '%s@%s:%d'"
                 " - retry-time: %d  retries: %lu",
                 (reconnect ? "reconnecting" : "connecting"),
                 mi->user, mi->host, mi->port,
                 mi->connect_retry, 
                 DBUG_EVALUATE_IF("connect_to_master_always_report_error", 
                                  err_count+1, 
                                  mi->retry_count));
    }
    /*
      By default we try forever. The reason is that failure will trigger
      master election, so if the user did not set mi->retry_count we
      do not want to have election triggered on the first failure to
      connect
    */
    if (++err_count == mi->retry_count)
    {
      slave_was_killed=1;
      break;
    }
    safe_sleep(thd,mi->connect_retry,(CHECK_KILLED_FUNC)io_slave_killed,
               (void*)mi);
  }

  {"master-retry-count", 0,
   "The number of tries the slave will make to connect to the master before giving up.",
   &master_retry_count, &master_retry_count, 0, GET_ULONG,
   REQUIRED_ARG, 3600*24, 0, 0, 0, 0, 0},

Yes, it is not a bug but a feature request as the IO thread is stuck in the connection retry loop mostly sleeping. I think that SHOW SLAVE STATUS could do a better job of describing that. 

*************************** 1. row ***************************
               Slave_IO_State: Connecting to master
                  Master_Host: foobar
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: 
          Read_Master_Log_Pos: 4
               Relay_Log_File: dev347-relay-bin.000001
                Relay_Log_Pos: 4
        Relay_Master_Log_File: 
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 0
                   Last_Error: 
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 0
              Relay_Log_Space: 212
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2005
                Last_IO_Error: error connecting to master 'root@foobar:3306' - retry-time: 60  retries: 86400
               Last_SQL_Errno: 0
               Last_SQL_Error: 
1 row in set (0.00 sec)

Hi Mark,

Slave_IO_State shows status 'Connecting to master', and also for 5.5, the Slave_IO_Running will show 'Connecting' instead of 'No' when slave IO thread is trying to (re-)connecting the master (BUG#41613). So I think by checking these fields, you can know that the slave is connecting. I'm not sure if this can meet your needs.

I am not sure that this needs to be changed. However, I missed what was going on and the person who reported it to me missed what was going on. At this point, I am tainted. Maybe you can ask one or two other people whether they can diagnose the current that of the slave when this problem occurs.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/118550

3273 Dao-Gang.Qu@sun.com	2010-09-20
      Bug #56416    errors in first connection for IO thread are not retried
      
      SHOW SLAVE STATUS can not display the number of real retries
      for each connection attempt of I/O thread.
      
      To fix the problem to make SHOW SLAVE STATUS to display the
      number of retries which would be real retry counts instead of
      mi->retry_count(3600*24 by default) for each connection
      attempt of I/O thread.
     @ mysql-test/suite/rpl/r/rpl_change_master_dbug.result
        Updated for the patch of BUG #56416
     @ mysql-test/suite/rpl/t/rpl_change_master_dbug.test
        Updated test case to also verifies that the 'Last_IO_Error'
        entry of SHOW SLAVE STATUS will display the number of real
        retries.
     @ sql/rpl_slave.cc
        Added code to make SHOW SLAVE STATUS to display the number of
        real retries for each connection attempt of I/O thread.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/119130

3299 Dao-Gang.Qu@sun.com	2010-09-27
      Bug #56416    errors in first connection for IO thread are not retried
      
      SHOW SLAVE STATUS can not display the number of real retries
      for each connection attempt of I/O thread by 'Last_IO_Error'
      entry.
      
      To fix the problem to make SHOW SLAVE STATUS to display the
      number of retries which would be real retry counts instead of
      mi->retry_count(3600*24 by default) for each connection
      attempt of I/O thread.
     @ mysql-test/suite/rpl/r/rpl_change_master_dbug.result
        Updated for the patch of BUG #56416
     @ mysql-test/suite/rpl/t/rpl_change_master_dbug.test
        Updated test case to also verifies that the 'Last_IO_Error'
        entry of SHOW SLAVE STATUS will display the number of real
        retries.
     @ sql/rpl_slave.cc
        Added code to make SHOW SLAVE STATUS to display the number of
        real retries for each connection attempt of I/O thread.

Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101002181053-6iotvl26uurcoryp) (version source revid:alexander.nozdrin@oracle.com-20101002180917-h0n62akupm3z20nt) (pib:21)

Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:vasil.dimov@oracle.com-20100629074804-359l9m9gniauxr94) (merge vers: 5.6.99-m4) (pib:21)

Documented bugfix in the 5.6.1 changelog as follows:

        SHOW SLAVE STATUS now displays the actual number of retries for
        each connection attempt made by the I/O thread.

Closed.