Bug #68203 slave does not retry connect
Submitted: 28 Jan 2013 9:39 Modified: 30 Jan 2013 17:25
Reporter: 智史 三谷 Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.1.54, 5.0.97, 5.1.69, 5.5.31, 5.7.1 OS:Linux
Assigned to: CPU Architecture:Any
Triage: Needs Triage: D3 (Medium)

[28 Jan 2013 9:39] 智史 三谷
Description:
we got a below log. it seems that slave did not retry connect.

130122 16:52:51 [Note] Slave I/O thread: connected to master 'repl@masked-hostname:34038',replication started in log 'mysqld-bin.000052' at position 863638611
130125  1:49:35 [ERROR] Slave I/O thread couldn't register on master
130125  1:49:35 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.000064', position 103416485
130125 10:03:24 [Note] Slave I/O thread: connected to master 'repl@masked-hostname:34038',replication started in log 'mysqld-bin.000064' at position 103416485

my master-retry-count setting is 0. In manual, master-retry-count means "infinite".
i believe that below code cause this bug. if master-retry-count=0, it retry only once.

sql/slave.cc - try_to_reconnect:
  if ((*retry_count)++)
  {
    if (*retry_count > master_retry_count)  <--- here
      return 1;                             // Don't retry forever
    safe_sleep(thd, mi->connect_retry, (CHECK_KILLED_FUNC) io_slave_killed,
               (void *) mi);
  }

How to repeat:
hard to repeat. it depending on timing.
[29 Jan 2013 15:39] Sinisa Milivojevic
We need a fully repeatable test case, which requires that you provide us with  all data, no matter how voluminous  in order to be able to verify this bug.

Do not forget that master can be busy in which case a timeout should be increased.
[30 Jan 2013 2:12] 智史 三谷
ok.
when network get in trouble, this issue occur.
it is hard to reproduce network trouble.

i suggest you a patch to simulate network trouble.

$ diff -u slave.cc.org slave.cc
--- slave.cc.org        2013-01-30 10:43:01.000000000 +0900
+++ slave.cc    2013-01-30 10:51:08.000000000 +0900
@@ -1512,6 +1512,9 @@
   uint report_host_len, report_user_len=0, report_password_len=0;
   DBUG_ENTER("register_slave_on_master");

+  sql_print_error("#68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail");
+  DBUG_RETURN(1);
+
   *suppress_warnings= FALSE;
   if (!report_host)
     DBUG_RETURN(0);
@@ -2444,6 +2447,7 @@
                             uint *retry_count, bool suppress_warnings,
                             const char *messages[SLAVE_RECON_MSG_MAX])
 {
+  sql_print_error("#68203 DEBUG: try_to_reconnect, master_retry_count=%d, retry_count=%d", master_retry_count, *retry_count);
   mi->slave_running= MYSQL_SLAVE_RUN_NOT_CONNECT;
   thd->proc_info= messages[SLAVE_RECON_MSG_WAIT];
 #ifdef SIGNAL_WITH_VIO_CLOSE
@@ -2622,6 +2626,7 @@
       Register ourselves with the master.
     */
     thd_proc_info(thd, "Registering slave on master");
+    sql_print_error("#68203 DEBUG: Registering slave on master");
     if (register_slave_on_master(mysql, mi, &suppress_warnings))
     {
       if (!check_io_slave_killed(thd, mi, "Slave I/O thread killed "

Here is a debug log.
when master_retry_count=0, slave should be retry infinite.

master_retry_count=0:
Version: '5.1.54-log'  socket: '/vol//mysql/repltest/run/mysql.sock'  port: 54002  Source distribution
130130 11:05:38 [Note] Slave SQL thread initialized, starting replication in log 'mysqld-bin.000040' at position 517165, relay log '/vol/mysql/repltest/data/mysqld-relay-bin.000005' position: 252
130130 11:05:38 [Note] Slave I/O thread: connected to master 'repl@masked-hostname:50002',replication started in log 'mysqld-bin.000040' at position 519793
130130 11:05:38 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:05:38 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:05:38 [ERROR] Slave I/O thread couldn't register on master
130130 11:05:38 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=0, retry_count=0
130130 11:05:38 [Warning] Slave I/O: Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again, log 'mysqld-bin.000040' at postion 519793, Error_code: 1597
130130 11:05:38 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:05:38 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:05:38 [ERROR] Slave I/O thread couldn't register on master
130130 11:05:38 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=0, retry_count=1
130130 11:05:38 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.000040', position 519793

master_retry_count=5:
Version: '5.1.54-log'  socket: '/vol//mysql/repltest/run/mysql.sock'  port: 54002  Source distribution
130130 11:07:04 [Note] Slave I/O thread: connected to master 'repl@masked-hostname:50002',replication started in log 'mysqld-bin.000040' at position 519793
130130 11:07:04 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:07:04 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:07:04 [ERROR] Slave I/O thread couldn't register on master
130130 11:07:04 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=5, retry_count=0
130130 11:07:04 [Warning] Slave I/O: Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again, log 'mysqld-bin.000040' at postion 519793, Error_code: 1597
130130 11:07:04 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:07:04 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:07:04 [ERROR] Slave I/O thread couldn't register on master
130130 11:07:04 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=5, retry_count=1
130130 11:07:09 [Warning] Slave I/O: Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again, log 'mysqld-bin.000040' at postion 519793, Error_code: 1597
130130 11:07:09 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:07:09 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:07:09 [ERROR] Slave I/O thread couldn't register on master
130130 11:07:09 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=5, retry_count=2
130130 11:07:14 [Warning] Slave I/O: Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again, log 'mysqld-bin.000040' at postion 519793, Error_code: 1597
130130 11:07:14 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:07:14 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:07:14 [ERROR] Slave I/O thread couldn't register on master
130130 11:07:14 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=5, retry_count=3
130130 11:07:19 [Warning] Slave I/O: Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again, log 'mysqld-bin.000040' at postion 519793, Error_code: 1597
130130 11:07:19 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:07:19 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:07:19 [ERROR] Slave I/O thread couldn't register on master
130130 11:07:19 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=5, retry_count=4
130130 11:07:24 [Warning] Slave I/O: Master command COM_REGISTER_SLAVE failed: failed registering on master, reconnecting to try again, log 'mysqld-bin.000040' at postion 519793, Error_code: 1597
130130 11:07:24 [ERROR] #68203 DEBUG: Registering slave on master
130130 11:07:24 [ERROR] #68203 DEBUG: register_slave_on_master simulate ER_NET_READ_INTERRUPTED fail
130130 11:07:24 [ERROR] Slave I/O thread couldn't register on master
130130 11:07:24 [ERROR] #68203 DEBUG: try_to_reconnect, master_retry_count=5, retry_count=5
130130 11:07:24 [Note] Slave I/O thread exiting, read up to log 'mysqld-bin.000040', position 519793
[30 Jan 2013 17:25] Sveta Smirnova
Thank you for the report.

Verified as described.