MySQL Bugs: #43460: STOP SLAVE unable to complete when slave thread is trying to reconnect to master

Bug #43460	STOP SLAVE unable to complete when slave thread is trying to reconnect to master
Submitted:	6 Mar 2009 16:49	Modified:	23 Jan 2012 15:22
Reporter:	Jasmin Letendre	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.1.30, 5.1.32	OS:	Windows
Assigned to:	Sven Sandberg	CPU Architecture:	Any
Tags:	Delay, start slave, stop slave, timeout

Description:
I have a basic replication setup with one master and one slave. Every time I try to stop the slave while the master is down, it takes a 1 minute timeout before STOP SLAVE returns.

I confirmed using show processlist that one thread is stuck in state 'Killing slave' while the slave thread is in state 'Reconnecting after a failed master event read'. Here is the complete output.

mysql> show processlist;
+----+-------------+----------------+-------+---------+------+----------------------------
-------------------------------------------+------------------+
| Id | User        | Host           | db    | Command | Time | State                      
                                          | Info             |
+----+-------------+----------------+-------+---------+------+----------------------------
-------------------------------------------+------------------+
|  3 | root        | localhost:4337 | mysql | Query   |   16 | Killing slave              
                                          | stop slave       |
|  4 | root        | localhost:4338 | mysql | Sleep   |    2 |                            
                                          | NULL             |
|  7 | root        | localhost:4345 | NULL  | Sleep   | 1188 |                            
                                          | NULL             |
| 17 | system user |                | NULL  | Connect |  110 | Reconnecting after a failed
master event read                         | NULL             |
| 18 | system user |                | NULL  | Connect |  109 | Has read all relay log;
waiting for the slave I/O thread to update it | NULL             |
| 19 | root        | localhost:4598 | NULL  | Query   |    0 | NULL                       
                                          | show processlist |
+----+-------------+----------------+-------+---------+------+----------------------------
-------------------------------------------+------------------+

I had originally entered this information in bug #31024 which was marked as closed and fixed in 5.1.24. It seems this bug is still there in 5.1.30 on Windows.(mysqld  Ver 5.1.30-community-log for Win32 on
ia32 (MySQL Community Server (GPL)). I executed the same test on Red Hat Enterprise Linux and the bug is not present on that platform.

How to repeat:
-make sure the master is down
-(on the slave)
   mysql> start slave;
   Query OK, 0 rows affected (0.00 sec)

   mysql> stop slave;
   Query OK, 0 rows affected (57.78 sec)

Suggested fix:
Not familiar enough with the code to provide something more precise than: stop trying to reconnect to master when STOP SLAVE is called.

Thank you for the report.

I can not repeat delay provided. Also version 5.1.30 as a bit old. Please upgrade to current version 5.1.32, try with it and if problem still exists provide output of SHOW SLAVE STATUS \G before running STOP SLAVE

Thanks for the quick response!

I have updated to the latest release (5.1.32) and the problem is still there.
Here are my logs. You'll see that I have call 'show process list' and 'show slave status' twice. The first pair is before I shutdown my master server. The second pair is after the shutdown.

mysql> start slave;
Query OK, 0 rows affected (0.00 sec)

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.0.0.140
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: JAZZMAN2-bin.000113
          Read_Master_Log_Pos: 16258
               Relay_Log_File: JAZZMAN2-relay-bin.000003
                Relay_Log_Pos: 254
        Relay_Master_Log_File: JAZZMAN2-bin.000113
             Slave_IO_Running: Yes
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 16258
              Relay_Log_Space: 15636
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: 0
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

mysql> show processlist \G
*************************** 1. row ***************************
     Id: 8
   User: root
   Host: localhost:3514
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show processlist
*************************** 2. row ***************************
     Id: 9
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 13
  State: Waiting for master to send event
   Info: NULL
*************************** 3. row ***************************
     Id: 10
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 13
  State: Has read all relay log; waiting for the slave I/O thread to update it
   Info: NULL
3 rows in set (0.00 sec)

mysql> show slave status \G
*************************** 1. row ***************************
               Slave_IO_State: Reconnecting after a failed master event read
                  Master_Host: 10.0.0.140
                  Master_User: root
                  Master_Port: 3306
                Connect_Retry: 60
              Master_Log_File: JAZZMAN2-bin.000113
          Read_Master_Log_Pos: 16258
               Relay_Log_File: JAZZMAN2-relay-bin.000003
                Relay_Log_Pos: 254
        Relay_Master_Log_File: JAZZMAN2-bin.000113
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 16258
              Relay_Log_Space: 15636
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 2003
                Last_IO_Error: error reconnecting to master 'root@10.0.0.140:3306' - retry-time: 60  retries: 86400
               Last_SQL_Errno: 0
               Last_SQL_Error:
1 row in set (0.00 sec)

mysql> show processlist \G
*************************** 1. row ***************************
     Id: 8
   User: root
   Host: localhost:3514
     db: NULL
Command: Query
   Time: 0
  State: NULL
   Info: show processlist
*************************** 2. row ***************************
     Id: 9
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 45
  State: Reconnecting after a failed master event read
   Info: NULL
*************************** 3. row ***************************
     Id: 10
   User: system user
   Host:
     db: NULL
Command: Connect
   Time: 45
  State: Has read all relay log; waiting for the slave I/O thread to update it
   Info: NULL
3 rows in set (0.00 sec)

mysql> stop slave;
Query OK, 0 rows affected (32.69 sec)

I couldn't repeat this issue too on Windows Vista x64 running latest source servers.

Hi,

I've tried on 3 different machines. 2 were running WinXP 64 bits and 1 was running WinXP 32 bits. They had the same problem. Is it possible that the problem is related to the way I setup replication.

To enable replication, I only changed my config file (that was generated by the MySQL Server Instance Configuration Wizard) by adding the following 2 lines:
   server-id=1
   log-bin=JAZZMAN2_D-bin

I then setup replication live by executing the following actions:
   - mysqldump --user=root --password=root --host=localhost --port=3307 --single-transaction --master-data=1 my_db > my_db_dump.sql
   - CHANGE MASTER TO MASTER_HOST='localhost', MASTER_PORT=3307, MASTER_USER='root', MASTER_PASSWORD='root';
   - mysql --user=root --password=root --host=localhost --port=3306 my_db < my_db_dump.sql
   - START SLAVE;

I have verified that replication is working correctly up to the point where I stop the master server.

Just in case it could be useful, here is my config file content (I removed the comments to make it shorter).

--- Start of config file
[client]
port=3306

[mysql]
default-character-set=latin1
[mysqld]
port=3306
basedir="C:/Program Files (x86)/MySQL/MySQL Server 5.1/"
datadir="C:/Documents and Settings/All Users/Application Data/MySQL/MySQL Server 5.1/Data/"
default-character-set=latin1
default-storage-engine=INNODB
sql-mode="STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
max_connections=160
query_cache_size=31M
table_cache=320
tmp_table_size=35M
thread_cache_size=8

myisam_max_sort_file_size=100G
myisam_sort_buffer_size=51M
key_buffer_size=47M
read_buffer_size=64K
read_rnd_buffer_size=256K
sort_buffer_size=256K

innodb_additional_mem_pool_size=2M
innodb_flush_log_at_trx_commit=1
innodb_log_buffer_size=1M
innodb_buffer_pool_size=91M
innodb_log_file_size=19M
innodb_thread_concurrency=10

server-id=1
log-bin=JAZZMAN2_D-bin
--- End of config file

This is the vise versa to bug #43217.

Unfortunately the damage is much higher when we would let stop the slave immediately. Data lost would be the follow.

If you want to get an immediate stop of the slave please use KILL.

Will you have same problem when you are using KILL?

Using KILL has exactly the same behaviour as using STOP SLAVE in my case.

I don't think my bug is the vice versa to bug #43217. #43217 seems to be about waiting for a whole transaction to replicate before stopping. For my bug, there is nothing at all being replicated. The STOP SLAVE is not waiting on some replicating statement, but rather on the network timeout of the slave IO thread trying to reconnect to the master server.

It seems to me like if the Slave IO thread has called connect() synchronously with a timeout of around 1 minute to reconnect to the master and this call is not interrupted when we issue STOP SLAVE. Is it possible that something like that is going on?

Which setup do you using?

a) master and slave on same machine also mysql CLI on same machine
b) master and slave on different machines and where did you run CLI?
c) master and slave on different machines by using different OS and where did you run CLI?

Did you connect by using "localhost" or by using IP-Adress "127.0.0.1"? Or how did you connect ...
a) to master
b) to slave
c) between master and slave

a)b)c) I have tested pretty much all the combinations and they all have the problem, as long as the DB servers are on Windows.
For the DB server, I have tried:
-both master and slave on same Windows host (bug)
-master and slave on different Windows host (bug)
-master and slave on same RedHat host (works well)
-master and slave on different RedHat hosts (works well)
-master and slave on same CentOS host (works well)
-master and slave on different CentOS hosts (works well)

For the CLI, I have tried (all of them from a Windows host):
-mysql command line tool (bug)
-mysql C client library (bug)
-MyODBC (bug)
-QueryBrowser (bug)

The slave is always connecting to the master using the master's IP address. The client connects to the server using either localhost or the server's IP address.

Thank you for the feedback.

Verified as described:

host:~/sveta/mysql-5.1.36-win32> ./bin/mysqld --defaults-file=my-small.ini --basedir=. --datadir=./data_slave/ --log-error --port=33512 --socket=/tmp/mysql_ssmirnova_slave.sock --server_id=2 --relay-log=slave-relay &
[2] 3528
host:~/sveta/mysql-5.1.36-win32> ./bin/mysql -P33512 -uroot
change master to master_port=3333, master_user='repl', master_password='replrepl', master_='local';
start slave;
stop slave;
select now();
start slave;
select now();
stop slave;
select now();
\q
now()
2009-07-13 08:02:39
now()
2009-07-13 08:02:39
now()
2009-07-13 08:03:40

With master-connect-retry=3 it runs forever.

Any update on this? Any chance this will get fixed in a subsequent release of 5.1?

* bump *

Seems to be experiencing similar behavior (master not available, long (>5min wait times))

+-------------------------+----------------------+
| Variable_name           | Value                |
+-------------------------+----------------------+
| version                 | 5.1.37-2~bpo50+1-log |
| version_comment         | (Debian)             |
| version_compile_machine | i486                 |
| version_compile_os      | debian-linux-gnu     |
+-------------------------+----------------------+

I can reproduce this behavior on MySQL 5.0.58.

I looked into this in detail with Debugger and Debug version of MySQL.
This block took a time. (terminate_slave_thread() function in slave.cc)

while (*slave_running)			// Should always be true
  {
    DBUG_PRINT("loop", ("killing slave thread"));
    KICK_SLAVE(thd);
    /*
      There is a small chance that slave thread might miss the first
      alarm. To protect againts it, resend the signal until it reacts
    */
    struct timespec abstime;
    set_timespec(abstime,2);
    pthread_cond_timedwait(term_cond, cond_lock, &abstime);
  }

See also BUG#58546

After fixing BUG#58546 , I check the behavior under MySQL 5.5.13 on Win64.

But the behavior is same as before.

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented as follows in the 5.5.21 changelog:

        On Windows slave hosts, STOP SLAVE took a overly long time to 
        complete when the master was down.

Closed.

Thanks! Will test as soon as the package is out.

Fix also documented in the 5.6.5 changelog. Status unchanged.