Bug #44830 SLAVE START no longer results in error if RESTORE is running on master
Submitted: 12 May 2009 17:04
Modified: 16 Sep 2009 15:48
Reporter: Chuck Bell
Status: Closed
Category: MySQL Server: Replication
Severity: S3 (Non-critical)
Version: 6.0.11
OS: Any
Assigned to: Libing Song
CPU Architecture: Any

[12 May 2009 17:04] Chuck Bell
Description:
I think something has changed in the server WRT slave connections. It used to be the case that when a slave connected it resulted in the binlog_send() method being called. However, this is no longer the case. 

The binlog_send() method is where we placed code to check that there isn't a restore running on the master and if there is, we issue an error to the slave and thus the START SLAVE command returns an error.

Now, because binlog_send() is not called when a slave connects to a master that has a restore in progress, no error message is printed and START SLAVE succeeds. Only after START SLAVE has completed does the slave request information from the master. That request results in a request_dump() call, which does fire binlog_send(), but too late for START SLAVE to capture the error. In this case the slave does stop, but the error does not show in SHOW SLAVE STATUS, SHOW ERRORS, SHOW WARNINGS, or anywhere else other than the slave's console messages. This is clearly wrong and needs to be fixed.

How to repeat:
You can see the behavior using the rpl_backup_block.test file. It has several sections commented out until this bug can be fixed.

Suggested fix:
Unknown
[12 May 2009 17:36] Chuck Bell
Test is in mysql-6.0-backup tree.

An alternative way to reproduce the problem is:
* Set up a master and a slave using --console but do not connect the slave.
* Run a backup (of any database).
* Use a debugger and set a breakpoint in the master server
  kernel.cc @225 : res= context.do_restore(overwrite);
* Connect the slave while the master is paused at the breakpoint.
* Observe that START SLAVE completes without errors.
* Resume the restore on the master.
* Do anything to prompt the slave to fetch from the master.
* Observe that the error appears in the slave's console but not in the slave client.
* Attempt SHOW ERRORS, etc. on the slave.
* Observe that the error does not show anywhere.
[12 May 2009 17:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73852

2809 Chuck Bell	2009-05-12
      BUG#44830 : SLAVE START no longer results in error if RESTORE is running on master
      
      Disabled portions of rpl_backup test because the slave no longer
      returns an error when a restore is in progress. This is a change
      in the way the slaves connect to the master and must be fixed.
      It has broken the ability to block slave connections (via START
      SLAVE) while a restore is in progress.
      modified:
        mysql-test/suite/rpl/r/rpl_backup_block.result
        mysql-test/suite/rpl/t/rpl_backup_block.cnf
        mysql-test/suite/rpl/t/rpl_backup_block.test
[12 May 2009 17:42] Chuck Bell
Previous patch is to disable affected portions of the test.
[12 May 2009 17:51] Chuck Bell
CORRECTION 

An alternative way to reproduce the problem is:
* Set up a master and a slave using --console but do not connect the slave.
* Run a backup (of any database).
* Use a debugger and set a breakpoint in the master server in
  kernel.cc @225 : res= context.do_restore(overwrite);
* On the master, run the restore of the database backed up previously
  (use OVERWRITE).
* Allow code to break at the breakpoint.
* Connect the slave while the master is paused at the breakpoint.
* Observe that START SLAVE completes without errors.
* Resume the restore on the master.
* Do anything to prompt the slave to fetch from the master.
* Observe that the error appears in the slave's console but not in the slave client.
* Attempt SHOW ERRORS, etc. on the slave.
* Observe that the error does not show anywhere.
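
The last two steps amount to a check on one row of SHOW SLAVE STATUS output. Below is a minimal Python sketch of that check, assuming the row has already been fetched into a dict keyed by column name (how it is fetched is left out). The function name classify_io_thread and its return labels are hypothetical. Slave_IO_Running and Last_IO_Error are standard SHOW SLAVE STATUS columns. Treating every value other than "Yes" as "not running" is a simplifying assumption.

```python
def classify_io_thread(status_row):
    """Classify the slave I/O thread from one SHOW SLAVE STATUS row (a dict)."""
    io_running = status_row.get("Slave_IO_Running", "")
    last_error = status_row.get("Last_IO_Error", "")
    if io_running == "Yes":
        return "running"
    if last_error:
        # The thread stopped and the reason is visible to the client.
        return "stopped_with_error"
    # The symptom in this report: the thread stopped, but the status
    # output gives no reason.
    return "stopped_silently"


# Hypothetical rows, written by hand for illustration only.
silent_row = {"Slave_IO_Running": "No", "Last_IO_Error": ""}
reported_row = {"Slave_IO_Running": "No",
                "Last_IO_Error": "ER_MASTER_BLOCKING_SLAVES"}

print(classify_io_thread(silent_row))
print(classify_io_thread(reported_row))
```

In terms of this sketch, the behavior reported here corresponds to "stopped_silently"; a slave that surfaces the master's error would be classified as "stopped_with_error".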
[12 May 2009 18:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/73857

2809 Chuck Bell	2009-05-12
      BUG#44830 : SLAVE START no longer results in error if RESTORE is running on master
      
      Disabled portions of rpl_backup test because the slave no longer
      returns an error when a restore is in progress. This is a change
      in the way the slaves connect to the master and must be fixed.
      It has broken the ability to block slave connections (via START
      SLAVE) while a restore is in progress.
      
      Previously, when a slave attempted to connect to a master that
      had a restore in progress, the START SLAVE command would fail and
      an error would be sent to the client. Now, the command succeeds and
      no error is sent to the client. Since this test relies on detecting
      the error, it fails when run against the latest code.
      
      It is likely the mechanism for how the slave connects and/or the 
      sequence of events for detecting errors has changed thereby 
      causing the slave to delay the detection of being blocked by a
      restore run on the master.
      modified:
        mysql-test/suite/rpl/r/rpl_backup_block.result
        mysql-test/suite/rpl/t/rpl_backup_block.cnf
        mysql-test/suite/rpl/t/rpl_backup_block.test
[20 Aug 2009 10:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81150

2859 Li-Bing.Song@sun.com	2009-08-20
      BUG#44830 SLAVE START no longer results in error if RESTORE is running on master
      
      In fact, START SLAVE cannot be made to return this error.
      START SLAVE exits successfully as soon as the I/O thread and the SQL thread are started.
      It does not wait for the I/O thread to connect to the master.
      Until the I/O thread has connected to the master, the slave does not know the master's
      status, including whether a RESTORE command is running on the master.
      
      The I/O thread sends a binlog request to the master and then waits for a reply.
      The master receives the request and sends an "ER_MASTER_BLOCKING_SLAVES" error to the slave if a RESTORE command is running.
      This patch adds code to report the error and exit when the slave receives an "ER_MASTER_BLOCKING_SLAVES" error.
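
The timing described in this commit message can be illustrated with a toy model. This is only a sketch in Python, not the server code: the classes Master and Slave, the may_connect event, and the "OK" return value are all hypothetical, and only the names request_dump, binlog_send and ER_MASTER_BLOCKING_SLAVES come from this report. It shows start_slave() returning success before the I/O thread has contacted the master, so the blocking error can only arrive afterwards.

```python
import threading

# Error name taken from the report; its numeric code is deliberately omitted.
ER_MASTER_BLOCKING_SLAVES = "ER_MASTER_BLOCKING_SLAVES"


class Master:
    """Toy master that refuses binlog dump requests while a restore runs."""

    def __init__(self, restore_running):
        self.restore_running = restore_running

    def request_dump(self):
        # Stands in for the check the report places in binlog_send().
        if self.restore_running:
            return ER_MASTER_BLOCKING_SLAVES
        return None


class Slave:
    """Toy slave whose start_slave() only launches the I/O thread."""

    def __init__(self, master):
        self.master = master
        self.io_error = None
        # Lets the demo hold the I/O thread back until we allow it to connect.
        self.may_connect = threading.Event()
        self.io_thread = None

    def _io_thread_main(self):
        self.may_connect.wait()
        self.io_error = self.master.request_dump()

    def start_slave(self):
        self.io_thread = threading.Thread(target=self._io_thread_main)
        self.io_thread.start()
        return "OK"  # returned before the master has been contacted


slave = Slave(Master(restore_running=True))
result = slave.start_slave()
error_at_return = slave.io_error      # nothing known yet
slave.may_connect.set()               # now let the I/O thread connect
slave.io_thread.join()
error_after_connect = slave.io_error  # the master's refusal arrives here
print(result, error_at_return, error_after_connect)
```

In the model, the caller of start_slave() has its result in hand before any error exists, which is why the error has to be reported by the I/O thread itself.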
[20 Aug 2009 10:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/81151

2859 Li-Bing.Song@sun.com	2009-08-19
      BUG#44830 SLAVE START no longer results in error if RESTORE is running on master
      
[3 Sep 2009 7:47] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82281

2859 Li-Bing.Song@sun.com	2009-09-03
      BUG#44830 SLAVE START no longer results in error if RESTORE is running on master
      
      In fact, START SLAVE cannot be made to return this error.
      START SLAVE exits successfully as soon as the I/O thread and the SQL thread are started.
      It does not wait for the I/O thread to connect to the master.
      Until the I/O thread has connected to the master, the slave does not know the master's
      status, including whether a RESTORE command is running on the master.
      
      When a slave requests a binlog dump from a master on which a RESTORE
      command is running, the master sends an ER_MASTER_BLOCKING_SLAVES error
      to the slave and then closes the connection. The slave must report
      an error and then stop the I/O thread after it receives the error from the master.
      This patch adds code to report the error and then stop the I/O thread
      when the slave receives an "ER_MASTER_BLOCKING_SLAVES" error.
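
The slave-side behavior this commit message describes (report the error, then stop the I/O thread) can be sketched as follows. This is a hedged illustration in Python, not the patch itself: IoThreadState, handle_dump_reply, the last_io_error field and the "continue"/"stop"/"retry" labels are hypothetical names. Treating every other error as retryable is an assumption made only to contrast with the blocking case.

```python
from dataclasses import dataclass

# Error name taken from the report; its numeric code is deliberately omitted.
ER_MASTER_BLOCKING_SLAVES = "ER_MASTER_BLOCKING_SLAVES"


@dataclass
class IoThreadState:
    running: bool = True
    last_io_error: str = ""  # what a status query would show to the client


def handle_dump_reply(state, error_name):
    """Update the I/O thread state for one reply to a binlog dump request.

    Returns "continue", "stop" or "retry" (hypothetical labels).
    """
    if error_name is None:
        return "continue"
    if error_name == ER_MASTER_BLOCKING_SLAVES:
        # Report first, so the reason is visible to the client, then stop.
        state.last_io_error = error_name
        state.running = False
        return "stop"
    # Assumption for this sketch: any other error is treated as transient.
    return "retry"


state = IoThreadState()
action = handle_dump_reply(state, ER_MASTER_BLOCKING_SLAVES)
print(action, state.running, state.last_io_error)
```

Recording the error in state that the client can query, rather than only logging it, is what addresses the original complaint that the error appeared solely in the slave's console.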
[5 Sep 2009 9:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82523

2859 Li-Bing.Song@sun.com	2009-09-05
      BUG#44830 SLAVE START no longer results in error if RESTORE is running on master
      
[5 Sep 2009 9:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82524

2810 Li-Bing.Song@sun.com	2009-09-05
      BUG#44830 SLAVE START no longer results in error if RESTORE is running on master
      
[15 Sep 2009 13:52] Bugs System
Pushed into 5.4.4-alpha (revid:alik@sun.com-20090915134838-5nj3ycjfsqc2vr2f) (version source revid:li-bing.song@sun.com-20090905091947-qvhff5qgqugzr1tx) (merge vers: 5.4.4-alpha) (pib:11)
[16 Sep 2009 15:48] Jon Stephens
Documented in the 5.4.4 changelog as follows:

        START SLAVE succeeded even if the IO thread did not 
        connect to the master.

Closed.