Bug #45852 Semisynch: Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
Submitted: 30 Jun 2009 11:48 Modified: 12 Nov 2009 14:27
Reporter: Philip Stoev Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.4 OS:Any
Assigned to: Zhenxing He CPU Architecture:Any
Tags: experimental, pb2, test failure

[30 Jun 2009 11:48] Philip Stoev
Description:
When executing a replication workload with semisynch replication turned off and periodic STOP SLAVE IO_THREAD thread, START SLAVE IO_THREAD (in order to simulate network failures), the slave stopped with the following information in SHOW SLAVE STATUS:

mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State:
                  Master_Host: 127.0.0.1
                  Master_User: root
                  Master_Port: 19306
                Connect_Retry: 1
              Master_Log_File: master-bin.000001
          Read_Master_Log_Pos: 1900343
               Relay_Log_File: slave-relay-bin.000003
                Relay_Log_Pos: 1283333
        Relay_Master_Log_File: master-bin.000001
             Slave_IO_Running: No
            Slave_SQL_Running: Yes
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 0
                   Last_Error:
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1900343
              Relay_Log_Space: 1900793
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 1593
                Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
               Last_SQL_Errno: 0
               Last_SQL_Error:
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
1 row in set (0.00 sec)

The slave error log contains:

090630 13:58:12 [ERROR] Set 'rpl_semi_sync_master_reply_log_file_pos' on master failed
090630 13:58:12 [ERROR] Slave I/O: Fatal error: Failed to run 'after_queue_event' hook, Error_code: 1593
090630 13:58:12 [Note] Slave I/O thread exiting, read up to log 'master-bin.000001', position 1900343

which implies that the master is at fault, however the master error log does not contain any information about what happened.

Also, the error is reported as "Fatal", but START SLAVE IO_THREAD is able to reestablish replication and Rpl_semi_sync_master_status converts back to ON, so how "Fatal" the error was remains an open question.

How to repeat:
With the Random Query Generator from the mysql-test-extra-6.0 tree, mysql-test/gentest, run:

$ perl runall.pl \
  --basedir=/build/bzr/azalea-bugfixing \
  --grammar=conf/replication_simple.yy \
  --rpl_mode=statement \
  --mem \
  --mysqld=--plugin-dir=/build/bzr/azalea-bugfixing/plugin/semisync/.libs \
  --mysqld=--plugin-load=rpl_semi_sync_master=libsemisync_master.so:rpl_semi_sync_slave=libsemisync_slave.so \
  --mysqld=--rpl_semi_sync_master_enabled=1 \
  --mysqld=--rpl_semi_sync_slave_enabled=1 \
  --duration=1800 \
  --queries=100000000 \
  --threads=10 \
  --validator=ReplicationSlaveStatus \
  --reporter=ReplicationSemiSync 

The ReplicationSemiSync reporter causes the slave to stop for random periods of time and then attempt to reestablish semisynchronous replication. The ReplicationSlaveStatus monitors SHOW SLAVE STATUS and will terminate the test as soon as the error occurs, usually within a minute (a few START|STOP SLAVE IO THREAD cycles).

Suggested fix:
Semisynchronous replication should survive IO thread issues without throwing error messages.

The actual cause of each failure must be present if not in SHOW SLAVE STATUS, then at least in the error log. If the slave error log talks about a failure on the master, the error log on the master needs to contain an entry and debug information for that failure.
[30 Jun 2009 13:30] Zhenxing He
The error is on slave, not on master, maybe the error message 'Set 'rpl_semi_sync_master_reply_log_file_pos' on master failed' is misleading, better say 'Send semi-sync reply to master failed'. This error is mostly caused by connection problem.

Fatal here means that the slave IO thread has to stop because of the error, but it's possible to restart slave IO thread.
[30 Jun 2009 13:51] Philip Stoev
The test uses only STOP SLAVE IO_THREAD and it does not attempt to disrupt the connection between the master and the slave in any other way. Can we have extra debugging information in case this error happens so that this can be debugged further?
[5 Sep 2009 8:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82521

2802 He Zhenxing	2009-09-05
      Bug#45852 Semisynch: Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
      
      Errors when send reply to master should never cause the IO thread
      to stop, because master can fall back to async replication if it
      does not get reply from slave.
      
      The problem is fixed by deliberately ignore the return value of
      slaveReply.
     @ plugin/semisync/semisync_slave_plugin.cc
        Deliberately ignore the return value of slaveReply so that errors
        while sending slave reply will not cause the IO thread to stop.
[8 Sep 2009 4:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82622

2802 He Zhenxing	2009-09-08
      Bug#45852 Semisynch: Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
      
      Errors when send reply to master should never cause the IO thread
      to stop, because master can fall back to async replication if it
      does not get reply from slave.
      
      The problem is fixed by deliberately ignore the return value of
      slaveReply.
     @ plugin/semisync/semisync_slave.cc
        Change the error message when send reply failed
     @ plugin/semisync/semisync_slave_plugin.cc
        Deliberately ignore the return value of slaveReply so that errors
        while sending slave reply will not cause the IO thread to stop.
[8 Sep 2009 6:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/82625

2802 He Zhenxing	2009-09-08
      Bug#45852 Semisynch: Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
      
      Errors when send reply to master should never cause the IO thread
      to stop, because master can fall back to async replication if it
      does not get reply from slave.
      
      The problem is fixed by deliberately ignoring the return value of
      slaveReply.
     @ plugin/semisync/semisync_slave.cc
        Change the error message when send reply failed
     @ plugin/semisync/semisync_slave_plugin.cc
        Deliberately ignore the return value of slaveReply so that errors
        while sending slave reply will not cause the IO thread to stop.
[26 Sep 2009 10:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/84715

2837 He Zhenxing	2009-09-26 [merge]
      Merge BUG#45852 to 6.0-codebase-bugfixing
[30 Sep 2009 3:58] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/85120

3627 He Zhenxing	2009-09-30
      Bug#45852 Semisynch: Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
      
      Errors when send reply to master should never cause the IO thread
      to stop, because master can fall back to async replication if it
      does not get reply from slave.
      
      The problem is fixed by deliberately ignoring the return value of
      slaveReply.
     @ plugin/semisync/semisync_slave_plugin.cc
        Deliberately ignore the return value of slaveReply so that errors
        while sending slave reply will not cause the IO thread to stop.
[9 Oct 2009 8:47] Bugs System
Pushed into 6.0.14-alpha (revid:alik@ibmvm-20091009083208-0o0f0i9w1sq3c1kn) (version source revid:zhenxing.he@sun.com-20090930035829-z5k0sbc0s6pjx30e) (merge vers: 6.0.14-alpha) (pib:12)
[9 Oct 2009 10:43] Jon Stephens
Documented bugfix in the 6.0.14 changelog as follows:

        When using semi-synchronous replication, an error raised on the
        slave while sending a response to the master caused the slave IO
        thread to stop.

Set status to NDI, waiting for push to 5.4.
[12 Oct 2009 12:57] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/86574

3115 He Zhenxing	2009-10-12
      Backport Bug#45852 Semisynch: Last_IO_Error: Fatal error: Failed to run 'after_queue_event' hook
      
      Errors when send reply to master should never cause the IO thread
      to stop, because master can fall back to async replication if it
      does not get reply from slave.
      
      The problem is fixed by deliberately ignoring the return value of
      slaveReply.
[27 Oct 2009 9:49] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091027094604-9p7kplu1vd2cvcju) (version source revid:zhenxing.he@sun.com-20091026140226-uhnqejkyqx1aeilc) (merge vers: 6.0.14-alpha) (pib:13)
[28 Oct 2009 6:37] Jon Stephens
Already documented for 6.0.14; re-closing.
[12 Nov 2009 8:18] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091110093229-0bh5hix780cyeicl) (version source revid:alik@sun.com-20091027095744-rf45u3x3q5d1f5y0) (merge vers: 5.5.0-beta) (pib:13)
[12 Nov 2009 14:27] Jon Stephens
Also documented in the 5.5.0 changelog; closed.
[18 Dec 2009 15:41] Paul DuBois
Removed 5.5.0 changelog entry. In 5.5, semisync replication first appears in 5.5.0, so this bug affects no 5.5.x releases.