MySQL Bugs: #77239: SQL thread status information is lost if multi-threaded slave is enabled

Bug #77239	SQL thread status information is lost if multi-threaded slave is enabled
Submitted:	4 Jun 2015 2:24	Modified:	4 Jun 2015 7:20
Reporter:	Davi Arnaut (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.6, 5.6.26	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Last_SQL_Errno, Last_SQL_Error, MTS, show slave status, sql thread

Description:
When multi-threaded slave is enabled, an error that causes the SQL worker threads to stop is not preserved in the slave status information (SHOW SLAVE STATUS).

How to repeat:
Setup a slave with slave_parallel_workers greater than 0.

master> CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=InnoDB;
Query OK, 0 rows affected (0.03 sec)

master> INSERT INTO t1 VALUES (1);
Query OK, 1 row affected (0.01 sec)

slave> SET GLOBAL innodb_lock_wait_timeout = 1;
Query OK, 0 rows affected (0.00 sec)

slave> BEGIN;
Query OK, 0 rows affected (0.00 sec)

slave> SELECT * FROM t1 WHERE a = 1 FOR UPDATE;
+---+
| a |
+---+
| 1 |
+---+
1 row in set (0.00 sec)

slave> SELECT SLEEP(1);
+----------+
| SLEEP(1) |
+----------+
|        0 |
+----------+
1 row in set (1.00 sec)

slave> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 127.0.0.1
                  Master_User: root
                  Master_Port: 5555
                Connect_Retry: 60
              Master_Log_File: mysql-bin.000001
          Read_Master_Log_Pos: 393
               Relay_Log_File: mysqld-relay-bin.000002
                Relay_Log_Pos: 361
        Relay_Master_Log_File: mysql-bin.000001
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: 
                   Last_Errno: 1756
                   Last_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 151
              Relay_Log_Space: 808
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1756
               Last_SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 10
                  Master_UUID: 157c7cfa-0a32-11e5-9002-e83935312055
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 150603 19:21:16
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 157c7cfa-0a32-11e5-9002-e83935312055:1
            Executed_Gtid_Set: 
                Auto_Position: 0
1 row in set (0.00 sec)

Last_SQL_Errno is 1756 (ER_MTS_INCONSISTENT_DATA), the timeout error information is not available.

Error log:

2015-06-03 19:21:16 13646 [Warning] Slave SQL: Worker 49 failed executing transaction '157c7cfa-0a32-11e5-9002-e83935312055:1' at master log mysql-bin.000001, end_log_pos 362; Could not execute Update_rows event on table test.t1; Lock wait timeout exceeded; try restarting transaction, Error_code: 1205; handler error HA_ERR_LOCK_WAIT_TIMEOUT; the event's master log FIRST, end_log_pos 362, Error_code: 1205
2015-06-03 19:21:16 13646 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2015-06-03 19:21:16 13646 [Note] Error reading relay log event: slave SQL thread was killed
2015-06-03 19:21:16 13646 [Warning] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2015-06-03 19:21:16 13646 [Warning] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

Hello Davi,

Thank you for report and test case.
Observed this with latest 5.6.26 build.

Thanks,
Umesh

// Slave

mysql> SHOW SLAVE STATUS\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: localhost
                  Master_User: repl
                  Master_Port: 15000
                Connect_Retry: 60
              Master_Log_File: master-bin.000003
          Read_Master_Log_Pos: 1839
               Relay_Log_File: slave-relay-bin.000003
                Relay_Log_Pos: 1095
        Relay_Master_Log_File: master-bin.000003
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB:
          Replicate_Ignore_DB:
           Replicate_Do_Table:
       Replicate_Ignore_Table:
      Replicate_Wild_Do_Table:
  Replicate_Wild_Ignore_Table:
                   Last_Errno: 1756
                   Last_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 1624
              Relay_Log_Space: 1647
              Until_Condition: None
               Until_Log_File:
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File:
           Master_SSL_CA_Path:
              Master_SSL_Cert:
            Master_SSL_Cipher:
               Master_SSL_Key:
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error:
               Last_SQL_Errno: 1756
               Last_SQL_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).
  Replicate_Ignore_Server_Ids:
             Master_Server_Id: 1
                  Master_UUID: e447b07d-0a83-11e5-9218-0010e05f3e06
             Master_Info_File: mysql.slave_master_info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State:
           Master_Retry_Count: 86400
                  Master_Bind:
      Last_IO_Error_Timestamp:
     Last_SQL_Error_Timestamp: 150604 09:16:00
               Master_SSL_Crl:
           Master_SSL_Crlpath:
           Retrieved_Gtid_Set:
            Executed_Gtid_Set:
                Auto_Position: 0
1 row in set (0.00 sec)

##

2015-06-04 09:16:00 7429 [Warning] Slave SQL: Worker 4 failed executing transaction '' at master log master-bin.000003, end_log_pos 1808; Error 'Lock wait timeout exceeded; try restarting transaction' on query. Default database: 'test'. Query: 'update t1 set a = 3 where a = 2', Error_code: 1205
2015-06-04 09:16:00 7429 [ERROR] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2015-06-04 09:16:00 7429 [Note] Error reading relay log event: slave SQL thread was killed
2015-06-04 09:16:00 7429 [Warning] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756
2015-06-04 09:16:00 7429 [Warning] Slave SQL: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details). Error_code: 1756

Bug #80180 marked as duplicate of this

The problem is also that a regular stop slave/replica occasionally throws 1756 error for no apparent reason, even if resuming replication works without problems later. So this error is bit annoying and may trigger unnecessary alerts.

Repeated easily with 8.0.25 with generic 4 tables sysbench run:

slave1 > select @@version,@@version_comment,@@slave_parallel_workers,@@slave_parallel_type\G
*************************** 1. row ***************************
               @@version: 8.0.25
       @@version_comment: MySQL Community Server - GPL
@@slave_parallel_workers: 8
   @@slave_parallel_type: LOGICAL_CLOCK
1 row in set (0.00 sec)

slave1 > start slave; select sleep(1); stop slave; 
Query OK, 0 rows affected, 1 warning (0.02 sec)

slave1 > show slave status\G
*************************** 1. row ***************************
 ...
                   Last_Errno: 1756
                   Last_Error: ... The slave coordinator and worker threads are stopped, possibly leaving data in inconsistent state. A restart should restore consistency automatically, although using non-transactional storage for data or info tables or DDL queries could lead to problems. In such cases you have to examine your data (see documentation for details).