MySQL Bugs: #71495: Logging of MTS breakage is insufficient.

Bug #71495	Logging of MTS breakage is insufficient.
Submitted:	27 Jan 2014 20:34	Modified:	2 Dec 2014 21:20
Reporter:	Simon Mudd (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.6.14	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
A MEM master (5.6.15) replicates to a MEM slave (5.6.14), both use RBR, GTID and slave_parallel_workers = 10.

Logging on MEM slave shows:

2014-01-27 19:47:27 4797 [Note] Slave: MTS group recovery relay log info group_master_log_name binlog.000601, event_master_log_pos 185160503.
2014-01-27 19:47:27 4797 [Note] Slave: MTS group recovery relay log info group_master_log_name binlog.000601, event_master_log_pos 185160894.
2014-01-27 19:47:27 4797 [Note] Slave: MTS group recovery relay log info group_master_log_name binlog.000601, event_master_log_pos 185161685.
2014-01-27 19:47:27 4797 [Note] Slave: MTS group recovery relay log info group_master_log_name binlog.000601, event_master_log_pos 185162108.
2014-01-27 19:47:28 4797 [Note] Slave SQL thread initialized, starting replication in log 'binlog.000601' at position 185112333, relay log '../log/relaylog.001798' position: 35857328
2014-01-27 19:47:28 4797 [Note] Slave: MTS Recovery has completed at relay log ../log/relaylog.001798, position 35909132 master log binlog.000601, position 185164137.
2014-01-27 19:47:28 4797 [ERROR] Slave SQL: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name ../log/relaylog.001799, position 235 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master. Error_code: 1755
2014-01-27 19:47:28 4797 [Warning] Slave: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name ../log/relaylog.001799, position 235 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master. Error_code: 1755
2014-01-27 19:47:28 4797 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'binlog.000601' position 185164137
2014-01-27 19:57:02 4797 [Note] Slave I/O thread exiting, read up to log 'binlog.000604', position 278692129

How to repeat:
See above:

Suggested fix:
Workaround was to do: set global slave_parallel_workers =0; stop slave; start slave;

2014-01-27 19:57:04 4797 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2014-01-27 19:57:04 4797 [Note] Slave SQL thread initialized, starting replication in log 'binlog.000601' at position 185164137, relay log '../log/relaylog.001798' position: 35909132
2014-01-27 19:57:04 4797 [Note] Slave I/O thread: connected to master 'replication_user@mymasterserver.example.com:3306',replication started in log 'binlog.000604' at position 278692129

On this system puppet will return the parallel_slave_threads back to 10 later.

So it is not clear:
(1) exactly what the problem is
(2) what the correct path is to solve the problem.

Given the server and slave have been working fine I just thought I'd try temporarily to set the slave_parallel_workers to 0 and see what happens but I don't know why it works now.

Therefore:
(a) please improve logging to explain what the real problem is on the slave, which statement or situation is causing problems.
(b) provide a better idea of what needs to be done to resolve this (or link to documentation)
(c) consider trying automatically to change the parallel_slave_threads value to 0 , and restore later (?)

This was triggered due to me trying to make a workaround for bug#71197 and scripting a periodic restart of the replication threads to avoid the memory leak.

see also BUG 17018841 - RUNTIME SEMI SYNC+GTID LAST_SQL_ERROR: CANNOT EXECUTE THE CURRENT EVENT GROUP IN

Same issue occurs with us with slave_parallel_workers = 4

Master: Ubuntu (14.04 LTS) running MySQL 5.6.16 x64
Slave Ubuntu (12.04 LTS) running MySQL 5.6.17 x64

Slave stop with 
Cannot execute the current event group in the parallel mode.

However, only ONE of our slaves was affected. The other slaves did not have this error.

Thank you for the report.

I cannot repeat error with STOP SLAVE; START SLAVE in version 5.6.23, but I agree that error message can be improved. Set to verified.

I encounter the same bug with mysql 5.6.23.
slave_parallel_workers=16

the output of 'show slave status\G'
mysql> show slave status\G
*************************** 1. row ***************************
               Slave_IO_State: Waiting for master to send event
                  Master_Host: 10.46.241.58
                  Master_User: mysqlsync
                  Master_Port: 3100
                Connect_Retry: 60
              Master_Log_File: mysql-bin.009808
          Read_Master_Log_Pos: 711805974
               Relay_Log_File: relay-log.007413
                Relay_Log_Pos: 283
        Relay_Master_Log_File: mysql-bin.009616
             Slave_IO_Running: Yes
            Slave_SQL_Running: No
              Replicate_Do_DB: 
          Replicate_Ignore_DB: 
           Replicate_Do_Table: 
       Replicate_Ignore_Table: 
      Replicate_Wild_Do_Table: 
  Replicate_Wild_Ignore_Table: mysql.%,test.%
                   Last_Errno: 1755
                   Last_Error: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name ./relay-log.007413, position 283 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode.
                 Skip_Counter: 0
          Exec_Master_Log_Pos: 701843116
              Relay_Log_Space: 119231733669
              Until_Condition: None
               Until_Log_File: 
                Until_Log_Pos: 0
           Master_SSL_Allowed: No
           Master_SSL_CA_File: 
           Master_SSL_CA_Path: 
              Master_SSL_Cert: 
            Master_SSL_Cipher: 
               Master_SSL_Key: 
        Seconds_Behind_Master: NULL
Master_SSL_Verify_Server_Cert: No
                Last_IO_Errno: 0
                Last_IO_Error: 
               Last_SQL_Errno: 1755
               Last_SQL_Error: Cannot execute the current event group in the parallel mode. Encountered event Xid, relay-log name ./relay-log.007413, position 283 which prevents execution of this event group in parallel mode. Reason: the event is a part of a group that is unsupported in the parallel execution mode.
  Replicate_Ignore_Server_Ids: 
             Master_Server_Id: 170848570
                  Master_UUID: 9aaa0f01-474c-11e6-a3c9-6c0b848338fc
             Master_Info_File: /ssd/mysql/ramon/ts/ts0/var/master.info
                    SQL_Delay: 0
          SQL_Remaining_Delay: NULL
      Slave_SQL_Running_State: 
              Semi_Sync_Group: 
           Master_Retry_Count: 86400
                  Master_Bind: 
      Last_IO_Error_Timestamp: 
     Last_SQL_Error_Timestamp: 161027 15:47:37
               Master_SSL_Crl: 
           Master_SSL_Crlpath: 
           Retrieved_Gtid_Set: 
            Executed_Gtid_Set: 
                Auto_Position: 0
1 row in set (0.00 sec)
======================================
and i found that  the first binlog event in relay-log relay-log.007413 is a XID event.
i think this event may cause the MTS break 
================================================
/*!50530 SET @@SESSION.PSEUDO_SLAVE_MODE=1*/;
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
DELIMITER /*!*/;
# at 4
#161020 12:11:32 server id 170936361  end_log_pos 120 CRC32 0x00e18f04  Start: binlog v 4, server v 5.6.23-baidu-trunk-2.2.1.1-debug-log created 161020 12:11:32
BINLOG '
9EMIWA8pSDAKdAAAAHgAAABAAAQANS42LjIzLWJhaWR1LXRydW5rLTIuMi4xLjEtZGVidWctbG9n
AAAAAAAAAAAAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAQSP
4QA=
'/*!*/;
# at 120
#700101  8:00:00 server id 170848570  end_log_pos 0 CRC32 0x56893f67    Rotate to mysql-bin.009616  pos: 701843116
# at 167
#161020 11:00:51 server id 170848570  end_log_pos 0 CRC32 0x363b9e6b    Start: binlog v 4, server v 5.6.23-baidu-table-parallel-repl-1.0.0.0-log created 161020 11:00:51
BINLOG '
YzMIWA868S4KdAAAAAAAAAAAAAQANS42LjIzLWJhaWR1LXRhYmxlLXBhcmFsbGVsLXJlcGwtMS4w
LjAuMC1sb2cAAAAAAAAAAAAAEzgNAAgAEgAEBAQEEgAAXAAEGggAAAAICAgCAAAACgoKGRkAAWue
OzY=
'/*!*/;
# at 283
#161020 11:37:48 server id 170848570  end_log_pos 701843147 CRC32 0x38827024    Xid = 3581848124
COMMIT/*!*/;
# at 314
#161020 11:37:48 server id 170848570  end_log_pos 701843238 CRC32 0xcb81368c    Query   thread_id=86023656      exec_time=0     error_code=0
================================================