Bug #72537 Replication stalls using Parallel Workers
Submitted: 5 May 2014 14:22    Modified: 5 Jun 2018 0:17
Reporter: Van Stokes
Status: No Feedback    Impact on me: None
Category: MySQL Server: Replication    Severity: S1 (Critical)
Version: 5.6.17    OS: Any (Windows, Linux)
Assigned to: MySQL Verification Team    CPU Architecture: Any
Tags: myisam, parallel, replication, slave, workers

[5 May 2014 14:22] Van Stokes
Description:
Master Server: Ubuntu 14.04 LTS (x64)
Master MySQL Server: 5.6.16 (x64) - Ubuntu Distro
Slave Server: Windows Server 2008 (x64)
Slave MySQL Server: 5.6.17 (x64) - MySQL (Oracle) Distro
Using Parallel Workers: Yes (4)

When replicating a MyISAM database, if the table being updated by the replication slave worker(s) is marked as crashed, all replication stops without warning. An entry is placed in the error log reporting the crashed table.

Moreover, you cannot successfully execute a STOP SLAVE command; it hangs indefinitely. The process list shows the thread as 'killing slave'. Eventually, you have to terminate the server via End Process (Windows) or kill -9 (Linux).
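
A minimal way to confirm the state on the stalled slave (a sketch; the table name is a placeholder for whichever replicated MyISAM table the error log reports as crashed):

SHOW FULL PROCESSLIST;   # workers typically sit in 'Waiting for an event from Coordinator'
SHOW SLAVE STATUS\G      # check Slave_SQL_Running and Last_SQL_Error
CHECK TABLE eci_elite_prod.some_table;    # placeholder name; confirms the table is marked crashed
REPAIR TABLE eci_elite_prod.some_table;   # standard MyISAM repair for the crashed table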

How to repeat:
MySQL Configuration:

# GTID Global Transactional ID's
gtid-mode = on
enforce-gtid-consistency = true

###############################
# Replication: SLAVE
#
log-slave-updates = 1
relay-log = slave-relay-bin
relay-log-index = slave-relay-bin.index
relay-log-info-repository = TABLE
relay-log-recovery = 1 # Added 2014-05-05
replicate-same-server-id = 0
slave-net-timeout = 15
slave-parallel-workers = 4
slave-pending-jobs-size-max = 32M # default 16M
slave-sql-verify-checksum = 0

# ONLY replicate changes from the master for the following db:
#replicate-do-db =
replicate-do-db = eci_edi_eculine
replicate-do-db = eci_elite_prod
replicate-do-db = oti_elite_prod

Replicate a MyISAM database and crash one of the MyISAM tables on the slave. Make sure there are updates that the slave needs to apply to the crashed MyISAM table. Optionally, have updates pending for other MyISAM tables and for MyISAM tables in other databases (which we do have going on); a repro sketch follows below.
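
A concrete sketch of the repro under those assumptions (database name taken from the replicate-do-db filter above, table name hypothetical; the table can be marked crashed on the slave by corrupting its .MYI index file while mysqld is stopped, or by any other method that leaves it flagged as crashed):

# On the master: a replicated MyISAM table with some data.
CREATE TABLE eci_elite_prod.repro_t (id INT PRIMARY KEY, val VARCHAR(64)) ENGINE=MyISAM;
INSERT INTO eci_elite_prod.repro_t VALUES (1, 'a'), (2, 'b');

# On the slave: after marking the table crashed, confirm it.
CHECK TABLE eci_elite_prod.repro_t;

# On the master: generate updates the slave workers must apply to the crashed table.
INSERT INTO eci_elite_prod.repro_t VALUES (3, 'c');
UPDATE eci_elite_prod.repro_t SET val = 'x' WHERE id = 1;

# On the slave: watch for the stall.
SHOW SLAVE STATUS\G
SHOW FULL PROCESSLIST;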

Suggested fix:
I suspect a deadlock or race condition is occurring. The worker cannot apply the update because the table is crashed.

BTW, after we restarted the server with the slave disabled, this time we got the following:

2014-05-05 09:36:21 4164 [ERROR] Slave SQL: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name .\slave-relay-bin.000243, position 17951 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master. Error_code: 1755
2014-05-05 09:36:21 4164 [Warning] Slave: Cannot execute the current event group in the parallel mode. Encountered event Format_desc, relay-log name .\slave-relay-bin.000243, position 17951 which prevents execution of this event group in parallel mode. Reason: possible malformed group of events from an old master. Error_code: 1755
2014-05-05 09:36:21 4164 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'master-bin.000014' position 489027972
[7 May 2014 13:42] Van Stokes
I modified the synopsis because we have now encountered this "stall" with no related errors logged in error.log.
[7 May 2014 13:48] Van Stokes
OS: Ubuntu 14.04 LTS (x64)
MySQL Server: 5.6.16 (x64) Ubuntu Distro

Here are some diagnostics from when the SLAVE stalled again:
1) Show the process list (file p1.txt)
2) Show slave status (file s.txt)
3) STOP SLAVE command issued (client is hung performing STOP SLAVE)
4) Show the process list again (file p2.txt)
[7 May 2014 13:49] Van Stokes
SHOW FULL PROCESSLIST

Attachment: p1.txt (text/plain), 3.42 KiB.

[7 May 2014 13:49] Van Stokes
SHOW SLAVE STATUS

Attachment: s.txt (text/plain), 2.27 KiB.

[7 May 2014 13:49] Van Stokes
SHOW FULL PROCESSLIST after STOP SLAVE

Attachment: p2.txt (text/plain), 3.46 KiB.

[7 May 2014 14:12] Van Stokes
Had to restart the MySQL server because of the hung slave.

# service mysql stop
(hung attempting to shutdown)

from the mysql error.log file:

2014-05-07 10:00:04 8313 [Note] /usr/sbin/mysqld: Normal shutdown

2014-05-07 10:00:04 8313 [Note] Giving 19 client threads a chance to die gracefully
2014-05-07 10:00:04 8313 [Note] Event Scheduler: Purging the queue. 2 events
2014-05-07 10:00:04 8313 [Note] Shutting down slave threads
2014-05-07 10:00:04 8313 [Note] Slave I/O thread killed while reading event
2014-05-07 10:00:04 8313 [Note] Slave I/O thread exiting, read up to log 'master-bin.000093', position 95623245
(no further entries)

Typically, the server takes around one minute to shut down. This time it took over five minutes.

Error log entries on startup:

2014-05-07 10:06:08 2103 [Note] InnoDB: The log sequence numbers 80885068473 and 80885068473 in ibdata files do not match the log sequence number 107226483453 in the ib_logfiles!
2014-05-07 10:06:08 2103 [Note] InnoDB: Database was not shutdown normally!
2014-05-07 10:06:08 2103 [Note] InnoDB: Starting crash recovery.
2014-05-07 10:06:08 2103 [Note] InnoDB: Reading tablespace information from the .ibd files...
2014-05-07 10:06:29 2103 [Note] InnoDB: Restoring possible half-written data pages
2014-05-07 10:06:29 2103 [Note] InnoDB: from the doublewrite buffer...
InnoDB: 1 transaction(s) which must be rolled back or cleaned up
InnoDB: in total 3199 row operations to undo
InnoDB: Trx id counter is 224093184
InnoDB: Last MySQL binlog file position 0 617585925, file name master-bin.000091
2014-05-07 10:06:33 2103 [Note] InnoDB: 128 rollback segment(s) are active.
InnoDB: Starting in background the rollback of uncommitted transactions
2014-05-07 10:06:33 7f3e7e45b700  InnoDB: Rolling back trx with id 217297891, 3199 rows to undo
InnoDB: Progress in percents: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 152014-05-07 10:06:34 2103 [Note] InnoDB: Waiting for purge to start
 162014-05-07 10:06:34 2103 [Note] InnoDB: 5.6.17 started; log sequence number 107226483453
2014-05-07 10:06:34 7f3e837fe700 InnoDB: Loading buffer pool(s) from .//ib_buffer_pool
2014-05-07 10:06:34 2103 [Note] Recovering after a crash using master-bin
 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 1002014-05-07 10:06:34 2103 [Note] InnoDB: Rollback of trx with id 217297891 completed
2014-05-07 10:06:34 7f3e7e45b700  InnoDB: Rollback of non-prepared transactions completed
2014-05-07 10:06:35 7f3e837fe700 InnoDB: Buffer pool(s) load completed at 140507 10:06:35
2014-05-07 10:06:37 2103 [Note] Starting crash recovery...
2014-05-07 10:06:37 2103 [Note] Crash recovery finished.

And now my SLAVE hang is back on the same transaction!

6 system user (NULL) Connect 4717 Waiting for an event from Coordinator	INSERT INTO lcl_exp_ratesheet SET comcde = pCom...
[7 May 2014 14:22] Van Stokes
I had to shut down the MySQL server - again - because the same transaction had hung the SLAVE. It took over five minutes to shut down, and it still did not shut down cleanly (per the error log).

I suspected the problem was in the relay log so I added this parameter to my my.cnf file:

relay-log-recovery = 1

I restarted MySQL and that setting seemed to work. The slave started running again.
[7 May 2014 14:24] Van Stokes
Current slave settings (from my.cnf):

###############################
# Replication: SLAVE
log-slave-updates = 1
relay-log = slave-relay-bin
relay-log-index = slave-relay-bin.index
relay-log-info-repository = TABLE
relay-log-recovery = 1 # Added 2014-05-07
replicate-same-server-id = 0
skip-slave-start = 0
slave-net-timeout = 15
slave-parallel-workers = 4
slave-pending-jobs-size-max = 32M # default 16M
slave-sql-verify-checksum = 1
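
After restarting with this configuration, a quick sanity check that the settings took effect (relay_log_recovery in particular is not a dynamic variable, so it only applies after a restart):

SHOW GLOBAL VARIABLES LIKE 'relay_log_recovery';
SHOW GLOBAL VARIABLES LIKE 'slave_parallel_workers';
SHOW GLOBAL VARIABLES LIKE 'slave_sql_verify_checksum';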
[7 May 2014 16:54] Van Stokes
Slave keeps stalling. This time, there doesn't seem to be any SQL activity being executed in the process list. I issued another STOP SLAVE from the client and the client hung. The process list shows the attempt to kill the slave. I had to forcibly stop the server again. Escalating this to critical; I cannot believe I am the only one with this issue.

I have changed my configuration:

slave-parallel-workers = 4
to
slave-parallel-workers = 0

to see if this helps.
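
For reference, slave_parallel_workers is also a dynamic variable in 5.6, so the same change can be made at runtime instead of through my.cnf, provided the SQL thread can actually be stopped (which is not the case once the stall has already happened); a sketch:

STOP SLAVE SQL_THREAD;
SET GLOBAL slave_parallel_workers = 0;
START SLAVE SQL_THREAD;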
[9 May 2014 14:15] Van Stokes
Setting slave-parallel-workers = 0 is the only solution.

Today I tried to set it back to four, only to have it stall again. I issued a STOP SLAVE, which hung. No slave threads were performing any work. This time, I killed each slave thread and one of them freed up my hung STOP SLAVE. The slave system was stopped; I did not have to stop the server.

This particular server handles mostly MyISAM databases and replicates them downstream to other masters and slaves. It imports data dumped by a mainframe system into a staging database (MyISAM, not replicated, not binlogged) and then 'syncs' the staging database to the replicated database (MyISAM, binlogged) using Webyog's SJA. This MyISAM database is treated as READ-ONLY on all downstream masters and slaves; only this server modifies it.

My only assumption here is that MyISAM is not properly handled by the parallel slave system.
[9 May 2014 15:00] Van Stokes
Well, killing the slave thread was not good. I cannot restart the slave; I get errors in the error log:

2014-05-09 10:03:49 4085 [Warning] Slave SQL: Coordinator thread of multi-threaded slave is being stopped in the middle of assigning a group of events; deferring to exit until the group completion ... , Error_code: 0
2014-05-09 10:03:49 4085 [Note] Slave SQL thread exiting, replication stopped in log 'master-bin.000107' at position 120
2014-05-09 10:03:53 4085 [ERROR] Error reading packet from server: Lost connection to MySQL server during query ( server_errno=2013)
2014-05-09 10:03:53 4085 [Note] Slave I/O thread killed while reading event
2014-05-09 10:03:53 4085 [Note] Slave I/O thread exiting, read up to log 'master-bin.000109', position 923623464
2014-05-09 10:15:34 4085 [Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name ./slave-relay-bin.000256, group_relay_log_pos 100595791 group_master_log_name master-bin.000107, group_master_log_pos 979175316
2014-05-09 10:15:35 4085 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 1001471773.
.... many lines of "Slave: MTS group recovery relay log info" ....
2014-05-09 10:15:58 4085 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 923607991.
2014-05-09 10:15:58 4085 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 923623464.
2014-05-09 10:15:58 4085 [ERROR] Error looking for file after ./slave-relay-bin.000261.

I had to stop the server.
Now I have this error:
Error No: 1872 Slave failed to initialize relay log info structure from the repository

2014-05-09 10:47:40 4845 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 923623464.
2014-05-09 10:47:40 4845 [Note] Slave: MTS group recovery relay log info group_master_log_name master-bin.000107, event_master_log_pos 908532420.
2014-05-09 10:47:40 4845 [ERROR] Error looking for file after ./slave-relay-bin.000262.
2014-05-09 10:47:40 4845 [ERROR] Failed to initialize the master info structure
2014-05-09 10:47:40 4845 [Note] Check error log for additional messages. You will not be able to start replication until the issue is resolved and the server restarted.
2014-05-09 10:47:40 4845 [Note] Event Scheduler: Loaded 2 events
2014-05-09 10:47:40 4845 [Note] /usr/sbin/mysqld: ready for connections.
Version: '5.6.17-0ubuntu0.14.04.1-log'  socket: '/var/run/mysqld/mysqld.sock'  port: 3306  (Ubuntu)
2014-05-09 10:54:32 7f7408d2b700 InnoDB: Buffer pool(s) load completed at 140509 10:54:32
2014-05-09 10:58:17 4845 [ERROR] Slave SQL: Slave failed to initialize relay log info structure from the repository, Error_code: 1872
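
One possible recovery path from this state (a sketch only, not verified against this exact failure): since GTIDs are enabled, discard the damaged relay logs and relay log info and let the slave refetch everything from the master's binary logs via auto-positioning:

STOP SLAVE;
RESET SLAVE;                               # deletes relay logs and reinitializes the relay log info
CHANGE MASTER TO MASTER_AUTO_POSITION = 1; # connection settings are retained by RESET SLAVE
START SLAVE;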
[1 Aug 2016 18:12] MySQL Verification Team
Seeing this on 5.7.14 when I stop the slave and start it again.

mysql> stop slave;
Query OK, 0 rows affected (0.00 sec)

mysql> start slave;
ERROR 1201 (HY000): Could not initialize master info structure; more error messages can be found in the MySQL error log

Error log contains:
-----
[Note] Slave I/O thread exiting for channel '', read up to log 'i7-bin.000001', position 51698390
[Note] Slave: MTS group recovery relay log info based on Worker-Id 0, group_relay_log_name .\thei7-relay.000007, group_relay_log_pos 258564868 group_master_log_name i7-bin.000001, group_master_log_pos 273638888
[ERROR] Error looking for file after .\thei7-relay.000009.
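
If it helps narrow this down: when relay-log-info-repository = TABLE is used (as in the reporter's configuration), the MTS recovery state these messages refer to lives in the mysql schema, so the stale worker coordinates can be inspected directly; a sketch, assuming TABLE repositories:

SELECT * FROM mysql.slave_relay_log_info\G
SELECT * FROM mysql.slave_worker_info\G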
[5 May 2018 0:17] MySQL Verification Team
Hi,
I can reproduce this "partially", using parallel workers and crashed MyISAM tables, but:
 - I can see errors
 - the tables actually are crashed

I'm not sure why that would be a bug. Initially you state that it "silently" stopped replication, but I can see errors (5.7), and MyISAM is not very "safe" in the first place, to be honest. With InnoDB I cannot reproduce it, and I cannot reproduce a scenario where nothing appears in the logs.

all best
Bogdan
[5 Jun 2018 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".