Bug #81386 MTS hangs on file related *_EVENT forced checkpoint
Submitted: 12 May 2016 5:41 Modified: 12 May 2016 5:50
Reporter: Trey Raymond
Status: Open
Category: MySQL Server: Replication Severity: S3 (Non-critical)
Version: 5.6.29 OS: Any
Assigned to: Umesh Shastry CPU Architecture: Any

[12 May 2016 5:41] Trey Raymond
MTS forces a checkpoint at binlog file rollover because of the file-related events in the stream (rotate, format description and the like).

this can lead to unexpected slave hangs where all worker threads end up waiting on a single one. a big statement near the start of a binlog file might behave just fine, but the same statement towards the end of a file causes severe lag.
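
a toy model of that effect, as a rough sketch only: it assumes transactions go to whichever worker frees up first (real MTS assigns work by database) and that the rollover is a barrier every worker has to reach. the function name and the durations are made up for illustration.

```python
def stall_at_rotation(durations, workers):
    """Toy model of one binlog file followed by a forced checkpoint.

    durations: execution time of each transaction, in binlog order.
    Returns (barrier_time, idle_worker_seconds).
    """
    free_at = [0] * workers  # time at which each worker becomes free
    for d in durations:
        # Assumption: dispatch to the earliest-free worker (not by database).
        i = min(range(workers), key=free_at.__getitem__)
        free_at[i] += d
    barrier = max(free_at)  # rollover completes only when every worker is done
    idle = sum(barrier - t for t in free_at)  # time workers spend just waiting
    return barrier, idle


if __name__ == "__main__":
    small = [1] * 300
    print(stall_at_rotation([100] + small, workers=4))  # (100, 0)
    print(stall_at_rotation(small + [100], workers=4))  # (175, 300)
```

same workload both times: with the big statement first the other workers stay busy until the rollover, with it last three workers sit idle for 100 time units each.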

How to repeat:
- set up a master/slave with multiple schemas that have write traffic going to them, and MTS enabled with a few worker threads
- create a table (test_table below) and populate it with quite a few GB of data.  format doesn't matter, just size
- truncate this on the master (keep the data on the slave)
- run show master status repeatedly until the position is near the end of the current binlog, i.e. close to max_binlog_size
- on the master: alter table test_table engine=innodb; (with no data on the master this finishes at once and gets into the repl stream immediately, while the slave has to rebuild the whole table)
- observe mysql.slave_worker_info on the slave and correlate with processlist/performance_schema threads. you'll see one worker executing the big alter, and one or more executing transactions on the other dbs
- wait for the other threads' log file pos to hit the end of the binlog. they will stall waiting for a checkpoint, which can't complete until the thread running the alter is finished - thus, it's back to a single thread blocking, defeating the purpose of MTS

Suggested fix:
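you can reduce the chance of this happening by increasing max_binlog_size, but that's not infinitely sustainable (the setting is capped at 1GB), and since it comes down to when a long-running statement happens to execute relative to the rollover, it can still cause major issues even with huge files.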
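
a back-of-envelope estimate of how much a bigger file helps. this is a sketch under assumptions that are mine, not measured: binlog is written at a steady rate, and the long statement starts at a random point relative to the rollover. the function and parameter names are hypothetical.

```python
MAX_BINLOG_SIZE_LIMIT = 1024 ** 3  # max_binlog_size cannot exceed 1GB


def rotation_overlap_probability(exec_seconds, binlog_bytes_per_second,
                                 max_binlog_size):
    """Rough chance that a rollover lands while a long statement is running.

    Assumes a steady binlog write rate and a uniformly random start time.
    """
    size = min(max_binlog_size, MAX_BINLOG_SIZE_LIMIT)
    seconds_per_file = size / binlog_bytes_per_second
    return min(1.0, exec_seconds / seconds_per_file)


if __name__ == "__main__":
    mib = 1024 ** 2
    # 10 minute alter on the slave, master writing 1 MiB/s of binlog
    print(rotation_overlap_probability(600, mib, 100 * mib))   # 1.0
    print(rotation_overlap_probability(600, mib, 1024 * mib))  # 0.5859375
```

even at the maximum file size, a ten minute statement in this example still has better than even odds of running into a rollover.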

fix would be to let the worker threads gracefully handle 'binlog management' events as specified in https://dev.mysql.com/doc/internals/en/binlog-event.html - this may be difficult to implement...maybe:

- detect events related to end of binlog/rotate to next binlog/start
- select the next available worker thread for this batch of events, the same way workers are selected for batches of events on an actual database
- have that worker process the events gracefully, only the coordinator would wait on it
- coordinator can continue processing the next log once that batch is done by the worker, without waiting for the other workers to drain
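
to show what the steps above would buy, here is a toy comparison of a global barrier at each rollover against a rollover that only the coordinator waits on. it is a simplified sketch, not the server's scheduler: dispatch goes to the earliest-free worker, the rollover batch itself is treated as free, and the ordering and recovery constraints that make the real change hard are ignored. all names are hypothetical.

```python
def makespan(files, workers, global_barrier):
    """Total time to apply several binlog files in a toy MTS model.

    files: list of binlog files, each a list of transaction durations.
    global_barrier: True models today's forced checkpoint at rollover,
    False models a rollover that does not hold up the other workers.
    """
    free_at = [0] * workers  # time at which each worker becomes free
    for durations in files:
        for d in durations:
            # Assumption: dispatch to the earliest-free worker.
            i = min(range(workers), key=free_at.__getitem__)
            free_at[i] += d
        if global_barrier:
            # Nobody starts the next file until the slowest worker is done.
            done = max(free_at)
            free_at = [done] * workers
    return max(free_at)


if __name__ == "__main__":
    # big statement at the end of the first file, normal traffic in the second
    files = [[1] * 300 + [100], [1] * 300]
    print(makespan(files, workers=4, global_barrier=True))   # 250
    print(makespan(files, workers=4, global_barrier=False))  # 175
```

without the barrier the other three workers apply the second file while the big statement is still running, so the total is bounded by the big statement rather than by big statement plus next file.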

that's off the top of my head, it will be more complex in practice, but there's definitely a better way to handle this.
[12 May 2016 5:50] Trey Raymond
peeking into 5.7 code, looks like a dev noted this issue as well and had some comments (but no change in the code):