Bug #117407 Binlog Replication: SQL Coordinator thread hang in "waiting for handler commit"
Submitted: 7 Feb 1:42 Modified: 10 Feb 18:03
Reporter: Jiaqi Tian
Status: Verified
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:8.0.38+ OS:Any
Assigned to: CPU Architecture:Any

[7 Feb 1:42] Jiaqi Tian
Description:
In multi-threaded binlog replication, when the coordinator thread reads EOF, the condition for entering wait_for_workers_to_finish is wrong.

It only checks m_rli->is_receiver_waiting_for_rl_space.load() (the I/O thread is waiting for relay log space) and !m_rli->is_in_group() (the coordinator thread is not in the middle of a group). However, is_in_group() applies only to single-threaded replication and always returns false in multi-threaded mode. As a result, the coordinator can enter wait_for_workers_to_finish while it is still reading in the middle of a transaction.

This can occur when a transaction spans two relay log files (in my experiments, this can be caused by FLUSH RELAY LOGS).

Replication can then get stuck in the following state:

Slave_IO_State: Waiting for the replica SQL thread to free relay log space
Slave_SQL_Running_State: waiting for handler commit

And all worker threads are idle.

How to repeat:
Make sure the relay log space is full, so that the I/O thread is waiting for log space.
Generate a transaction that spans two relay log files by running FLUSH RELAY LOGS.
When EOF is read, the workers finish all their jobs and stay idle, while the coordinator thread is stuck in wait_for_workers_to_finish because the last job assigned to a worker cannot be finished and marked done.

Suggested fix:
Change !m_rli->is_in_group() to something like:
(!m_rli->is_parallel_exec() && !m_rli->is_in_group()) || (m_rli->is_parallel_exec() && !m_rli->is_mts_in_group())
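
To illustrate the difference between the current check and the suggested one, here is a minimal, self-contained sketch. It is a toy model, not MySQL server code: MockRelayLogInfo, enters_wait_original and enters_wait_suggested are hypothetical names invented for this example, and the behavior of is_in_group() in multi-threaded mode is modeled on the assumption stated in the description above (it always returns false there).

```cpp
// Toy model only. MockRelayLogInfo is a hypothetical stand-in and does not
// reproduce the real Relay_log_info class from the MySQL server source.
struct MockRelayLogInfo {
  bool parallel_exec;                  // multi-threaded applier in use
  bool in_group_single_thread;         // group state of the single-threaded applier
  bool mts_in_group;                   // group state tracked by the coordinator
  bool receiver_waiting_for_rl_space;  // I/O thread blocked on relay log space

  bool is_parallel_exec() const { return parallel_exec; }
  // Assumption taken from the report: always false in multi-threaded mode.
  bool is_in_group() const { return !parallel_exec && in_group_single_thread; }
  bool is_mts_in_group() const { return mts_in_group; }
};

// Condition as described in the report: only is_in_group() is consulted.
bool enters_wait_original(const MockRelayLogInfo &rli) {
  return rli.receiver_waiting_for_rl_space && !rli.is_in_group();
}

// Condition with the suggested change: pick the group check that matches
// the applier mode. Logically equivalent to
// (!parallel && !in_group) || (parallel && !mts_in_group).
bool enters_wait_suggested(const MockRelayLogInfo &rli) {
  const bool at_group_boundary =
      rli.is_parallel_exec() ? !rli.is_mts_in_group() : !rli.is_in_group();
  return rli.receiver_waiting_for_rl_space && at_group_boundary;
}
```

In this model, a multi-threaded applier that is mid-transaction while the I/O thread waits for space (MockRelayLogInfo{true, false, true, true}) makes enters_wait_original return true, which corresponds to the hang described above, whereas enters_wait_suggested returns false. At a transaction boundary both functions return true.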
[7 Feb 8:37] MySQL Verification Team
Thank you for your report.
[10 Feb 18:03] Jiaqi Tian
It seems the problem has already been fixed in https://github.com/mysql/mysql-server/commit/9c62600827b5ff5a0e34b45a0ee7145eac56ffa7