Description:
In multi-threaded binlog replication, when the coordinator thread reads EOF, the condition for entering wait_for_workers_to_finish is wrong.
It only checks m_rli->is_receiver_waiting_for_rl_space.load() (the I/O thread is waiting for relay log space) and !m_rli->is_in_group() (intended to ensure the coordinator thread is not in the middle of a group). However, is_in_group() is only meaningful in single-threaded mode and always returns false in multi-threaded mode. As a result, the coordinator can enter wait_for_workers_to_finish while it is reading in the middle of a transaction.
This can occur when a transaction spans two relay log files (in my experiment, this was caused by FLUSH RELAY LOGS).
Replication can then get stuck in the following state:
Slave_IO_State: Waiting for the replica SQL thread to free relay log space
Slave_SQL_Running_State: waiting for handler commit
And all worker threads are idle.
How to repeat:
Make sure the relay log space is full, so that the I/O thread is waiting for relay log space.
Generate a transaction that spans two relay log files, for example by running FLUSH RELAY LOGS.
When EOF is read, the workers finish all the jobs they have received and become idle, while the coordinator thread is stuck in wait_for_workers_to_finish, because the last job assigned to a worker belongs to a transaction that has not been fully read and so cannot be finished and marked done.
Suggested fix:
Change !m_rli->is_in_group() to something like:
((!m_rli->is_parallel_exec() && !m_rli->is_in_group()) || (m_rli->is_parallel_exec() && !m_rli->is_mts_in_group()))
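
To make the difference between the two conditions concrete, here is a minimal standalone sketch. It is not the server code: RliModel, current_check and suggested_check are hypothetical names made up for illustration, and the struct only mimics the predicates named in this report, including the reported behaviour that is_in_group() stays false in multi-threaded mode.

```cpp
#include <cassert>

// Hypothetical stand-in for the few m_rli predicates named in this report.
// This is an illustration only, not the real Relay_log_info class.
struct RliModel {
  bool parallel_exec;                  // multi-threaded applier is in use
  bool in_group;                       // single-threaded "middle of a group" flag
  bool mts_in_group;                   // coordinator is in the middle of a group
  bool receiver_waiting_for_rl_space;  // I/O thread is waiting for relay log space

  bool is_parallel_exec() const { return parallel_exec; }
  bool is_in_group() const { return in_group; }
  bool is_mts_in_group() const { return parallel_exec && mts_in_group; }
};

// The check as described in this report: it ignores the multi-threaded
// group state, so it can be true in the middle of a transaction.
bool current_check(const RliModel &rli) {
  return rli.receiver_waiting_for_rl_space && !rli.is_in_group();
}

// The check with the suggested replacement for !is_in_group().
bool suggested_check(const RliModel &rli) {
  return rli.receiver_waiting_for_rl_space &&
         ((!rli.is_parallel_exec() && !rli.is_in_group()) ||
          (rli.is_parallel_exec() && !rli.is_mts_in_group()));
}

// Small self-check covering the scenario from "How to repeat".
void self_check() {
  // Multi-threaded mode, relay log space full, EOF hit mid-transaction.
  RliModel mid_trx{true, false, true, true};
  assert(current_check(mid_trx));     // would enter wait_for_workers_to_finish
  assert(!suggested_check(mid_trx));  // would keep reading instead
}
```

With the modelled state from "How to repeat" (multi-threaded mode, relay log space full, EOF reached in the middle of a transaction), current_check returns true while suggested_check returns false. When the coordinator is at a group boundary, both return true, so the wait would still happen where it is safe.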