MySQL Bugs: #117407: Binlog Replication: SQL Coordinator thread hang in "waiting for handler commit"

Bug #117407	Binlog Replication: SQL Coordinator thread hang in "waiting for handler commit"
Submitted:	7 Feb 1:42	Modified:	10 Feb 18:03
Reporter:	Jiaqi Tian
Status:	Verified
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	8.0.38+	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
In multi-threaded binlog replication, when the coordinator thread reads EOF, the condition for entering wait_for_workers_to_finish is wrong.

It only checks m_rli->is_receiver_waiting_for_rl_space.load() (the I/O thread is waiting for relay log space) and !m_rli->is_in_group() (to ensure the coordinator thread is not in the middle of a group). However, is_in_group() is only meaningful in single-threaded mode and always returns false in multi-threaded mode. As a result, the coordinator can enter wait_for_workers_to_finish while it is still reading in the middle of a transaction.

This can occur when a transaction spans two relay log files (in my experiments, this can be caused by FLUSH RELAY LOGS).

Replication can then get stuck in the following state:

Slave_IO_State: Waiting for the replica SQL thread to free relay log space
Slave_SQL_Running_State: waiting for handler commit

And all worker threads are idle.

How to repeat:
Make sure the relay log space is full, so that the I/O thread is waiting for log space.
Generate a transaction that spans two relay log files by running FLUSH RELAY LOGS.
When EOF is read, the workers finish all their jobs and stay idle, while the coordinator thread is stuck in wait_for_workers_to_finish because the last job assigned to a worker cannot be finished and marked done.

Suggested fix:
Change !m_rli->is_in_group() to something like:
(!m_rli->is_parallel_exec() && !m_rli->is_in_group()) || (m_rli->is_parallel_exec() && !m_rli->is_mts_in_group())
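
To illustrate the difference between the two conditions, here is a minimal, self-contained sketch. It is not the MySQL server code: ToyRelayLogInfo, its fields, and the two should_wait_* helpers are hypothetical stand-ins, and the method bodies are assumptions based only on the behaviour described in this report (in particular, that is_in_group() always returns false in multi-threaded mode).

```cpp
#include <cassert>

// Hypothetical stand-in for the relay log state discussed in this report.
// Only the method names come from the report; fields and bodies are assumed.
struct ToyRelayLogInfo {
  bool parallel = false;          // multi-threaded applier enabled?
  bool in_group = false;          // single-threaded "inside a group" flag
  bool mts_in_group = false;      // coordinator is mid-group in MTS mode
  bool receiver_waiting = false;  // I/O thread waiting for relay log space

  bool is_parallel_exec() const { return parallel; }
  // Assumption from the report: always false in multi-threaded mode.
  bool is_in_group() const { return !parallel && in_group; }
  bool is_mts_in_group() const { return parallel && mts_in_group; }
};

// Condition as described in the report: only is_in_group() is consulted.
bool should_wait_original(const ToyRelayLogInfo& rli) {
  return rli.receiver_waiting && !rli.is_in_group();
}

// Condition with the suggested change applied.
bool should_wait_suggested(const ToyRelayLogInfo& rli) {
  const bool not_in_group =
      (!rli.is_parallel_exec() && !rli.is_in_group()) ||
      (rli.is_parallel_exec() && !rli.is_mts_in_group());
  return rli.receiver_waiting && not_in_group;
}
```

With parallel, mts_in_group, and receiver_waiting all set to true (the coordinator is mid-transaction and the I/O thread is waiting for space), should_wait_original returns true while should_wait_suggested returns false, which is the case this report is about.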

Thank you for your report.

It seems the problem is already solved in https://github.com/mysql/mysql-server/commit/9c62600827b5ff5a0e34b45a0ee7145eac56ffa7