Bug #81584 Slave worker threads randomly hang
Submitted: 25 May 2016 7:32 Modified: 11 Jun 2018 10:50
Reporter: Wei Zhao (OCA) Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.12 OS:Linux
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: hang, replication, slave

[25 May 2016 7:32] Wei Zhao
Description:
The slave mysqld process sometimes hangs if there are many slave worker threads and under big stress.

The symptom is that the slave process sometimes hangs. Using pstack to print the stacks of each threads of the mysql slave process, I can see all slave worker threads are either waiting for more jobs to work on, or are waiting as a follower for the leader thread to complete. And since the slave sql thread has to wait before prior transactions to commit, it can not give more transactions to the waiting worker threads to work on, so the slave process hangs.

So I saw the code of Stage_manager::signal_done() function and found that the mysql_cond_broadcast(&m_cond_done) is called after mysql_mutex_unlock(&m_lock_done), but according to the doc of pthread_cond_broadcast, it should be called before the mutex is released:

In this page, http://pubs.opengroup.org/onlinepubs/009695399/functions/pthread_cond_broadcast.html
Notice this sentence: "however, if predictable scheduling behavior is required, then that mutex shall be locked by the thread calling pthread_cond_broadcast() or pthread_cond_signal()."

So I made such changes and then the hang never happened again in many following test runs. Patch is attached.

How to repeat:
I set up a simple binlog replication with one slave replicating from one master, the config files for them are as below. The slave and master simply runs on one computer server (running a linux variant of centos, Linux TENCENT64.site 3.10.0-1-tlinux2-0014.tl2 #1 SMP Wed Mar 11 11:43:03 CST 2015 x86_64 x86_64 x86_64 GNU/Linux) with 24 CPUs, massive memory and fast SSD storage.

I use sysbench to do stress testing, sysbench parameters: ./sysbench --num-threads=200 --test=./tests/db/oltp.lua --oltp-table-size=100000 --oltp_auto_inc=off --oltp-tables-count=10  --report-interval=1 --mysql-host=127.0.0.1 --mysql-port=3338 --mysql-user=root --mysql-password=root --mysql-db=test  --max-time=3600 --max-requests=0 --mysql-engine-trx=yes  run

Slave config file:

[mysqld]
datadir=/mysql_data/mysql-instances/d/datadir
port=3339
tmpdir=/mysql_data/mysql-instances/d/tmpdir
socket=/mysql_data/mysql-instances/d/mysql.sock
server-id=13
innodb_flush_log_at_trx_commit=1
sync_binlog=1
log-bin=binlog
slave_parallel_type=logical_clock
slave_parallel_workers=128
gtid-mode=ON
enforce-gtid-consistency
relay-log=relay_log
log_slave_updates 

Master config file:

[mysqld]
datadir=/mysql_data/mysql-instances/c/datadir
port=3338
tmpdir=/mysql_data/mysql-instances/c/tmpdir
socket=/mysql_data/mysql-instances/c/mysql.sock
server-id=12
innodb_flush_log_at_trx_commit=1
sync_binlog=1
log-bin=binlog
slave_parallel_type=logical_clock
max_connections=255
gtid-mode=ON
enforce-gtid-consistency

Suggested fix:
In Stage_manager::signal_done() function, signal all follower threads before releasing the 'm_lock_done' mutex.

Patch as below:

diff --git a/sql/binlog.cc b/sql/binlog.cc
index e8ddabd..ef34f35 100644
--- a/sql/binlog.cc
+++ b/sql/binlog.cc
@@ -1950,8 +1950,8 @@ void Stage_manager::signal_done(THD *queue)
   mysql_mutex_lock(&m_lock_done);
   for (THD *thd= queue ; thd ; thd = thd->next_to_commit)
     thd->get_transaction()->m_flags.pending= false;
-  mysql_mutex_unlock(&m_lock_done);
   mysql_cond_broadcast(&m_cond_done);
+  mysql_mutex_unlock(&m_lock_done);
 }
 
 #ifndef DBUG_OFF
[26 May 2016 2:04] Wei Zhao
Adding one piece of information to the bug locating deduction: what the pstack shows means that the leader thread failed to signal the followers, so the followers kept waiting and the leader thread has finished committing all transactions and becomes blocked waiting for more work. Thus I though the Stage_manager::signal_done() function, which does the signaling, may have a problem.
[1 Aug 2016 17:24] MySQL Verification Team
also:
 http://bugs.mysql.com/bug.php?id=77730
[11 Jun 2018 10:50] MySQL Verification Team
As Shane already wrote I too believe this is a duplicate of https://bugs.mysql.com/bug.php?id=77730
[25 Jul 2019 14:39] Demon Chen
http://bugs.mysql.com/bug.php?id=77730 I can not open it.. 

Could you tell me which version fix this bug
[25 Jul 2019 20:49] MySQL Verification Team
Hi,
It is not fixed yet
[3 Jul 2020 8:37] WANG GUANGYOU
met the same problem too!