MySQL Bugs: #91671: stop slave sql_thread for channel 'group_replication

Bug #91671	stop slave sql_thread for channel 'group_replication_applier' could not return
Submitted:	17 Jul 2018 2:08	Modified:	27 Nov 2018 12:05
Reporter:	Zhenghu Wen (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	5.7.22	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
when doing sysbench prepare with MGR single primary mode, i exec "stop slave sql_thread for channel 'group_replication_applier'" in secondary node, and it could not return.

node2>show processlist;
+----+-------------+-----------+--------------------+---------+------+-----------------------------+---------------------------------------------------------------+
| Id | User        | Host      | db                 | Command | Time | State                       | Info                                                          |
+----+-------------+-----------+--------------------+---------+------+-----------------------------+---------------------------------------------------------------+
|  8 | system user |           | NULL               | Connect | 1173 | executing                   | NULL                                                          |
| 11 | system user |           | NULL               | Connect |  778 | Waiting for workers to exit | NULL                                                          |
| 12 | system user |           | NULL               | Connect |  778 | System lock                 | NULL                                                          |
| 56 | root        | localhost | performance_schema | Query   |    0 | starting                    | show processlist                                              |
| 57 | root        | localhost | performance_schema | Query   |  778 | Killing slave               | stop slave sql_thread for channel 'group_replication_applier' |
+----+-------------+-----------+--------------------+---------+------+-----------------------------+---------------------------------------------------------------+
5 rows in set (0.00 sec)

the stack show that it has a deadlock on replication channel lock 'm_channel_lock'. 
a. stop slave has obtained write lock, and waiting for all worker threads and coordinate thread left.
Thread 9 (Thread 0x7fd10cf32700 (LWP 956)):
#0  0x00007fd22272d6bb in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000fa8df0 in native_cond_timedwait (mutex=0x1a6c6d28, abstime=0x7fd10cf30630, cond=0x1a6c6e28) at /home/hzwenzhh/mysql-5.7.22/include/thr_cond.h:129
#2  my_cond_timedwait (mp=0x1a6c6d28, cond=0x1a6c6e28, abstime=0x7fd10cf30630) at /home/hzwenzhh/mysql-5.7.22/include/thr_cond.h:182
#3  inline_mysql_cond_timedwait (abstime=0x7fd10cf30630, mutex=0x1a6c6d28, that=0x1a6c6e28, src_file=<optimized out>, src_line=<optimized out>) at /home/hzwenzhh/mysql-5.7.22/include/mysql/
psi/mysql_thread.h:1229
#4  terminate_slave_thread (need_lock_term=<optimized out>, stop_wait_timeout=<optimized out>, slave_running=<optimized out>, term_cond=<optimized out>, term_lock=<optimized out>, thd=<optimized out>) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:1851
#5  terminate_slave_thread (thd=0x1a9b6000, term_lock=0x1a6c6d28, term_cond=0x1a6c6e28, slave_running=0x1a6c6ee4, stop_wait_timeout=0x7fd10cf306c0, need_lock_term=false) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:1785
#6  0x0000000000fac04f in terminate_slave_threads (need_lock_term=false, stop_wait_timeout=<optimized out>, thread_mask=<optimized out>, mi=0x1a8fa400) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:1661
#7  terminate_slave_threads (mi=0x1a8fa400, thread_mask=2, stop_wait_timeout=<optimized out>, need_lock_term=false) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:1639
#8  0x0000000000fb38f7 in stop_slave (thd=thd@entry=0x1a79e000, mi=mi@entry=0x1a8fa400, net_report=net_report@entry=true, for_one_channel=for_one_channel@entry=true, push_temp_tables_warning=push_temp_tables_warning@entry=0x7fd10cf30780) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:10199
#9  0x0000000000fb5389 in stop_slave_cmd (thd=0x1a79e000) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:796
#10 0x0000000000ded77f in mysql_execute_command (thd=thd@entry=0x1a79e000, first_level=first_level@entry=true) at /home/hzwenzhh/mysql-5.7.22/sql/sql_parse.cc:3438
#11 0x0000000000df0565 in mysql_parse (thd=thd@entry=0x1a79e000, parser_state=parser_state@entry=0x7fd10cf316b0) at /home/hzwenzhh/mysql-5.7.22/sql/sql_parse.cc:5582
#12 0x0000000000df1bb3 in dispatch_command (thd=thd@entry=0x1a79e000, com_data=com_data@entry=0x7fd10cf31d10, command=COM_QUERY) at /home/hzwenzhh/mysql-5.7.22/sql/sql_parse.cc:1458
#13 0x0000000000df240d in do_command (thd=0x1a79e000) at /home/hzwenzhh/mysql-5.7.22/sql/sql_parse.cc:999
#14 0x0000000000ea36c4 in handle_connection (arg=arg@entry=0x2f35740) at /home/hzwenzhh/mysql-5.7.22/sql/conn_handler/connection_handler_per_thread.cc:300
#15 0x000000000105d9e1 in pfs_spawn_thread (arg=0x1a62bb50) at /home/hzwenzhh/mysql-5.7.22/storage/perfschema/pfs.cc:2190
#16 0x00007fd222728b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#17 0x00007fd22148efbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x0000000000000000 in ?? ()

b. worker thread will go into MGR before_commit hook and wants the read lock when calling channel_get_thread_id(...)
Thread 4 (Thread 0x7fd222de9700 (LWP 1068)):
#0  0x00007fd22272c8a3 in pthread_rwlock_rdlock () from /lib/x86_64-linux-gnu/libpthread.so.0
#1  0x0000000000fbdd8f in native_rw_rdlock (rwp=0x1a863138) at /home/hzwenzhh/mysql-5.7.22/include/thr_rwlock.h:79
#2  inline_mysql_rwlock_rdlock (that=0x1a863138, src_file=<optimized out>, src_line=<optimized out>) at /home/hzwenzhh/mysql-5.7.22/include/mysql/psi/mysql_thread.h:895
#3  rdlock (this=0x1a863130) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_gtid.h:471
#4  Master_info::channel_rdlock (this=<optimized out>) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_mi.cc:609
#5  0x0000000000fcf53e in channel_get_thread_id (channel=<optimized out>, thd_type=CHANNEL_APPLIER_THREAD, thread_id=0x7fd222de78e0) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_channel_service_interface.cc:659
#6  0x00007fd10aab454c in Replication_thread_api::is_own_event_applier (this=<optimized out>, id=12, channel_name=<optimized out>) at /home/hzwenzhh/mysql-5.7.22/rapid/plugin/group_replication/src/replication_threads_api.cc:260
#7  0x00007fd10aaa7d8e in group_replication_trans_before_commit (param=0x7fd222de7a90) at /home/hzwenzhh/mysql-5.7.22/rapid/plugin/group_replication/src/observer_trans.cc:271
#8  0x0000000000d6a55e in Trans_delegate::before_commit (this=0x1f853a0, thd=<optimized out>, all=<optimized out>, trx_cache_log=<optimized out>, stmt_cache_log=<optimized out>, cache_log_max_size=<optimized out>) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_handler.cc:325
#9  0x0000000000f98936 in MYSQL_BIN_LOG::commit (this=0x1fa8680, thd=0x1aa50000, all=true) at /home/hzwenzhh/mysql-5.7.22/sql/binlog.cc:8595
#10 0x000000000099413e in ha_commit_trans (thd=0x1aa50000, all=true, ignore_global_read_lock=false) at /home/hzwenzhh/mysql-5.7.22/sql/handler.cc:1796
#11 0x0000000000e86fa9 in trans_commit (thd=0x1aa50000) at /home/hzwenzhh/mysql-5.7.22/sql/transaction.cc:239
#12 0x0000000000f6b6c2 in Xid_log_event::do_commit (this=0x2e2cfd00, thd_arg=0x1aa50000) at /home/hzwenzhh/mysql-5.7.22/sql/log_event.cc:7055
#13 0x0000000000f6b65b in Xid_apply_log_event::do_apply_event_worker (this=0x2e2cfd48, w=0x1a9e2000) at /home/hzwenzhh/mysql-5.7.22/sql/log_event.cc:7150
#14 0x0000000000fca70b in slave_worker_exec_job_group (worker=0x1a9e2000, rli=0x1a6c6800) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_rli_pdb.cc:2594
#15 0x0000000000faf9db in handle_slave_worker (arg=arg@entry=0x1a9e2000) at /home/hzwenzhh/mysql-5.7.22/sql/rpl_slave.cc:6169
#16 0x000000000105d9e1 in pfs_spawn_thread (arg=0x1a62b970) at /home/hzwenzhh/mysql-5.7.22/storage/perfschema/pfs.cc:2190
#17 0x00007fd222728b50 in start_thread () from /lib/x86_64-linux-gnu/libpthread.so.0
#18 0x00007fd22148efbd in clone () from /lib/x86_64-linux-gnu/libc.so.6
#19 0x0000000000000000 in ?? ()

as stop slave thread has got the write lock, worker thread could not get the read lock, so worker thread will not exit, and so do stop slave thread.

How to repeat:
1. create a mgr group with three nodes using single primary mode. node1 is primary.
2. run sysbench prepare on node1;
3. execute "stop slave sql_thread for channel 'group_replication_applier'" on node2;
4. deadlock happened

Hello Zhenghu,

Thank you for the report and feedback.
I tried to follow your steps from "how to repeat" but not seeing any issues i.e no blocking etc observed. GR set up was same as in Bug#91042(91042_5.7.22.results) and sysbench 1.0.13 (using bundled LuaJIT 2.1.0-beta2) and tried almost all tests such as below:

bulk_insert.lua
oltp_common.lua
oltp_delete.lua
oltp_insert.lua
oltp.lua
oltp_point_select.lua
oltp_read_only.lua
oltp_read_write.lua
oltp_update_index.lua
oltp_update_non_index.lua
oltp_write_only.lua
select_random_points.lua
select_random_ranges.lua

bin/sysbench --tables=10 --table-size=5000000 --threads=50 --mysql-db=sbtest --mysql-user=root --mysql-socket=/tmp/mysql_hod03.sock --time=300 share/sysbench/select_random_points.lua prepare

^^ While above was running, tried on non-primary node

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 5e18a539-898e-11e8-8d96-0010e05f3e06 | hod03       |        3333 | ONLINE       |
| group_replication_applier | 64179dc1-898e-11e8-9db2-0010e05f4178 | hod04       |        3333 | ONLINE       |
| group_replication_applier | 6a03f852-898e-11e8-ac3f-0010e0734b98 | hod06       |        3333 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

mysql> SHOW STATUS LIKE 'group_replication_primary_member';
+----------------------------------+--------------------------------------+
| Variable_name                    | Value                                |
+----------------------------------+--------------------------------------+
| group_replication_primary_member | 5e18a539-898e-11e8-8d96-0010e05f3e06 |
+----------------------------------+--------------------------------------+
1 row in set (0.00 sec)

mysql> stop slave sql_thread for channel 'group_replication_applier';
Query OK, 0 rows affected (0.02 sec)

mysql> stop slave sql_thread for channel 'group_replication_applier';
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> start slave sql_thread for channel 'group_replication_applier';
Query OK, 0 rows affected (0.00 sec)

Could you please provide exact sysbench prepare command you were using, config details from all the nodes? Thank you!

Thanks,
Umesh

Hi, Umesh:

sysbench version is:

hzwenzhh@db-181:~$ /home/hzwenzhh/sysbench/bin/sysbench --version
sysbench 0.5

sysbench prepare script is:
./mysql/bin/mysql -uroot -S ./node1/mysql.sock  -e 'create database sbtest'
/home/hzwenzhh/sysbench/bin/sysbench --mysql-host=127.0.0.1  --mysql-port=10001 --mysql-user=rpl_user --mysql-password=rpl_pass --mysql-db=sbtest  --test=/home/hzwenzhh/sysbench/db/oltp.lua   --oltp_tables_count=10 --oltp-table-size=1000000000 --rand-init=on --num-threads=256  --report-interval=2 prepare

Hi, Umesh:

may be this script will help you:
./stop-slave-applier-loop.sh 2

hzwenzhh@db-181:~$ cat stop-slave-applier-loop.sh 
while true;
do
echo "begin stop slave applier channel"
./mysql/bin/mysql -uroot -S ./node$1/mysql.sock  -e "stop slave sql_thread for channel 'group_replication_applier'"
echo "finish stop slave, and start again"
./mysql/bin/mysql -uroot -S ./node$1/mysql.sock  -e "start slave sql_thread for channel 'group_replication_applier'"
echo "finish start slave, will do next loop..."
sleep 5
done

Thank you for the details.
Verified as described.

Thanks,
Umesh

Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 5.7.25 / 8.0.14 changelog:
When stopping replication, any channels that had pending transactions could cause a deadlock in Group Replication.