Bug #87235 "STOP GROUP_REPLICATION" cause deadlock
Submitted: 28 Jul 2017 7:33 Modified: 26 Sep 2017 14:23
Reporter: Seunguck Lee Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:5.7.19 OS:CentOS (Linux localhost 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86)
Assigned to: CPU Architecture:Any

[28 Jul 2017 7:33] Seunguck Lee
Description:
MySQL server is frozen after issue "STOP GROUP_REPLICATION;" command on Primary (on single primary group replication).
But interesting thing is no problem with other remain members (They are elected new primary and no problem with INSERT and SELECT..)

mysql> show full processlist;
+-----+----------------------------+--------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id  | User                       | Host               | db   | Command | Time | State                        | Info                                                                                                                                                                                                 |
+-----+----------------------------+--------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 157 | root                       | localhost          | NULL | Query   |  285 | starting                     | stop group_replication                                                                                                                                                                               |
| 167 | root                       | 192.168.0.2:58090  | test | Query   |  285 | query end                    | insert into test values (null, now())                                                                                                                                                                |
| 168 | mysql.session              | localhost          | NULL | Query   |  276 | Waiting for global read lock | PLUGIN: SET GLOBAL read_only= 1                                                                                                                                                                      |
| 169 | mysql_router1_kg12m0n956rc | 192.168.0.2:58096  | NULL | Query   |  230 | statistics                   | SELECT member_id, member_host, member_port, member_state, @@group_replication_single_primary_mode FROM performance_schema.replication_group_members WHERE channel_name = 'group_replication_applier' |
| 170 | root                       | localhost          | NULL | Query   |    0 | starting                     | show full processlist                                                                                                                                                                                |
+-----+----------------------------+--------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After time goes by, increase user "mysql_router1_kg12m0n956rc"'s session and query(blocked).

How to repeat:
I am not sure this is constant case or some timing issue.
Anyway this case is happen when I tested below 3 session case.

>> Session-1
  while(true){
      insert into test values (NULL, now());  ## on "read-write" channel via mysql router
      sleep(1 millis);
  }

>> Session-2
  while(true){
      select 1 from test limit 1  ## on "read-only" channel via mysql router
      sleep(1 millis);
  }

>> Session-3
  stop group_replication  ## on current primary member
[28 Jul 2017 7:34] Seunguck Lee
Stack trace

Attachment: GroupReplication-Deadlock.txt (text/plain), 61.58 KiB.

[2 Aug 2017 1:46] Jackin Chan
meet the same situation :

- single primary mode
- stop group replication on primary node
- schedule an event to insert row into one table
- dead lock occur

processlist:

mysql> select * from information_schema.processlist order by time desc;
+------+-----------------+-----------+------+---------+------+------------------------------+-----------------------------------------------------------------+
| ID   | USER            | HOST      | DB   | COMMAND | TIME | STATE                        | INFO                                                            |
+------+-----------------+-----------+------+---------+------+------------------------------+-----------------------------------------------------------------+
| 5283 | root            | localhost | NULL | Sleep   | 1341 |                              | NULL                                                            |
| 5316 | root            | localhost | NULL | Query   |  152 | starting                     | stop group_replication                                          |
| 5657 | root            | localhost | test | Connect |  145 | query end                    | insert into test.t3 values(null)                                |
| 5658 | mysql.session   | localhost | NULL | Query   |  144 | Waiting for global read lock | PLUGIN: SET GLOBAL super_read_only= 1                           |
| 5291 | event_scheduler | localhost | NULL | Daemon  |  135 | Waiting for global read lock | NULL                                                            |
| 5661 | root            | localhost | NULL | Query   |    0 | executing                    | select * from information_schema.processlist order by time desc |
+------+-----------------+-----------+------+---------+------+------------------------------+-----------------------------------------------------------------+
6 rows in set (0.00 sec)

when I use "kill 5658" to kill the internal query on MGR "PLUGIN: SET GLOBAL super_read_only= 1 ", deadlock disappear...
[2 Aug 2017 1:48] Jackin Chan
pstack when deadlock occur

Attachment: pstack.log (application/octet-stream, text), 24.43 KiB.

[2 Aug 2017 1:56] Jackin Chan
what's worse, when kill the session of "set global super_read_only = 1", data inconsistency occur when I try to rejoin this node back to the group
[21 Aug 2017 18:23] Bogdan Kecman
Hi,

Thanks for the report. I cannot reliably reproduce this issue but I did manage to reproduce it few times. Looks like the freeze happens when doing stop replication while few sessions are doing insert and few sessions are doing select. That is a "normal" state of any db system (sessions running read/write), so I was not able to reliably find a 100% reproduction case. 

If any of you is able to offer an easy to reproduce test case I'd very much appreciate (that would for sure speed up fixing of this bug).

Thanks
Bogdan
[22 Aug 2017 1:13] Jackin Chan
Hi Bogdan Kecman, maybe you can use sysbench transaction write test to  reproduce this situation.
[26 Sep 2017 11:37] Bogdan Kecman
not reproducible any more with 5.7.20
[26 Sep 2017 14:23] Anibal Pinto
Posted by developer:
 

This bug was fixed on wl#10960 Backport wl#10611 to MySQL 5.7.20.