MySQL Bugs: #87235: "STOP GROUP_REPLICATION" cause deadlock

Bug #87235	"STOP GROUP_REPLICATION" cause deadlock
Submitted:	28 Jul 2017 7:33	Modified:	26 Sep 2017 14:23
Reporter:	Seunguck Lee	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: Group Replication	Severity:	S2 (Serious)
Version:	5.7.19	OS:	CentOS (Linux localhost 3.10.0-514.26.2.el7.x86_64 #1 SMP Tue Jul 4 15:04:05 UTC 2017 x86)
Assigned to:		CPU Architecture:	Any

Description:
MySQL server is frozen after issue "STOP GROUP_REPLICATION;" command on Primary (on single primary group replication).
But interesting thing is no problem with other remain members (They are elected new primary and no problem with INSERT and SELECT..)

mysql> show full processlist;
+-----+----------------------------+--------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Id  | User                       | Host               | db   | Command | Time | State                        | Info                                                                                                                                                                                                 |
+-----+----------------------------+--------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 157 | root                       | localhost          | NULL | Query   |  285 | starting                     | stop group_replication                                                                                                                                                                               |
| 167 | root                       | 192.168.0.2:58090  | test | Query   |  285 | query end                    | insert into test values (null, now())                                                                                                                                                                |
| 168 | mysql.session              | localhost          | NULL | Query   |  276 | Waiting for global read lock | PLUGIN: SET GLOBAL read_only= 1                                                                                                                                                                      |
| 169 | mysql_router1_kg12m0n956rc | 192.168.0.2:58096  | NULL | Query   |  230 | statistics                   | SELECT member_id, member_host, member_port, member_state, @@group_replication_single_primary_mode FROM performance_schema.replication_group_members WHERE channel_name = 'group_replication_applier' |
| 170 | root                       | localhost          | NULL | Query   |    0 | starting                     | show full processlist                                                                                                                                                                                |
+-----+----------------------------+--------------------+------+---------+------+------------------------------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

After time goes by, increase user "mysql_router1_kg12m0n956rc"'s session and query(blocked).

How to repeat:
I am not sure this is constant case or some timing issue.
Anyway this case is happen when I tested below 3 session case.

>> Session-1
  while(true){
      insert into test values (NULL, now());  ## on "read-write" channel via mysql router
      sleep(1 millis);
  }

>> Session-2
  while(true){
      select 1 from test limit 1  ## on "read-only" channel via mysql router
      sleep(1 millis);
  }

>> Session-3
  stop group_replication  ## on current primary member

Stack trace

Attachment: GroupReplication-Deadlock.txt (text/plain), 61.58 KiB.

meet the same situation :

- single primary mode
- stop group replication on primary node
- schedule an event to insert row into one table
- dead lock occur

processlist:

mysql> select * from information_schema.processlist order by time desc;
+------+-----------------+-----------+------+---------+------+------------------------------+-----------------------------------------------------------------+
| ID   | USER            | HOST      | DB   | COMMAND | TIME | STATE                        | INFO                                                            |
+------+-----------------+-----------+------+---------+------+------------------------------+-----------------------------------------------------------------+
| 5283 | root            | localhost | NULL | Sleep   | 1341 |                              | NULL                                                            |
| 5316 | root            | localhost | NULL | Query   |  152 | starting                     | stop group_replication                                          |
| 5657 | root            | localhost | test | Connect |  145 | query end                    | insert into test.t3 values(null)                                |
| 5658 | mysql.session   | localhost | NULL | Query   |  144 | Waiting for global read lock | PLUGIN: SET GLOBAL super_read_only= 1                           |
| 5291 | event_scheduler | localhost | NULL | Daemon  |  135 | Waiting for global read lock | NULL                                                            |
| 5661 | root            | localhost | NULL | Query   |    0 | executing                    | select * from information_schema.processlist order by time desc |
+------+-----------------+-----------+------+---------+------+------------------------------+-----------------------------------------------------------------+
6 rows in set (0.00 sec)

when I use "kill 5658" to kill the internal query on MGR "PLUGIN: SET GLOBAL super_read_only= 1 ", deadlock disappear...

pstack when deadlock occur

Attachment: pstack.log (application/octet-stream, text), 24.43 KiB.

what's worse, when kill the session of "set global super_read_only = 1", data inconsistency occur when I try to rejoin this node back to the group

Hi,

Thanks for the report. I cannot reliably reproduce this issue but I did manage to reproduce it few times. Looks like the freeze happens when doing stop replication while few sessions are doing insert and few sessions are doing select. That is a "normal" state of any db system (sessions running read/write), so I was not able to reliably find a 100% reproduction case. 

If any of you is able to offer an easy to reproduce test case I'd very much appreciate (that would for sure speed up fixing of this bug).

Thanks
Bogdan

Hi Bogdan Kecman, maybe you can use sysbench transaction write test to  reproduce this situation.

not reproducible any more with 5.7.20

Posted by developer:
 

This bug was fixed on wl#10960 Backport wl#10611 to MySQL 5.7.20.