Bug #86581 GCS timeout waiting for executing stop group_replication when loss majority
Submitted: 5 Jun 2017 4:48 Modified: 14 Jun 2017 3:34
Reporter: jianxin chen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S4 (Feature request)
Version:5.7.18 OS:Any
Assigned to: CPU Architecture:Any
Tags: group_replication

[5 Jun 2017 4:48] jianxin chen
Description:
group replication GCS timeout waiting for executing "stop group_replication" when the group loss majority, it will always take more than 30s to executed this command.

How to repeat:
(1) Make two node to form one replication group in the mode of single primary, "10.202.7.88:24802" is primary node

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 5f113098-41f6-11e7-9506-fa163ee40410 | 10.202.7.88 |       24802 | ONLINE       |
| group_replication_applier | f777ea65-41f7-11e7-90a2-fa163ee40410 | 10.202.7.88 |       24803 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

(2) Then kill the process id of primary node, and we got these view:

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 5f113098-41f6-11e7-9506-fa163ee40410 | 10.202.7.88 |       24802 | UNREACHABLE  |
| group_replication_applier | f777ea65-41f7-11e7-90a2-fa163ee40410 | 10.202.7.88 |       24803 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)

(3) Now I want to make "10.202.7.88:24803" leave the group, so I run "stop group_replication" command on it.

mysql> stop group_replication;
Query OK, 0 rows affected (35.13 sec)

It will cost more than 30s to be return! Finally we found there are some error message in the error log, like these :

2017-06-02T16:48:15.064662+08:00 135 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2017-06-02T16:48:45.065303+08:00 0 [ERROR] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to exit!'
2017-06-02T16:48:45.065477+08:00 0 [ERROR] Plugin group_replication reported: '[GCS] The member has failed to gracefully leave the group.'
2017-06-02T16:48:45.065745+08:00 0 [Note] Plugin group_replication reported: 'state 4330 action xa_terminate'
2017-06-02T16:48:45.066149+08:00 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-06-02T16:48:45.066175+08:00 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'
2017-06-02T16:48:45.066226+08:00 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-06-02T16:48:45.066242+08:00 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-06-02T16:48:45.066451+08:00 0 [Warning] Plugin group_replication reported: 'read failed'
2017-06-02T16:48:50.202781+08:00 135 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
2017-06-02T16:48:50.202903+08:00 135 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
2017-06-02T16:48:50.203197+08:00 72 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2017-06-02T16:48:50.208388+08:00 69 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'

Please pay attention to the first two lines, time interval is about 30s.

Suggested fix:
I guess it will happen when no enough majority nodes to certificate the view change(stop group_replication will make one view change event) and it will block until GCS meet timeout.

I don't know if it is reasonable to wait 30s and I found there is no way to customize it.

I guess the timeout of 30s comes from 'gcs_xcom_utils.cc':

static const uint64_t WAITING_TIME= 30;

Is it reasonable to make this timeout configuration to be customized or there are any solutions to avoid blocking until timeout?
[14 Jun 2017 3:34] MySQL Verification Team
Hi,

This is not a bug. I do believe 30sec is reasonable time for timeout here and I personally would not change it (especially I would not lower it) but I do agree it would be nice to have this configurable so I'm changing the bug status to feature request.

thanks for your submission
Bogdan Kecman