Description:
group replication GCS timeout waiting for executing "stop group_replication" when the group loss majority, it will always take more than 30s to executed this command.
How to repeat:
(1) Make two node to form one replication group in the mode of single primary, "10.202.7.88:24802" is primary node
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 5f113098-41f6-11e7-9506-fa163ee40410 | 10.202.7.88 | 24802 | ONLINE |
| group_replication_applier | f777ea65-41f7-11e7-90a2-fa163ee40410 | 10.202.7.88 | 24803 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)
(2) Then kill the process id of primary node, and we got these view:
mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 5f113098-41f6-11e7-9506-fa163ee40410 | 10.202.7.88 | 24802 | UNREACHABLE |
| group_replication_applier | f777ea65-41f7-11e7-90a2-fa163ee40410 | 10.202.7.88 | 24803 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0.00 sec)
(3) Now I want to make "10.202.7.88:24803" leave the group, so I run "stop group_replication" command on it.
mysql> stop group_replication;
Query OK, 0 rows affected (35.13 sec)
It will cost more than 30s to be return! Finally we found there are some error message in the error log, like these :
2017-06-02T16:48:15.064662+08:00 135 [Note] Plugin group_replication reported: 'Going to wait for view modification'
2017-06-02T16:48:45.065303+08:00 0 [ERROR] Plugin group_replication reported: '[GCS] Timeout while waiting for the group communication engine to exit!'
2017-06-02T16:48:45.065477+08:00 0 [ERROR] Plugin group_replication reported: '[GCS] The member has failed to gracefully leave the group.'
2017-06-02T16:48:45.065745+08:00 0 [Note] Plugin group_replication reported: 'state 4330 action xa_terminate'
2017-06-02T16:48:45.066149+08:00 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-06-02T16:48:45.066175+08:00 0 [Note] Plugin group_replication reported: 'state 4257 action xa_exit'
2017-06-02T16:48:45.066226+08:00 0 [Note] Plugin group_replication reported: 'Exiting xcom thread'
2017-06-02T16:48:45.066242+08:00 0 [Note] Plugin group_replication reported: 'new state x_start'
2017-06-02T16:48:45.066451+08:00 0 [Warning] Plugin group_replication reported: 'read failed'
2017-06-02T16:48:50.202781+08:00 135 [Note] Plugin group_replication reported: 'auto_increment_increment is reset to 1'
2017-06-02T16:48:50.202903+08:00 135 [Note] Plugin group_replication reported: 'auto_increment_offset is reset to 1'
2017-06-02T16:48:50.203197+08:00 72 [Note] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2017-06-02T16:48:50.208388+08:00 69 [Note] Plugin group_replication reported: 'The group replication applier thread was killed'
Please pay attention to the first two lines, time interval is about 30s.
Suggested fix:
I guess it will happen when no enough majority nodes to certificate the view change(stop group_replication will make one view change event) and it will block until GCS meet timeout.
I don't know if it is reasonable to wait 30s and I found there is no way to customize it.
I guess the timeout of 30s comes from 'gcs_xcom_utils.cc':
static const uint64_t WAITING_TIME= 30;
Is it reasonable to make this timeout configuration to be customized or there are any solutions to avoid blocking until timeout?