Bug #89042 START GR HANGS AND WITH THAT NO NEW CONNECTION IS ALLOWED ON THAT SERVER
Submitted: 23 Dec 2017 6:47 Modified: 22 Feb 2018 11:05
Reporter: Narendra Chauhan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.5 OS:Any
Assigned to: CPU Architecture:Any

[23 Dec 2017 6:47] Narendra Chauhan
Description:
Scenario: Run start group_replication (forgot about setting bootstrap_group=ON), meanwhile from parallel session execute 'set global gtid_mode=ON'.

Expected output:- 
a) No hang.
b) There should be some return message to the client.
c) New connection should be allowed to connect to the server.

Actual Output:- a), b) and c) not happening.

Concern: Since, we are not able to connect to the server as well, we cannot check the processlist or kill the connection id etc.

Find the testcase in below section, which gets hang while cleaning up in "group_replication_end.inc".

Note:-
If we do NOT run 'SET GLOBAL GTID_MODE=ON' on parallel session we usually get following messages in log:-
2017-12-22T09:22:11.698904Z 0 [Warning] [MY-011254] Plugin group_replication reported: '[GCS] read failed'
2017-12-22T09:22:11.719080Z 0 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 12300'
2017-12-22T09:23:11.698200Z 7 [ERROR] [MY-011254] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2017-12-22T09:23:11.698263Z 7 [Note] [MY-011254] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2017-12-22T09:23:11.698308Z 7 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
2017-12-22T09:23:11.699344Z 18 [Note] [MY-010596] Error reading relay log event for channel 'group_replication_applier': slave SQL thread was killed
2017-12-22T09:23:11.700887Z 15 [Note] [MY-011254] Plugin group_replication reported: 'The group replication applier thread was killed'
2017-12-22T09:23:11.701902Z 0 [Note] [MY-011254] Plugin group_replication reported: 'Setting read_only=OFF.'

But, with it being set in parallel we are getting just following messages in log:
...
2017-12-22T09:28:44.734382Z 0 [Warning] [MY-011254] Plugin group_replication reported: '[GCS] read failed'
2017-12-22T09:28:44.743349Z 0 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 11300'
2017-12-22T09:29:44.718951Z 12 [ERROR] [MY-011254] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2017-12-22T09:29:44.719214Z 12 [ERROR] [MY-011254] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'
...

How to repeat:
Details:-
=========
MySQL Version: 8.0.5

Run command:- ./mtr group_replication.gr_bug2

Where,
$ cat suite/group_replication/t/gr_bug2.test
--let $rpl_skip_group_replication_start= 1
--source ../inc/have_group_replication_plugin.inc
--source include/force_restart.inc
--source ../inc/group_replication.inc

--connect(server1_1, localhost,root,,,$MASTER_MYPORT)
--connect(server1_2, localhost,root,,,$MASTER_MYPORT)

--let $rpl_connection_name= server1
--source include/rpl_connection.inc
SET GLOBAL group_replication_group_name= "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa";
SET GLOBAL group_replication_local_address= "127.0.0.1:11300";
SET GLOBAL group_replication_group_seeds= "127.0.0.1:11300, 127.0.0.1:11301";
SET GLOBAL group_replication_bootstrap_group= OFF;
--send START GROUP_REPLICATION

--let $rpl_connection_name= server1_1
--source include/rpl_connection.inc
--sleep 10
--send SET GLOBAL GTID_MODE=ON

--echo #
--let $rpl_connection_name= server1_2
--source include/rpl_connection.inc
SHOW PROCESSLIST;

--source ../inc/group_replication_end.inc
[3 Jan 2018 17:22] Anibal Pinto
Posted by developer:
 

On rpl_group_replication.cc, function group_replication_start, we have:

103       In order to prevent a concurrent client from executing SET
104       GTID_MODE=ON_PERMISSIVE between 1 and 2, we must hold
105       gtid_mode_lock.
106     */
107     gtid_mode_lock->rdlock();

When executing the statement SET GLOBAL GTID_MODE=ON it will block, it will wait
to lock gtid_mode_lock, a part of the stacktrace:

#4  Sys_var_gtid_mode::global_update at sql/sys_vars.cc:4007
#5  sys_var::update at sql/set_var.cc:254

On sys_vars.cc we have:

4007   gtid_mode_lock->wrlock();

Already locked on group_replication_start.

As a consequence, SET GLOBAL read_only= 0 blocks on sql/set_var.cc:

252     AutoWLock lock1(&PLock_global_system_variables);

Due already being locked by statement SET GLOBAL GTID_MODE=ON as can be seen on
the stacktrace.

The best approach seems to change SET GLOBAL GTID_MODE=ON to return an error if
is not able to obtain the lock.
[22 Feb 2018 11:05] David Moss
Posted by developer:
 
Thank you for your feedback, this has been fixed in upcoming versions and the following was added to the 8.0.11 changelog:
After issuing START GROUP_REPLICATION the GTID_MODE is locked to prevent any modification to its value until the group is online. Any attempt to try and change GTID_MODE during this time is blocked. As part of the process of starting Group Replication the server needs to set super_read_only=off, which has dependencies on locks acquired by SET GTID_MODE. This could result in Group Replication hanging and there was no possibility to connect to the server to resolve the situation. To prevent this situation, when it is not possible to acquire the locks needed by SET GTID_MODE the operation aborts.
[31 May 2018 14:53] David Moss
Reclosing.