MySQL Bugs: #101237: stop group_replication may block for a long time when the server restarts

Bug #101237	stop group_replication may block for a long time when the server restarts
Submitted:	20 Oct 2020 8:38	Modified:	27 Oct 2020 7:34
Reporter:	phoenix Zhang (OCA)
Status:	Verified
Category:	MySQL Server: Group Replication	Severity:	S3 (Non-critical)
Version:	8.0.21	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	group_replication

Description:
In a 5-node cluster where every node has group_replication_start_on_boot=ON, if all nodes are shut down and then restarted at the same time, executing `STOP GROUP_REPLICATION` on any node blocks for a long time.

If group_replication_start_on_boot is persisted, the server invokes `plugin_group_replication_start` internally on restart, and that call holds the lv.plugin_running_mutex lock. This is why `STOP GROUP_REPLICATION` is blocked.

How to repeat:
1. First, initialize 5 MySQL servers and start them with mysqld_safe:

   ./bin/mysqld --defaults-file=my13000.cnf --user=mysql --initialize-insecure
   ./bin/mysqld --defaults-file=my13001.cnf --user=mysql --initialize-insecure
   ./bin/mysqld --defaults-file=my13002.cnf --user=mysql --initialize-insecure
   ./bin/mysqld --defaults-file=my13003.cnf --user=mysql --initialize-insecure
   ./bin/mysqld --defaults-file=my13004.cnf --user=mysql --initialize-insecure

   ./bin/mysqld_safe --defaults-file=my13000.cnf --user=mysql &
   ./bin/mysqld_safe --defaults-file=my13001.cnf --user=mysql &
   ./bin/mysqld_safe --defaults-file=my13002.cnf --user=mysql &
   ./bin/mysqld_safe --defaults-file=my13003.cnf --user=mysql &
   ./bin/mysqld_safe --defaults-file=my13004.cnf --user=mysql &

2. Build the 5 nodes into a Group Replication cluster, then on each node run `set persist group_replication_start_on_boot=on`:

Connect to each of 13000-13004:
mysql> install plugin group_replication soname 'group_replication.so';
Query OK, 0 rows affected (0.01 sec)

mysql> CHANGE MASTER TO MASTER_USER="root", MASTER_PASSWORD="" FOR CHANNEL "group_replication_recovery";
Query OK, 0 rows affected, 1 warning (0.03 sec)

mysql> reset master;
Query OK, 0 rows affected (0.05 sec)

// only on 13000
mysql> SET GLOBAL group_replication_bootstrap_group=ON;
Query OK, 0 rows affected (0.00 sec)

mysql> start group_replication;
Query OK, 0 rows affected (33.75 sec)

mysql> set persist group_replication_start_on_boot=on;
Query OK, 0 rows affected (0.00 sec)

3. Group Replication now works normally:

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-----------------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST           | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-----------------------+-------------+--------------+-------------+----------------+
| group_replication_applier | 04828891-12aa-11eb-9a50-c8f7507e5048 | ***                   |       13004 | ONLINE       | SECONDARY   | 8.0.21         |
| group_replication_applier | a898ca67-c6fb-11ea-9567-c8f7507e5048 | ***                   |       13000 | ONLINE       | PRIMARY     | 8.0.21         |
| group_replication_applier | ad672129-c6fb-11ea-a1ad-c8f7507e5048 | ***                   |       13001 | ONLINE       | SECONDARY   | 8.0.21         |
| group_replication_applier | b2da5323-c6fb-11ea-9186-c8f7507e5048 | ***                   |       13002 | ONLINE       | SECONDARY   | 8.0.21         |
| group_replication_applier | ffa85f73-12a9-11eb-8f52-c8f7507e5048 | ***                   |       13003 | ONLINE       | SECONDARY   | 8.0.21         |
+---------------------------+--------------------------------------+-----------------------+-------------+--------------+-------------+----------------+
5 rows in set (0.00 sec)

4. Kill all 5 nodes; mysqld_safe will restart them:

# kill -9 22791 23203 23794 25090 25452
2020-10-20T08:02:04.712271Z mysqld_safe Number of processes running now: 0
2020-10-20T08:02:04.716119Z mysqld_safe mysqld restarted
2020-10-20T08:02:04.723246Z mysqld_safe Number of processes running now: 0
2020-10-20T08:02:04.725689Z mysqld_safe mysqld restarted
2020-10-20T08:02:04.728540Z mysqld_safe Number of processes running now: 0
2020-10-20T08:02:04.730970Z mysqld_safe mysqld restarted
2020-10-20T08:02:04.742896Z mysqld_safe Number of processes running now: 0
2020-10-20T08:02:04.745009Z mysqld_safe mysqld restarted
2020-10-20T08:02:04.762546Z mysqld_safe Number of processes running now: 0
2020-10-20T08:02:04.765004Z mysqld_safe mysqld restarted

5. Connect to 13000; `stop group_replication` blocks for a long time:

mysql> stop group_replication;
Query OK, 0 rows affected (26 min 22.48 sec)
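
The initialize and start commands in step 1 follow one pattern per port, so they can be generated rather than typed out. This is a minimal sketch, assuming the config files are named myPORT.cnf as in the steps above. `gr_node_cmds` is a made-up helper name, and it only prints the commands; pipe its output to `sh` from the MySQL install directory to actually run them.

```shell
#!/bin/sh
# Hypothetical helper (not part of MySQL): print the initialize and start
# commands for every node port, in the same order as the manual steps.
gr_node_cmds() {
    first=${1:-13000}
    last=${2:-13004}
    # First pass: initialize each data directory.
    port=$first
    while [ "$port" -le "$last" ]; do
        echo "./bin/mysqld --defaults-file=my${port}.cnf --user=mysql --initialize-insecure"
        port=$((port + 1))
    done
    # Second pass: start each server under mysqld_safe in the background.
    port=$first
    while [ "$port" -le "$last" ]; do
        echo "./bin/mysqld_safe --defaults-file=my${port}.cnf --user=mysql &"
        port=$((port + 1))
    done
}

# Dry run: show the ten commands for ports 13000 to 13004.
gr_node_cmds 13000 13004
```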

Attachment: my13000.cnf (application/octet-stream, text), 1.08 KiB.
Attachment: my13001.cnf (application/octet-stream, text), 1.08 KiB.
Attachment: my13002.cnf (application/octet-stream, text), 1.08 KiB.
Attachment: my13003.cnf (application/octet-stream, text), 1.08 KiB.
Attachment: my13004.cnf (application/octet-stream, text), 1.08 KiB.

The function `xcom_send_app_wait_and_get` invokes `xcom_send_client_app_data(fd, a, force);` and then `rp = socket_read_msg(fd, p);`.

When rp == REQUEST_RETRY, it sleeps for 1s and retries, which costs 10s each time. It seems that REQUEST_RETRY may be returned if both nodes have started their XCom ports.

So when all 5 nodes restart at the same time, each of them retries again and again in `xcom_send_app_wait_and_get` (costing 10s each time). The booting thread holds the lock for that whole period, so other Group Replication operations cannot proceed.
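
The mechanism described above, a start path that holds a lock while it repeatedly sleeps and retries, can be modeled with a toy shell script. This is only a sketch under stated assumptions: the function names (`boot_start`, `stop_gr`, `demo`) and the directory-based lock are invented for illustration and are not MySQL code, the retry count of 10 is inferred from the 1s sleep and 10s cost quoted above, and the sleep is shortened so the demo finishes in about a second.

```shell
#!/bin/sh
# Toy model only: none of these names exist in MySQL. A directory stands in
# for lv.plugin_running_mutex (mkdir is atomic, so it can act as a lock).
LOCK_DIR="${TMPDIR:-/tmp}/gr_lock_demo.$$"
RETRY_SLEEP=${RETRY_SLEEP:-0.1}   # the report describes a 1s sleep
MAX_RETRIES=${MAX_RETRIES:-10}    # assumption: 10 x 1s matches the 10s figure

acquire_lock() {
    # Poll until the lock directory can be created.
    until mkdir "$LOCK_DIR" 2>/dev/null; do sleep 0.05; done
}
release_lock() { rmdir "$LOCK_DIR"; }

# Stand-in for the boot-time start: takes the lock, then behaves as if every
# reply were REQUEST_RETRY, and releases the lock only after the last retry.
boot_start() {
    acquire_lock
    attempt=1
    while [ "$attempt" -le "$MAX_RETRIES" ]; do
        sleep "$RETRY_SLEEP"
        attempt=$((attempt + 1))
    done
    echo "boot: releasing lock after $MAX_RETRIES retries"
    release_lock
}

# Stand-in for STOP GROUP_REPLICATION: it needs the same lock first.
stop_gr() {
    acquire_lock
    echo "stop: acquired lock"
    release_lock
}

demo() {
    boot_start &
    # Wait until the boot path really holds the lock before issuing the stop.
    until [ -d "$LOCK_DIR" ]; do sleep 0.01; done
    stop_gr
    wait
}

demo
```

In this model the stop line is always printed after the boot line, because `stop_gr` cannot take the lock until the whole retry loop has finished. Longer or repeated retry rounds delay the stop by the same amount.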

Hello phoenix Zhang!

Thank you for the report and feedback.
Verified as described with 8.0.21/22 builds.

regards,
Umesh