Bug #98151 group replication with wrong member_state after server shutdown
Submitted: 8 Jan 2020 1:03
Modified: 12 Sep 2023 11:14
Reporter: phoenix Zhang (OCA)
Status: Can't repeat
Category: MySQL Server: Group Replication
Severity: S3 (Non-critical)
Version: 8.0.18
OS: Any
CPU Architecture: Any
Tags: group replication

[8 Jan 2020 1:03] phoenix Zhang
Description:
In a group replication cluster, when a secondary member fails abnormally, its member state becomes UNREACHABLE. But after a long while it becomes ONLINE again, even though the instance is still down.

How to repeat:
Step 1: First, initialize a group replication cluster with 3 instances.

phoenix@phoenix-Latitude-5491:~/gitlab/myrocks$ mysql -uroot -P13000 -h127.0.0.1 test

mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
SET global group_replication_group_seeds='127.0.0.1:33061,127.0.0.1:33062,127.0.0.1:33063';
SET global group_replication_local_address='127.0.0.1:33061';
set global group_replication_group_name=BIN_TO_UUID(CAST('1' as BINARY(16)));
set global group_replication_single_primary_mode=ON;
set GLOBAL group_replication_bootstrap_group=ON; 
CHANGE MASTER TO MASTER_USER='root', MASTER_PASSWORD='' FOR CHANNEL 'group_replication_recovery';  
reset master;start group_replication;
Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected, 1 warning (0.04 sec)

Query OK, 0 rows affected (0.02 sec)

Query OK, 0 rows affected (3.19 sec)

mysql> quit
Bye

phoenix@phoenix-Latitude-5491:~/gitlab/myrocks$ mysql -uroot -P13001 -h127.0.0.1 test

mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
SET global group_replication_group_seeds='127.0.0.1:33061,127.0.0.1:33062,127.0.0.1:33063';
SET global group_replication_local_address='127.0.0.1:33062';
set global group_replication_group_name=BIN_TO_UUID(CAST('1' as BINARY(16)));
set global group_replication_single_primary_mode=ON;
set GLOBAL group_replication_bootstrap_group=OFF;
CHANGE MASTER TO MASTER_USER='root', MASTER_PASSWORD='' FOR CHANNEL 'group_replication_recovery'; 
reset master;start group_replication;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected, 1 warning (0.02 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (3.96 sec)

mysql> quit
Bye

phoenix@phoenix-Latitude-5491:~/gitlab/myrocks$ mysql -uroot -P13002 -h127.0.0.1 test

mysql> set global group_replication_enforce_update_everywhere_checks=OFF;
SET global group_replication_group_seeds='127.0.0.1:33061,127.0.0.1:33062,127.0.0.1:33063';
SET global group_replication_local_address='127.0.0.1:33063';
set global group_replication_group_name=BIN_TO_UUID(CAST('1' as BINARY(16)));
set global group_replication_single_primary_mode=ON;
set GLOBAL group_replication_bootstrap_group=OFF; 
CHANGE MASTER TO MASTER_USER='root', MASTER_PASSWORD='' FOR CHANNEL 'group_replication_recovery'; 
reset master;
start group_replication;
Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.01 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected, 1 warning (0.03 sec)

Query OK, 0 rows affected (0.00 sec)

Query OK, 0 rows affected (4.38 sec)

mysql> quit
Bye
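
As an aside, the group name expression used above evaluates to the same fixed UUID on every member, which is why all three instances join one group. A quick check from the shell (a sketch; it assumes the mysql client is on PATH, and the expected value follows from '1' being 0x31 zero-padded to 16 bytes):

mysql -uroot -P13000 -h127.0.0.1 -e "SELECT BIN_TO_UUID(CAST('1' AS BINARY(16)));"
# Expected output: 31000000-0000-0000-0000-000000000000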

Step 2: Now, check from port 13000: it has 3 members online.

phoenix@phoenix-Latitude-5491:~/gitlab/myrocks$ mysql -uroot -P13000 -h127.0.0.1 test

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | b731e320-3132-11ea-81b4-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.18         |
| group_replication_applier | b73ab886-3132-11ea-9dab-c8f7507e5048 | 127.0.0.1   |       13001 | ONLINE       | SECONDARY   | 8.0.18         |
| group_replication_applier | b747cde5-3132-11ea-a8ce-c8f7507e5048 | 127.0.0.1   |       13002 | ONLINE       | SECONDARY   | 8.0.18         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.01 sec)

Step 3: Then, use the kill -9 command to shut down the server on port 13002. Viewed from port 13000, the group eventually shows 2 members.
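
One way to script this step (a sketch, not from the original report; it assumes lsof is installed and exactly one mysqld listens on the port):

# Abruptly kill the mysqld listening on port 13002, so Group Replication
# sees a crash rather than a clean shutdown.
kill -9 "$(lsof -t -iTCP:13002 -sTCP:LISTEN)"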

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | b731e320-3132-11ea-81b4-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.18         |
| group_replication_applier | b73ab886-3132-11ea-9dab-c8f7507e5048 | 127.0.0.1   |       13001 | ONLINE       | SECONDARY   | 8.0.18         |
| group_replication_applier | b747cde5-3132-11ea-a8ce-c8f7507e5048 | 127.0.0.1   |       13002 | UNREACHABLE  | SECONDARY   | 8.0.18         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
3 rows in set (0.01 sec)

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | b731e320-3132-11ea-81b4-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.18         |
| group_replication_applier | b73ab886-3132-11ea-9dab-c8f7507e5048 | 127.0.0.1   |       13001 | ONLINE       | SECONDARY   | 8.0.18         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)

Step 4: Then use kill -9 on the server with port 13001. From port 13000, the state becomes the following.

mysql> select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | b731e320-3132-11ea-81b4-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.18         |
| group_replication_applier | b73ab886-3132-11ea-9dab-c8f7507e5048 | 127.0.0.1   |       13001 | UNREACHABLE  | SECONDARY   | 8.0.18         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.00 sec)

Step 5: That state stays stable for a long time. But I used a quite simple shell script to monitor the state from 13000, and after a long while (in my test, it took about 5.5 hours) the state of 13001 became ONLINE, while that server was still shut down.
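
The original monitor script is not attached; a minimal sketch of one (connection options are assumptions, matching the sessions above):

# Poll member states from the primary once a minute, with a timestamp,
# until interrupted.
while true; do
    echo "=== $(date) ==="
    mysql -uroot -P13000 -h127.0.0.1 -e \
        "SELECT MEMBER_PORT, MEMBER_STATE FROM performance_schema.replication_group_members;"
    sleep 60
done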

phoenix@phoenix-Latitude-5491:~/gitlab/myrocks$ mysql -uroot -P13000 -h127.0.0.1 test

mysql> select * from performance_schema.replication_group_members;

+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE | MEMBER_ROLE | MEMBER_VERSION |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
| group_replication_applier | b731e320-3132-11ea-81b4-c8f7507e5048 | 127.0.0.1   |       13000 | ONLINE       | PRIMARY     | 8.0.18         |
| group_replication_applier | b73ab886-3132-11ea-9dab-c8f7507e5048 | 127.0.0.1   |       13001 | ONLINE       | SECONDARY   | 8.0.18         |
+---------------------------+--------------------------------------+-------------+-------------+--------------+-------------+----------------+
2 rows in set (0.02 sec)

mysql> quit
Bye
phoenix@phoenix-Latitude-5491:~/gitlab/myrocks$ mysql -uroot -P13001 -h127.0.0.1 test
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1' (111)
[26 Feb 2020 5:17] MySQL Verification Team
Hi,

I am having issues reproducing this. Can you share your full configuration, please?

Thanks
Bogdan
[2 Mar 2020 0:55] phoenix Zhang
This is the config file for port 13000.

Attachment: my13000.cnf (application/octet-stream, text), 528 bytes.

[2 Mar 2020 0:56] phoenix Zhang
This is the config file for port 13001.

Attachment: my13001.cnf (application/octet-stream, text), 528 bytes.

[2 Mar 2020 0:56] phoenix Zhang
This is the config file for port 13002.

Attachment: my13002.cnf (application/octet-stream, text), 528 bytes.

[17 Mar 2020 17:14] MySQL Verification Team
Thanks for the report. Verified. I'm not 100% sure this is a bug; let's see what the GR team says, but I reproduced the behavior.

all best
Bogdan
[20 Aug 2020 6:02] Bin Wang
Obviously it is a bug caused by TCP self-connect.
[20 Aug 2020 6:18] Bin Wang
From netstat, we can see the following:
netstat -n|grep 33062
tcp        0      0 127.0.0.1:33062         127.0.0.1:33062         ESTABLISHED

This is what I called the "TCP self-connect".
[20 Aug 2020 6:35] Bin Wang
The TCP standard has a "simultaneous open" feature.

The implication of this feature is that a client trying to connect to a local port can, when that port is in the ephemeral range, end up connected to itself.

So the client thinks it is connected to the server, while it is actually connected to itself. On the other side, the server cannot open its server port, since it is occupied/stolen by the client.
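
The self-connect can be forced outside of MySQL for illustration (a sketch; it assumes a netcat that supports the -p source-port option and that nothing is listening on the port):

# Bind the client's source port to the very port it connects to; the SYN
# arrives back at our own socket in SYN_SENT state, TCP "simultaneous open"
# kicks in, and the connection establishes with no server at all.
nc -p 33062 127.0.0.1 33062 &
sleep 1
# The socket shows identical local and remote endpoints:
netstat -n | grep 33062
# tcp   0   0 127.0.0.1:33062   127.0.0.1:33062   ESTABLISHED

This matches the ESTABLISHED 127.0.0.1:33062 -> 127.0.0.1:33062 socket shown above: with all members on one host, a surviving member retrying the dead member's address can eventually self-connect this way, which would explain why the dead member reappears as ONLINE.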