Bug #100299 secondly role cannot join to group_replication after fail-over
Submitted: 23 Jul 2020 2:57 Modified: 24 Feb 2022 11:12
Reporter: phoenix Zhang (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S2 (Serious)
Version:8.0.18, 8.0.21 OS:Any
Assigned to: CPU Architecture:Any
Tags: group_replication

[23 Jul 2020 2:57] phoenix Zhang
Description:
For a 3-nodes group_replication cluster, with single-primary-mode=ON, and consistency-level=BEFORE_AND_AFTER

If n1 is the PRIMARY role, in one moment, there maybe:

1. n3 already write relay-log, while still not finish the transaction
1. n2 not write relay-log yet
1. n1 will hang cause no others nodes return message

Then, n3 shutdown first, after a while n1,n2 shutdown also.

Then we restart n1,n2, cause it restart major nodes, and contain the latest PRIMARY role, we can rebuild the group_replication with n1, n2.

After a while, n3 restart, and do START GROUP_REPLICATION, it will failed. Cause it has extra transaction not in n1/n2. Now, n3 cannot join to cluster again.

How to repeat:
I add some dbug code, and a test file.

n2 will hang in write_realy_log, and n3 will hang when execute statement, until they killed.

Suggested fix:
The reason is, when start group_replication, it will first execute transactions in relaylog. Which may lead one have more transaction then others. (This also happen in single-primary-mode=off).

Maybe it should reset the relay-log when a node rejoin to cluster.
[23 Jul 2020 2:57] phoenix Zhang
the test file patch

Attachment: gr_node_error_after_shutdown_cause_relaylog.diff (text/x-patch), 8.04 KiB.

[23 Jul 2020 2:59] phoenix Zhang
after compile, run the test with command:

./mtr group_replication.gr_create_db_shutdown --nocheck-testcase

The result will be:

include/group_replication.inc [rpl_server_count=3]
Warnings:
Note	####	Sending passwords in plain text without SSL/TLS is extremely insecure.
Note	####	Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[connection server1]
connect conn1
SET PERSIST group_replication_local_address=@@group_replication_local_address, PERSIST group_replication_group_seeds=@@group_replication_group_seeds;
SELECT group_replication_set_as_primary(@@server_uuid);
group_replication_set_as_primary(@@server_uuid)
Primary server switched to: a38fe7ed-cc8d-11ea-b011-c8f7507e5048
SELECT * FROM performance_schema.replication_group_members;
CHANNEL_NAME	MEMBER_ID	MEMBER_HOST	MEMBER_PORT	MEMBER_STATE	MEMBER_ROLE	MEMBER_VERSION
group_replication_applier	a38fe7ed-cc8d-11ea-b011-c8f7507e5048	127.0.0.1	13000	ONLINE	PRIMARY	8.0.18
group_replication_applier	a396eb5b-cc8d-11ea-ae31-c8f7507e5048	127.0.0.1	13002	ONLINE	SECONDARY	8.0.18
group_replication_applier	a3975f15-cc8d-11ea-9090-c8f7507e5048	127.0.0.1	13001	ONLINE	SECONDARY	8.0.18
CREATE TABLE t1 (c1 INT PRIMARY KEY, c2 INT) ENGINE=InnoDB;
SHOW TABLES;
Tables_in_test
t1
connect conn2
SET PERSIST group_replication_local_address=@@group_replication_local_address, PERSIST group_replication_group_seeds=@@group_replication_group_seeds;
SHOW TABLES;
Tables_in_test
t1
connect conn3
SET PERSIST group_replication_local_address=@@group_replication_local_address, PERSIST group_replication_group_seeds=@@group_replication_group_seeds;
SHOW TABLES;
Tables_in_test
t1
## n2 will not write relaylog
connect conn2
SET GLOBAL DEBUG='+d,wait_write_realylog_until_server_down';
## n3 will write relaylog, while not finish transaction
connect conn3
SET GLOBAL DEBUG='+d,wait_until_server_down';
connect conn1_2
INSERT INTO t1 VALUES (1,1),(2,2);;
## sleep 10 to ensure n3 already write relaylog
## then shutdown n3 first
connect conn3
[shutdown n3]
## wait n3 leave the group_replication
## then shutdown n2 later
connection conn2
SELECT * FROM performance_schema.replication_group_members;
CHANNEL_NAME	MEMBER_ID	MEMBER_HOST	MEMBER_PORT	MEMBER_STATE	MEMBER_ROLE	MEMBER_VERSION
group_replication_applier	a38fe7ed-cc8d-11ea-b011-c8f7507e5048	127.0.0.1	13000	ONLINE	PRIMARY	8.0.18
group_replication_applier	a396eb5b-cc8d-11ea-ae31-c8f7507e5048	127.0.0.1	13002	UNREACHABLE	SECONDARY	8.0.18
group_replication_applier	a3975f15-cc8d-11ea-9090-c8f7507e5048	127.0.0.1	13001	ONLINE	SECONDARY	8.0.18
[shutdown n2]
connection conn1
## and then shutdown n1
[shutdown n1]
[start n3]
connect conn3
# restart
SELECT * FROM t1;
c1	c2
SELECT @@gtid_executed;
@@gtid_executed
a3aea551-cc8d-11ea-b011-c8f7507e5048:1-4
[start n2]
connect conn2
# restart
SELECT * FROM t1;
c1	c2
SELECT @@gtid_executed;
@@gtid_executed
a3aea551-cc8d-11ea-b011-c8f7507e5048:1-4
[start n1]
connect conn1
# restart
SELECT * FROM t1;
c1	c2
SELECT @@gtid_executed;
@@gtid_executed
a3aea551-cc8d-11ea-b011-c8f7507e5048:1-4
connect conn1
set global group_replication_bootstrap_group=on;
start group_replication;
set global group_replication_bootstrap_group=oFF;
SELECT * FROM t1;
c1	c2
INSERT INTO t1 VALUES (11,11),(22,22);
connect conn2
start group_replication;
SELECT * FROM performance_schema.replication_group_members;
CHANNEL_NAME	MEMBER_ID	MEMBER_HOST	MEMBER_PORT	MEMBER_STATE	MEMBER_ROLE	MEMBER_VERSION
group_replication_applier	a38fe7ed-cc8d-11ea-b011-c8f7507e5048	127.0.0.1	13000	ONLINE	PRIMARY	8.0.18
group_replication_applier	a3975f15-cc8d-11ea-9090-c8f7507e5048	127.0.0.1	13001	ONLINE	SECONDARY	8.0.18
SELECT * FROM t1;
c1	c2
11	11
22	22
connect conn3
SELECT * FROM t1;
c1	c2
start group_replication;
ERROR HY000: The server is not configured properly to be an active member of the group. Please see more details on error log.
SELECT * FROM performance_schema.replication_group_members;
CHANNEL_NAME	MEMBER_ID	MEMBER_HOST	MEMBER_PORT	MEMBER_STATE	MEMBER_ROLE	MEMBER_VERSION
group_replication_applier	a396eb5b-cc8d-11ea-ae31-c8f7507e5048	127.0.0.1	13002	OFFLINE		
SELECT * FROM t1;
c1	c2
1	1
2	2

From the result, it can find, n3 cannot join to cluster, and it will have (1,1),(2,2) rows not in the cluster
[23 Jul 2020 10:57] MySQL Verification Team
Hello phoenix Zhang!

Thank you for the report.

regards,
Umesh
[23 Jul 2020 10:57] MySQL Verification Team
MySQL Server 8.0.21 test results

Attachment: 100299_8.0.21.results.txt (text/plain), 5.35 KiB.

[24 Feb 2022 11:12] Margaret Fisher
Posted by developer:
 
Thanks for the information and apologies for the delay! Please find the new topic at
https://dev.mysql.com/doc/refman/8.0/en/group-replication-restarting-group.html