Bug #100163 xa commit failed when stop group_replication will lead node error
Submitted: 9 Jul 2020 0:56
Modified: 12 Jan 2022 21:40
Reporter: phoenix Zhang (OCA)
Status: Closed
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: 8.0.18
OS: Any
Assigned to:
CPU Architecture: Any
Tags: xa group_replication

[9 Jul 2020 0:56] phoenix Zhang
Description:
In a Group Replication cluster, an XA COMMIT may fail while the node is leaving the cluster. When that happens, the prepared transaction is rolled back in the storage engine and its XID state is cleared, but nothing is written to the binlog.

Then, when the node rejoins the cluster, its data no longer matches the other nodes in the cluster. The node goes into the ERROR state of group_replication and leaves the cluster.
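
The sequence described above can be illustrated with a toy model. This is only a sketch and not MySQL code: `ToyNode`, `UnknownXid`, `xa_prepare`, `xa_commit`, the `before_commit_ok` flag, and `reproduce` are all hypothetical names made up for illustration. It assumes that a failed commit drops the prepared transaction without logging anything, as reported in this bug.

```python
class UnknownXid(Exception):
    """Stands in for the XAER_NOTA (Unknown XID) error."""


class ToyNode:
    """Hypothetical in-memory stand-in for one group member."""

    def __init__(self, name):
        self.name = name
        self.rows = {1: 1}   # table t1 already holds (1,1)
        self.prepared = {}   # xid -> rows waiting for XA COMMIT
        self.binlog = []     # events written by this node

    def xa_prepare(self, xid, rows):
        self.prepared[xid] = dict(rows)
        self.binlog.append(f"XA PREPARE {xid}")

    def xa_commit(self, xid, before_commit_ok=True):
        if xid not in self.prepared:
            raise UnknownXid(xid)
        pending = self.prepared.pop(xid)  # the XID state is cleared either way
        if not before_commit_ok:
            # Assumed buggy path: rolled back in the engine, nothing logged.
            return False
        self.rows.update(pending)
        self.binlog.append(f"XA COMMIT {xid}")
        return True


def reproduce():
    node1, node2 = ToyNode("node1"), ToyNode("node2")
    # XA PREPARE '1' on node1 also reaches node2 through the group.
    for node in (node1, node2):
        node.xa_prepare("1", {2: 2})
    # XA COMMIT on node1 races with STOP GROUP_REPLICATION and fails.
    committed = node1.xa_commit("1", before_commit_ok=False)
    # A client commits the still-prepared XID on node2.
    node2.xa_commit("1")
    # After rejoining, node1 has to apply node2's XA COMMIT.
    try:
        node1.xa_commit("1")
        applier_error = None
    except UnknownXid:
        applier_error = "XAER_NOTA"
    return node1, node2, committed, applier_error
```

In this model, `reproduce()` leaves node1 without row (2,2) and without the prepared XID, so applying the group's XA COMMIT on node1 fails.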

How to repeat:
I have a little patch, shown below:

diff --git a/sql/xa.cc b/sql/xa.cc
index bd6bb19cf58..70a5dfc863f 100644
--- a/sql/xa.cc
+++ b/sql/xa.cc
@@ -677,6 +677,9 @@ bool Sql_cmd_xa_commit::process_external_xa_commit(THD *thd,
 bool Sql_cmd_xa_commit::process_internal_xa_commit(THD *thd,
                                                    XID_STATE *xid_state) {
   DBUG_TRACE;
+  if (DBUG_EVALUATE_IF("xa_commit_sleep", true, false)) {
+    sleep(4);
+  }
   bool res = false;
   bool gtid_error = false, need_clear_owned_gtid = false;

Compile the source code in debug mode, then run the test case in the attached file.
[9 Jul 2020 0:57] phoenix Zhang
Test file; it needs to be moved to mysql-test/suite/group_replication/t

Attachment: gr_xa_commit_failed.test (application/octet-stream, text), 1.42 KiB.

[9 Jul 2020 0:59] phoenix Zhang
Run the test with this command:

./mtr group_replication.gr_xa_commit_failed --nocheck-testcase

The resulting output is:

include/group_replication.inc [rpl_server_count=2]
Warnings:
Note	####	Sending passwords in plain text without SSL/TLS is extremely insecure.
Note	####	Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
[connection server1]
[connect conn1]
SELECT * from performance_schema.replication_group_members;
CHANNEL_NAME	MEMBER_ID	MEMBER_HOST	MEMBER_PORT	MEMBER_STATE	MEMBER_ROLE	MEMBER_VERSION
group_replication_applier	817bd0ac-c17d-11ea-8a29-c8f7507e5048	127.0.0.1	13001	ONLINE	PRIMARY	8.0.18
group_replication_applier	817c2845-c17d-11ea-9f93-c8f7507e5048	127.0.0.1	13000	ONLINE	PRIMARY	8.0.18
CREATE TABLE t1 (c1 INT NOT NULL PRIMARY KEY, c2 INT);
INSERT INTO t1 VALUES (1,1);
include/rpl_sync.inc
[connect conn1_1]
FLUSH LOGS;
XA START '1';
INSERT INTO t1 VALUES (2,2);
XA END '1';
XA PREPARE '1';
SET SESSION DEBUG='+d,xa_commit_sleep';
XA COMMIT '1';
[connect conn1_2]
STOP GROUP_REPLICATION;
[connect conn1_1]
ERROR HY000: Error on observer while running replication hook 'before_commit'.
SET SESSION DEBUG='-d,xa_commit_failed';
XA RECOVER;
formatID	gtrid_length	bqual_length	data
SELECT * FROM t1;
c1	c2
1	1
SHOW BINLOG EVENTS IN 'server-binary-log.000002';
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
server-binary-log.000002	4	Format_desc	1	124	Server ver: 8.0.18-9-debug, Binlog ver: 4
server-binary-log.000002	124	Previous_gtids	1	191	8193a9b0-c17d-11ea-9f93-c8f7507e5048:1-4
server-binary-log.000002	191	Gtid	1	273	SET @@SESSION.GTID_NEXT= '8193a9b0-c17d-11ea-9f93-c8f7507e5048:5'
server-binary-log.000002	273	Query	1	364	XA START X'31',X'',1
server-binary-log.000002	364	Table_map	1	409	table_id: 127 (test.t1)
server-binary-log.000002	409	Write_rows	1	449	table_id: 127 flags: STMT_END_F
server-binary-log.000002	449	Query	1	538	XA END X'31',X'',1
server-binary-log.000002	538	XA_prepare	1	571	XA PREPARE X'31',X'',1
[connect conn1]
START GROUP_REPLICATION;
[connect conn2]
XA RECOVER;
formatID	gtrid_length	bqual_length	data
1	1	0	1
SELECT * FROM t1;
c1	c2
1	1
XA COMMIT '1';
XA RECOVER;
formatID	gtrid_length	bqual_length	data
SELECT * FROM t1;
c1	c2
1	1
2	2
SELECT * from performance_schema.replication_group_members;
CHANNEL_NAME	MEMBER_ID	MEMBER_HOST	MEMBER_PORT	MEMBER_STATE	MEMBER_ROLE	MEMBER_VERSION
group_replication_applier	817bd0ac-c17d-11ea-8a29-c8f7507e5048	127.0.0.1	13001	ONLINE	PRIMARY	8.0.18

From the result, we can see that node1 has left the group. The error log shows:
2020-07-09T00:45:51.444850Z 29 [ERROR] [MY-011599] [Repl] Plugin group_replication reported: 'Transaction cannot be executed while Group Replication is stopping.'
2020-07-09T00:45:51.444872Z 29 [ERROR] [MY-010207] [Repl] Run function 'before_commit' in plugin 'group_replication' failed
2020-07-09T00:45:56.204536Z 36 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_applier': Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA COMMIT X'31',X'',1', Error_code: MY-001397
2020-07-09T00:45:56.204708Z 36 [Warning] [MY-010584] [Repl] Slave: XAER_NOTA: Unknown XID Error_code: MY-001397
2020-07-09T00:45:56.204769Z 36 [ERROR] [MY-011451] [Repl] Plugin group_replication reported: 'The applier thread execution was aborted. Unable to process more transactions, this member will now leave the group.'
2020-07-09T00:45:56.204973Z 33 [ERROR] [MY-011452] [Repl] Plugin group_replication reported: 'Fatal error during execution on the Applier process of Group Replication. The server will now leave the group.'
2020-07-09T00:45:56.205228Z 33 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
2020-07-09T00:45:56.205566Z 40 [ERROR] [MY-011625] [Repl] Plugin group_replication reported: 'Unable to ensure the execution of group transactions received during recovery.'
2020-07-09T00:45:56.205648Z 40 [ERROR] [MY-011620] [Repl] Plugin group_replication reported: 'Fatal error during the incremental recovery process of Group Replication. The server will leave the group.'
2020-07-09T00:45:56.205754Z 40 [Warning] [MY-011645] [Repl] Plugin group_replication reported: 'Skipping leave operation: concurrent attempt to leave the group is on-going.'
2020-07-09T00:45:56.205818Z 40 [ERROR] [MY-011712] [Repl] Plugin group_replication reported: 'The server was automatically set into read only mode after an error was detected.'
SET @@GLOBAL.super_read_only = @original_super_read_only;
^ Found warnings in /home/phoenix/gitlab/myrocks/DEBUG/mysql-test/var/log/mysqld.1.err
ok
[9 Jul 2020 12:19] MySQL Verification Team
Hello phoenix Zhang!

Thank you for the report.

regards,
Umesh
[12 Jan 2022 21:40] Jon Stephens
Fixed in MySQL 8.0.29 by WL#14700. See same for info.

Closed.