Bug #110880 XA COMMIT at before_commit return fail breaks group replication recovery
Submitted: 2 May 2023 3:50
Modified: 5 May 2023 5:35
Reporter: Zhejun Cai
Status: Verified
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: 8.0.32
OS: Any
Assigned to:
CPU Architecture: Any

[2 May 2023 3:50] Zhejun Cai
Description:
2023-04-27T07:54:38.656853Z 60 [ERROR] [MY-010584] [Repl] Slave SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction 'bbbbbbbb-bbbb-bbbb-bbbb-bbbbbbbbbbbb:1000003' at master log binlog.000002, end_log_pos 164998359; Error 'XAER_NOTA: Unknown XID' on query. Default database: ''. Query: 'XA COMMIT X'416e74646258696431373630333637373939373732393038323838',X'',1', Error_code: MY-001397
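
Note for readers matching this log against XA RECOVER output: the XID in the failing query is hex-encoded as X'gtrid',X'bqual',formatID. A minimal Python sketch for decoding it follows; the helper name decode_xa_xid is hypothetical and not part of any MySQL tool, and it assumes the XID bytes are printable text.

```python
import re


def decode_xa_xid(query: str) -> dict:
    """Decode the hex-encoded XID parts of a binlogged XA statement.

    Hypothetical helper for reading error-log lines; assumes the form
    X'<gtrid hex>',X'<bqual hex>',<formatID> as it appears in the log.
    """
    m = re.search(r"X'([0-9A-Fa-f]*)',X'([0-9A-Fa-f]*)',(\d+)", query)
    if m is None:
        raise ValueError("no hex-encoded XID found in query")
    gtrid_hex, bqual_hex, format_id = m.groups()
    return {
        # errors="replace" keeps the helper usable for non-text XIDs
        "gtrid": bytes.fromhex(gtrid_hex).decode("utf-8", errors="replace"),
        "bqual": bytes.fromhex(bqual_hex).decode("utf-8", errors="replace"),
        "format_id": int(format_id),
    }


if __name__ == "__main__":
    # Query text taken from the mtr test-case log in this report.
    print(decode_xa_xid("XA COMMIT X'78696431',X'',1"))
    # prints {'gtrid': 'xid1', 'bqual': '', 'format_id': 1}
```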

How to repeat:
Here is the scenario, with the group running in single-primary mode:
group_replication_single_primary_mode=TRUE

(1) On the primary, XA COMMIT fails in the before_commit hook. The server then calls trx_coordinator::rollback_in_engines(thd, all), which rolls the transaction back in the storage engine. However, the XA BEGIN, XA END and XA PREPARE events have already been written to the binlog and replicated to the other members.

(2) Kill the primary node so that it leaves the cluster.

(3) One of the replicas becomes the new primary and executes XA COMMIT successfully, which logs the XA COMMIT event in its binlog.

(4) Restart the previous primary and let it rejoin the cluster. It replicates the XA COMMIT event from the new primary and, when applying it, reports: Error 'XAER_NOTA: Unknown XID' on query.
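
The state divergence in this scenario can be illustrated with a toy model. This is only a sketch of the reasoning, not MySQL's implementation: the Node class, the XaerNota exception and the failed_commit_on_primary function are hypothetical names, and the model tracks nothing except which XIDs each server holds in the prepared state.

```python
class XaerNota(Exception):
    """Stands in for MySQL error 1397, 'XAER_NOTA: Unknown XID'."""


class Node:
    """Toy server: remembers prepared XIDs and the events it has logged."""

    def __init__(self, name):
        self.name = name
        self.prepared = set()  # XIDs currently prepared in the "engine"
        self.binlog = []       # events this node has applied and logged

    def apply(self, event):
        kind, xid = event
        if kind == "XA PREPARE":
            self.prepared.add(xid)
        elif kind == "XA COMMIT":
            if xid not in self.prepared:
                raise XaerNota(f"{self.name}: XAER_NOTA: Unknown XID {xid!r}")
            self.prepared.remove(xid)
        self.binlog.append(event)


def failed_commit_on_primary(node, xid):
    """Model of the reported failure: the commit hook fails, the engine
    rolls the prepared transaction back, and no XA COMMIT is logged."""
    node.prepared.discard(xid)


server1, server2 = Node("server1"), Node("server2")

# XA PREPARE was logged on the primary and replicated to the other member.
for node in (server1, server2):
    node.apply(("XA PREPARE", "xid1"))

# XA COMMIT fails on the primary (server1), which is then killed.
failed_commit_on_primary(server1, "xid1")

# server2 becomes primary and commits the still-prepared transaction.
server2.apply(("XA COMMIT", "xid1"))

# server1 rejoins and replays the event it is missing.
try:
    server1.apply(server2.binlog[-1])
    outcome = "applied"
except XaerNota as exc:
    outcome = str(exc)

print(outcome)
```

In this model the old primary no longer knows the XID when the XA COMMIT arrives, which mirrors the error seen on the group_replication_recovery channel.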

Suggested fix:

Is this intended behavior or a bug?
Are there any workarounds to recover from it conveniently? I did not find any documentation that describes a solution.

Should the rollback in the storage engine be skipped when XA COMMIT fails at before_commit?
[2 May 2023 13:48] MySQL Verification Team
Hi Mr. Cai,

Thank you for your bug report.

However, no test case was provided.

We need the entire test case, so that we can simply run it and repeat the problem. Hence, we need all the tables, their contents and all the commands that you issued, in the correct order.

Also, please let us know whether the problem repeats on a standalone server or only in InnoDB Cluster. If it repeats in the Cluster, we need the detailed setup of your cluster and exact step-by-step instructions for making the bug surface.

Please also note that our current release is 8.0.33.

We are waiting on your full feedback.
[4 May 2023 6:58] Zhejun Cai
test case

Attachment: gr_xa_commit_failure_before_commit_hook.test (application/octet-stream, text), 4.52 KiB.

[4 May 2023 6:58] Zhejun Cai
configure for the test case

Attachment: gr_xa_commit_failure_before_commit_hook.cnf (application/octet-stream, text), 130 bytes.

[4 May 2023 7:01] Zhejun Cai
Hi,
I wrote and uploaded a test case for this issue, targeting mysql-8.0.33:
(1) Build mysql-server with the option WITH_DEBUG=1.
(2) Run the test case gr_xa_commit_failure_before_commit_hook.test.

Here is the output:
./mtr group_replication.gr_xa_commit_failure_before_commit_hook
Logging: ./mtr  group_replication.gr_xa_commit_failure_before_commit_hook
MySQL Version 8.0.33
Checking supported features
 - Binaries are debug compiled
Using 'all' suites
Collecting tests
Removing old var directory

rpl error summary: SERVER_1:(WORKERS:(CHANNEL:<group_replication_recovery> WORKER:1 ERROR:<Worker 1 failed executing transaction 'b7e0d0b3-ea45-11ed-9fd7-080027f4a265:8' at source log server-binary-log.000001, end_log_pos 2580; Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA COMMIT X'78696431',X'',1'>) COORDINATORS:(CHANNEL:<group_replication_recovery> ERROR:<Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'b7e0d0b3-ea45-11ed-9fd7-080027f4a265:8' at source log server-binary-log.000001, end_log_pos 2580. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, if any.>))
[4 May 2023 12:04] MySQL Verification Team
Hi Mr. Cai,

Please confirm whether the problem also repeats on a stand-alone server.

From your description, it appears to repeat only with InnoDB Cluster.

Please confirm that.
[4 May 2023 12:12] MySQL Verification Team
Hi Mr. Cai,

We have another problem with your test case.

You are executing XA commands on all of the nodes, but you are not using any XA manager.

Can you explain this?
[4 May 2023 12:19] MySQL Verification Team
Hi,

This is an InnoDB Cluster / Group Replication issue with XA transactions. It has nothing to do with MySQL Cluster (ndbcluster).
[5 May 2023 3:18] Zhejun Cai
Hi,
According to my understanding of the mysql-server code, a standalone server does not have this problem. The test case file describes the deployment topology:
# Pre-conditions:
# PC1. GR single-primary topology with 3 servers.

To be precise, the test case does not execute XA commands on all of the nodes. It executes XA commands on the primary node only; the replica nodes execute them through the group_replication applier.

I agree that this is an InnoDB Cluster / Group Replication issue with XA transactions.
[5 May 2023 5:35] MySQL Verification Team
Hi,

Thank you for the test, verified as described.

...
2023-05-05 08:28:34.191020      32      Error   MY-010584       Repl    Replica SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction '98cf9113-eb05-11ed-9d6c-000c29c354f6:8' at source log server-binary-log.000001, end_log_pos 2576; Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA COMMIT X'78696431',X'',1', Error_code: MY-001397
...
2023-05-05 08:28:35.103191      38      Error   MY-010584       Repl    Replica SQL for channel 'group_replication_recovery': Worker 1 failed executing transaction '98cf9113-eb05-11ed-9d6c-000c29c354f6:8' at source log server-binary-log.000001, end_log_pos 2580; Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA COMMIT X'78696431',X'',1', Error_code: MY-001397
...

LAST_ERROR_NUMBER       1397
LAST_ERROR_MESSAGE      Worker 1 failed executing transaction '98cf9113-eb05-11ed-9d6c-000c29c354f6:8' at source log server-binary-log.000001, end_log_pos 2580; Error 'XAER_NOTA: Unknown XID' on query. Default database: 'test'. Query: 'XA COMMIT X'78696431',X'',1'