Bug #103641 After the mgr primary node was killed, the data is inconsistent between nodes
Submitted: 10 May 2021 14:50 Modified: 3 Jul 2021 12:57
Reporter: Ye Jinrong Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Group Replication Severity:S1 (Critical)
Version:8.0.22 OS:Any
Assigned to: MySQL Verification Team CPU Architecture:Any

[10 May 2021 14:50] Ye Jinrong
Description:
A 4 nodes MGR cluster run with single-master mode.
node1: secondary node
node2: secondary node
node3: primary node
node4: secondary node

During the benchmark test, execute the kill -9 mysqld process on the primary node and any secondary node.

At this time, the MGR cluster cannot write new data. The two nodes that have been killed need to be started, and the innodb crash recovery should be completed first. 

Check the transaction GTID progress of each MGR node with the SQL command:
```
select RECEIVED_TRANSACTION_SET from performance_schema.replication_connection_status union all select variable_value from performance_schema.global_variables where variable_name = 'gtid_executed';
```

node1:
```
| gtid_executed | f195537d-19ac-11eb-b29f-5254002eb6d6:1-301092789 |
| RECEIVED_TRANSACTION_SET | f195537d-19ac-11eb-b29f-5254002eb6d6:1-301092789 |
```

node2:(had been killed by kill -9 mysqld pid)
```
| gtid_executed | f195537d-19ac-11eb-b29f-5254002eb6d6:1-301092789 |
| RECEIVED_TRANSACTION_SET | f195537d-19ac-11eb-b29f-5254002eb6d6:1-300917674:300917679-301092789 |
```

node3:(original primary node, had been killed by kill -9 mysqld pid)
```
| gtid_executed | f195537d-19ac-11eb-b29f-5254002eb6d6:1-301092699:301092701-301092723:301092725-301092743:301092746:301092748-301092749 |
| RECEIVED_TRANSACTION_SET | f195537d-19ac-11eb-b29f-5254002eb6d6:1-300917672:300917678-300917680 |
```

node4:
| gtid_executed | f195537d-19ac-11eb-b29f-5254002eb6d6:1-301092789 |
| RECEIVED_TRANSACTION_SET | f195537d-19ac-11eb-b29f-5254002eb6d6:1-301092789 |
```

We choose node1 or node4 as the new primary node and start the mgr cluster again.

After node3 (the original primary node) completes mgr recovery, it is found that its data is inconsistent with other nodes.

The gtid range of the node3 node that needs to restore data is 301092749-301092789, but checking the binlog found that some of the transactions in it are updating data on other nodes, but the same gtid transaction is executed during the node3 recovery period is the view change event, so it leads to Inconsistent data, for example: 

one gtid event on node4
```
19679267 SET @@SESSION.GTID_NEXT= 'f195537d-19ac-11eb-b29f-5254002eb6d6:304451797'/*!*/;
19679268 # at 511622795
19679269 #210430 16:22:25 server id 3410  end_log_pos 511622867 CRC32 0x4c32e87d         Query   thread_id=3123  exec_time=019679269      error_code=0
19679270 SET TIMESTAMP=1619770945/*!*/;
19679271 BEGIN
19679272 /*!*/;
19679273 # at 511622867
19679274 #210430 16:22:25 server id 3410  end_log_pos 511622935 CRC32 0x9c37af89         Table_map: `sbtest`.`sbtest10` mapp19679274 ed to number 98
19679275 # at 511622935
19679276 #210430 16:22:25 server id 3410  end_log_pos 511623351 CRC32 0xfa505337         Update_rows: table id 98 flags: STM19679276 T_END_F
19679277 ### UPDATE `sbtest`.`sbtest10`
19679278 ### WHERE
19679279 ###   @1=9136
```

the same gtid was view change event on node3(found in the recovery binlog)
```
SET @@SESSION.GTID_NEXT= 'f195537d-19ac-11eb-b29f-5254002eb6d6:304451797'/*!*/;
# at 1097
#210430 16:26:22 server id 3410  end_log_pos 1163 CRC32 0x85b57863      Query   thread_id=14    exec_time=19    error_code=0
SET TIMESTAMP=1619771182/*!*/;
BEGIN
/*!*/;
# at 1163
#210430 16:26:22 server id 3410  end_log_pos 1514 CRC32 0x48a5a849 View_change_log_event: view_id=16197711839237518:2
# at 1514
#210430 16:26:22 server id 3410  end_log_pos 1586 CRC32 0xf7d30016      Query   thread_id=14    exec_time=19    error_code=0
SET TIMESTAMP=1619771182/*!*/;
COMMIT
```

This leads to inconsistent data among several nodes after MGR recovery.

How to repeat:
1. Start a 4 nodes MGR cluster, assuming node3 is the primary node, and the rest are secondary nodes

2. Start sysbench pressure test

3. After about 5 minutes, execute kill -9 on the nodes of node3 and node2 (first kill the primary node, then kill the secondary node, and the speed should be fast enough)

4. At this time, the mgr cluster cannot write data, and perform ```stop group_replication``` on node1 and node4

5. Start node2 and node3 and execute them on the 4 nodes:
```
select RECEIVED_TRANSACTION_SET from performance_schema.replication_connection_status union all select variable_value from performance_schema.global_variables where variable_name ='gtid_executed';
```

6. Confirm which node has more gtid and choose it as the new primary node. Generally, node1 and node4 are the same, you can choose one as the new primary node

7. Restart the mgr cluster, start node2 and node3, and execute ```start group_replication```, at this time node2 and node3 perform mgr recovery

8. When node3 completes mgr recovery, it will find that its data is inconsistent with node4

9. Check and confirm the binlog generated during mgr recovery on node3, and you will find that the event of the same gtid is a DML operation on node4, while it is a view change event on node3, which leads to inconsistent data 

Suggested fix:
When performing mgr recovery, it is necessary to determine whether the transaction corresponding to the gtid of each node is the same transaction, and the view change event cannot occupy the normal transaction gtid on other nodes.
[13 May 2021 12:21] MySQL Verification Team
Hi,

Trying to reproduce this with 8.0.25 and I can't. 
Did you try to reproduce this with 8.0.25?

Thanks
Bogdan
[14 May 2021 6:16] Ye Jinrong
I will retry with MySQL 8.0.25 as soon as possiable.
[24 May 2021 3:11] Ye Jinrong
Hi Bogdan,

I can reproduce this problem with MySQL 8.0.25.
[27 May 2021 0:30] MySQL Verification Team
Hi,

Thanks for re-testing. I did not manage to reproduce with .25, will have to retry again.

all best
Bogdan
[3 Jun 2021 12:57] MySQL Verification Team
Hi,

I cannot reproduce this with 8.0.25, is there a way you can try to make a test case that can reproduce it ( https://dev.mysql.com/doc/dev/mysql-server/latest/PAGE_WRITING_TESTCASES.html ), I did everything you wrote and I can't reproduce it.

thanks
[4 Jul 2021 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".