Bug #92690 Group Replication split brain with faulty network
Submitted: 5 Oct 14:18  Modified: 19 Oct 12:59
Reporter: Przemyslaw Malkowski
Status: Verified
Category: MySQL Server: Group Replication  Severity: S2 (Serious)
Version: 5.7.23, 8.0.12  OS: Any
Assigned to:  CPU Architecture: Any

[5 Oct 14:18] Przemyslaw Malkowski
Description:
In an MGR cluster with group_replication_single_primary_mode, when the writer node suffers from a very unstable network, instead of just being cleanly expelled from the cluster, it may end up with binary logs that are completely inconsistent with the readers'.
For example, it is as bad as not only the writer and readers having applied different GTID sequence positions, but completely different transactions getting logged under the same GTID sequence number on the writer vs. the readers!

An example of the bad state reached - gr01 is the RW node, gr02/gr03 are RO nodes, with many GTID events completely different:

gr01 > show binlog events in 'binlog.000223' from 11590 limit 11;
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name      | Pos   | Event_type  | Server_id | End_log_pos | Info                                                                   |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| binlog.000223 | 11590 | Gtid        |        10 |       11651 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533' |
| binlog.000223 | 11651 | Query       |        10 |       11723 | BEGIN                                                                  |
| binlog.000223 | 11723 | Table_map   |        10 |       11775 | table_id: 125 (db1.sbtest6)                                            |
| binlog.000223 | 11775 | Update_rows |        10 |       12185 | table_id: 125 flags: STMT_END_F                                        |
| binlog.000223 | 12185 | Table_map   |        10 |       12237 | table_id: 116 (db1.sbtest5)                                            |
| binlog.000223 | 12237 | Update_rows |        10 |       12647 | table_id: 116 flags: STMT_END_F                                        |
| binlog.000223 | 12647 | Table_map   |        10 |       12699 | table_id: 118 (db1.sbtest1)                                            |
| binlog.000223 | 12699 | Delete_rows |        10 |       12919 | table_id: 118 flags: STMT_END_F                                        |
| binlog.000223 | 12919 | Table_map   |        10 |       12971 | table_id: 118 (db1.sbtest1)                                            |
| binlog.000223 | 12971 | Write_rows  |        10 |       13191 | table_id: 118 flags: STMT_END_F                                        |
| binlog.000223 | 13191 | Xid         |        10 |       13218 | COMMIT /* xid=6231928 */                                               |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
11 rows in set (0.00 sec)

gr02 > show binlog events in 'binlog.000221' from 9912 limit 11;
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| Log_name      | Pos   | Event_type  | Server_id | End_log_pos | Info                                                                   |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
| binlog.000221 |  9912 | Gtid        |        10 |        9973 | SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533' |
| binlog.000221 |  9973 | Query       |        10 |       10037 | BEGIN                                                                  |
| binlog.000221 | 10037 | Table_map   |        10 |       10089 | table_id: 108 (db1.sbtest5)                                            |
| binlog.000221 | 10089 | Update_rows |        10 |       10499 | table_id: 108 flags: STMT_END_F                                        |
| binlog.000221 | 10499 | Table_map   |        10 |       10551 | table_id: 110 (db1.sbtest4)                                            |
| binlog.000221 | 10551 | Update_rows |        10 |       10961 | table_id: 110 flags: STMT_END_F                                        |
| binlog.000221 | 10961 | Table_map   |        10 |       11014 | table_id: 109 (db1.sbtest10)                                           |
| binlog.000221 | 11014 | Delete_rows |        10 |       11234 | table_id: 109 flags: STMT_END_F                                        |
| binlog.000221 | 11234 | Table_map   |        10 |       11287 | table_id: 109 (db1.sbtest10)                                           |
| binlog.000221 | 11287 | Write_rows  |        10 |       11507 | table_id: 109 flags: STMT_END_F                                        |
| binlog.000221 | 11507 | Xid         |        10 |       11534 | COMMIT /* xid=1185380 */                                               |
+---------------+-------+-------------+-----------+-------------+------------------------------------------------------------------------+
11 rows in set (0.00 sec)

In the case above, the gr01 node was expelled due to network errors, but managed to end up with a higher executed position:

Variable_name: gtid_executed
        Value: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-295538

vs the two other nodes:

Variable_name: gtid_executed
        Value: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-295534

and table data mismatch:

gr01 > checksum table db1.sbtest9;
+-------------+------------+
| Table       | Checksum   |
+-------------+------------+
| db1.sbtest9 | 2682847957 |
+-------------+------------+
1 row in set (0.00 sec)

gr02 > checksum table db1.sbtest9;
+-------------+------------+
| Table       | Checksum   |
+-------------+------------+
| db1.sbtest9 | 3427632976 |
+-------------+------------+
1 row in set (0.01 sec)
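From the two gtid_executed values above, the writer's errant seqnos can be computed with a small sketch (a hypothetical helper, handling only single-UUID, single-range sets like the ones in this report):

```python
# Hypothetical sketch: given the two gtid_executed values above, compute the
# seqnos the expelled writer executed beyond the rest of the group.
def seq_range(gtid_set):
    """Parse 'uuid:first-last' into (first, last); single-UUID sets only."""
    _, rng = gtid_set.rsplit(":", 1)
    first, last = rng.split("-")
    return int(first), int(last)

writer  = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-295538"
readers = "aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-295534"

# Seqnos present on the writer but not on the readers.
extra = range(seq_range(readers)[1] + 1, seq_range(writer)[1] + 1)
print(list(extra))  # [295535, 295536, 295537, 295538]
```

Note that the divergence starts even earlier than this range suggests: seqno 295533 exists on all nodes but contains different transactions.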

How to repeat:
Set up a simple MGR cluster with basic settings, like:

server_id=30   # unique on each node
gtid_mode=ON
enforce_gtid_consistency=ON
master_info_repository=TABLE
relay_log_info_repository=TABLE
binlog_checksum=NONE
log_slave_updates=ON
log_bin=binlog
binlog_format=ROW
transaction_write_set_extraction=XXHASH64
loose-group_replication_group_name="aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa"
loose-group_replication_start_on_boot=off
loose-group_replication_local_address="192.168.56.99:24900"   # unique on each node
loose-group_replication_group_seeds="192.168.56.97:24900,192.168.56.98:24900,192.168.56.99:24900"
loose-group_replication_bootstrap_group=off

Once GR is up:
gr03 > select * from performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 1f912efe-c880-11e8-9592-525400cae48b | gr02        |        3306 | ONLINE       |
| group_replication_applier | 20c6a737-c880-11e8-bf58-525400cae48b | gr03        |        3306 | ONLINE       |
| group_replication_applier | bb15fca5-c87d-11e8-bcbf-525400cae48b | gr01        |        3306 | ONLINE       |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0.00 sec)

Prepare the sysbench tables and start the example script below:

gr01 > create database db1;
Query OK, 1 row affected (0.00 sec)

[root@gr01 ~]# sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb  --table-size=10000 --tables=10 --mysql-db=db1 --mysql-user=root --mysql-password=*** --threads=8 --time=10 --report-interval=1 --events=0 --db-driver=mysql prepare
(...)

#!/bin/bash

# Note: the inner loop uses its own variable to avoid shadowing the run counter.
for run in {1..100}; do
        rm -f test_?.out
        for n in {7..9}; do
                mysql -h192.168.56.9$n -udba -p*** -e "show global variables like 'gtid_executed'\G" >> test_$n.out 2>/dev/null
                mysql -h192.168.56.9$n -udba -p*** -e "flush binary logs" 2>/dev/null
        done
        diff3 test_?.out

        sysbench /usr/share/sysbench/oltp_read_write.lua --mysql_storage_engine=innodb --table-size=10000 --tables=10 --mysql-db=db1 --mysql-user=root --mysql-password=*** --threads=8 --time=10 --report-interval=1 --events=0 --db-driver=mysql run > /dev/null
done

While the test is running, introduce network instability on the interface responsible for cluster communication. In my case it was eth1:

[root@gr01 ~]# tc qdisc add dev eth1 root netem delay 20ms loss 40% 75% corrupt 5% 45%

After some time, the script should report different GTID positions, like:

====1
1:3c
          Value: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-295538
2:3c
3:3c
          Value: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-295534

This is the sign that the binlogs are already inconsistent. I used a very primitive diff method above, but with some smart script/tool comparing the events under the same GTID seqno, it should be caught much faster.
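Such a per-seqno comparison could be sketched as below (hypothetical: it groups event descriptions, e.g. the Info column of SHOW BINLOG EVENTS dumped per node, by the preceding GTID seqno and flags seqnos whose events differ; the sample data is shortened from the output above):

```python
# Hypothetical sketch: detect GTID seqnos that carry different transactions
# on two nodes, given per-node dumps of binlog event descriptions.
import re

def events_per_gtid(lines):
    """Group event-description lines by the GTID seqno that precedes them."""
    groups, current = {}, None
    for line in lines:
        m = re.search(r"GTID_NEXT= '.*:(\d+)'", line)
        if m:
            current = int(m.group(1))
            groups[current] = []
        elif current is not None:
            groups[current].append(line.strip())
    return groups

def diverging_gtids(a, b):
    """Seqnos present on both nodes whose event lists differ."""
    return sorted(g for g in a.keys() & b.keys() if a[g] != b[g])

gr01 = [
    "SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533'",
    "table_id: 125 (db1.sbtest6)",
    "table_id: 116 (db1.sbtest5)",
]
gr02 = [
    "SET @@SESSION.GTID_NEXT= 'aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:295533'",
    "table_id: 108 (db1.sbtest5)",
    "table_id: 110 (db1.sbtest4)",
]
print(diverging_gtids(events_per_gtid(gr01), events_per_gtid(gr02)))  # [295533]
```

A real tool would feed it mysqlbinlog output from each node instead of the embedded samples.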

Very likely the primary writer will be expelled from the cluster, but I also experienced a case where the cluster kept running for a while with inconsistent GTIDs and failed later with a simple 1032 or 1062 replication error.

Suggested fix:
GTIDs becoming inconsistent is IMHO a very serious issue. It leads to a data split-brain situation, something that should never happen (especially given the fact that we have just a single writer).
[9 Oct 14:31] Bogdan Kecman
Hi,

Took some hours but I reproduced the issue. Thanks for the report.

Bogdan

p.s. for some reason this required real hardware to be reproduced; inside a VM, even with different tcp/ip shaping, I was not able to reproduce it
[9 Oct 15:00] Bogdan Kecman
It would also help if you could upload your logs (binary, relay, and the error log).
[10 Oct 12:10] Przemyslaw Malkowski
Bogdan, do you mean that if I had uploaded them before, reproducing would have been easier, or that my logs will still be useful?
[11 Oct 10:09] Bogdan Kecman
They are still useful, so please upload yours if you can. Having two sets (mine and yours) will help with fixing it.
[19 Oct 10:14] Przemyslaw Malkowski
Ran the same tests on MySQL 8.0.12 and ended up with the same consistency disaster.
More details are in the attachment, but I think this shows how confusing the situation is:

gr01 > select * from performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15399414486931668:7
                                 MEMBER_ID: 1f912efe-c880-11e8-9592-525400cae48b
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 0
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-302651
            LAST_CONFLICT_FREE_TRANSACTION: 
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 0
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 2. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15399414486931668:7
                                 MEMBER_ID: 20c6a737-c880-11e8-bf58-525400cae48b
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 7108
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 0
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-302651
            LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:302650
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 7110
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 0
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
*************************** 3. row ***************************
                              CHANNEL_NAME: group_replication_applier
                                   VIEW_ID: 15399414486931668:7
                                 MEMBER_ID: 76df8268-c95e-11e8-b55d-525400cae48b
               COUNT_TRANSACTIONS_IN_QUEUE: 0
                COUNT_TRANSACTIONS_CHECKED: 7115
                  COUNT_CONFLICTS_DETECTED: 0
        COUNT_TRANSACTIONS_ROWS_VALIDATING: 18
        TRANSACTIONS_COMMITTED_ALL_MEMBERS: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:1-302651
            LAST_CONFLICT_FREE_TRANSACTION: aaaaaaaa-aaaa-aaaa-aaaa-aaaaaaaaaaaa:302657
COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE: 0
         COUNT_TRANSACTIONS_REMOTE_APPLIED: 4
         COUNT_TRANSACTIONS_LOCAL_PROPOSED: 7115
         COUNT_TRANSACTIONS_LOCAL_ROLLBACK: 0
3 rows in set (0.00 sec)
[19 Oct 10:17] Przemyslaw Malkowski
err logs and more details

Attachment: 92690-8.0.tgz (application/x-compressed-tar, text), 13.65 KiB.

[19 Oct 12:38] Bogdan Kecman
Thanks for the update, any way you can also compress and upload binary and relay logs?
[19 Oct 13:10] Bogdan Kecman
yes, please split into 3M parts
[19 Oct 14:08] Przemyslaw Malkowski
binlogs part1

Attachment: 92690-binlogs.tgz_00 (application/octet-stream, text), 2.92 MiB.

[19 Oct 14:08] Przemyslaw Malkowski
binlogs part2

Attachment: 92690-binlogs.tgz_01 (application/octet-stream, text), 2.92 MiB.

[19 Oct 14:09] Przemyslaw Malkowski
binlogs part3

Attachment: 92690-binlogs.tgz_02 (application/octet-stream, text), 2.92 MiB.

[19 Oct 14:10] Przemyslaw Malkowski
binlogs part4

Attachment: 92690-binlogs.tgz_03 (application/octet-stream, text), 2.92 MiB.

[19 Oct 14:10] Przemyslaw Malkowski
binlogs part5

Attachment: 92690-binlogs.tgz_04 (application/octet-stream, text), 2.92 MiB.

[19 Oct 14:11] Przemyslaw Malkowski
binlogs part6

Attachment: 92690-binlogs.tgz_05 (application/octet-stream, text), 116.04 KiB.

[1 Nov 8:00] jifa Xu
I have met the same problem with 5.7.22.

I use MGR in single-primary mode.

The primary node's mysqld process memory usage increased to 96% (about 310G; innodb_buffer_pool_size is 47G), then the system killed the mysqld process because of OOM.

After that, when starting group_replication, the old primary node had more transactions than the other two nodes. When checking the MySQL binlogs, I found that about 5 transactions exist only on the primary node.
[1 Nov 8:11] Bogdan Kecman
> I have met the same problem with 5.7.22

please, if you can, upload
 - mysql logs
 - relay logs
 - binary logs

that will speed up resolution of the bug

thanks
Bogdan
[2 Nov 10:48] jifa Xu
I add some log to mysql

and find some thing error in xcom_base.c executor_task

code:

	else if ((ep->p)->learner.msg->msg_type != no_op) {
		execute_msg(find_site_def(delivered_msg), ep->p, ep->p->learner.msg);
#if defined(TASK_DBUG_ON) && TASK_DBUG_ON
		DBGOUT(perf_dbg(&ep->n, &ep->old_n, &ep->old_t));
#endif
	}

When the error happened (with "tc qdisc add dev eth0 root netem loss 40%"), there was one message whose msg_type is normal on the primary node, but on a secondary node its msg_type is no_op (I used msgno and nodeno to make sure it is the same message). This then causes the data to become inconsistent.
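The check described above can be sketched as follows (hypothetical: it assumes the added logging has been reduced to a per-node map keyed by (msgno, nodeno), which is how the same message is identified across nodes here; the names and sample values are illustrative, not xcom's):

```python
# Hypothetical sketch: flag messages that were delivered with different
# msg_type on different nodes (e.g. normal on the primary, no_op on a
# secondary), keyed by (msgno, nodeno) as in the report above.
def msg_type_mismatches(primary, secondary):
    """Both args: dict mapping (msgno, nodeno) -> delivered msg_type."""
    return {key: (primary[key], secondary[key])
            for key in primary.keys() & secondary.keys()
            if primary[key] != secondary[key]}

primary_log   = {(100, 0): "normal", (101, 0): "normal"}
secondary_log = {(100, 0): "normal", (101, 0): "no_op"}
print(msg_type_mismatches(primary_log, secondary_log))
# {(101, 0): ('normal', 'no_op')}
```

Any non-empty result means the group delivered different values for the same Paxos slot, i.e. exactly the inconsistency observed in the binlogs.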
[13 Nov 5:54] Gosin Gu
I found this location in the MySQL source code:
https://github.com/dbgosin/mysql-server/blob/df0bc0a67b5cca06665e1d501a2f74b712af5cf5/rapi...

code : 

		if (gt_ballot(m->proposal, p->proposer.msg->proposal)) { /* greater */
			replace_pax_msg(&p->proposer.msg, m);
			assert(p->proposer.msg);
		}

When encountering the following scenario:

Three nodes: n0, n1, n2. One transaction was accepted only by node 0 and node 1, and node 2 initiated a prepare_op to nodes 0 and 1.

At this point, node 1's handle_prepare_op copies the acceptor state of its state machine into the ack_prepare_op reply to node 2. When node 2 receives node 1's paxos msg, I think a reasonable algorithm should behave like this:
    gt_ballot(m->proposal, p->proposer.msg->proposal) == true
    then: replace_pax_msg(&p->proposer.msg, m);

In fact this is not the case: because m->proposal == p->proposer.msg->proposal, the strict comparison ignores the already-accepted message sent by node 1.

Eventually, node 0 has completed the learn phase while node 1 has not. In addition, because of check_propose, node 2 causes node 1's accepted value to be replaced by node 2's empty message, and finally the node 1 and node 2 transactions are lost.
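The ballot-comparison point above can be illustrated with a small model (hypothetical and simplified: ballots ordered as (counter, node) pairs, a strict comparison mirroring gt_ballot, and a non-strict alternative; the names and the merge_reply helper are illustrative, not xcom's actual code):

```python
# Hypothetical model: with an equal ballot, a strictly-greater comparison
# discards the reply carrying node 1's already-accepted value, whereas a
# greater-or-equal comparison would keep it.
from collections import namedtuple

Ballot = namedtuple("Ballot", ["cnt", "node"])
Msg = namedtuple("Msg", ["proposal", "value"])

def gt_ballot(a, b):
    """Strict ballot ordering, modeled after the gt_ballot check above."""
    return (a.cnt, a.node) > (b.cnt, b.node)

def merge_reply(proposer_msg, reply, strict=True):
    """Keep the reply's value only if its ballot wins the comparison."""
    if strict:
        wins = gt_ballot(reply.proposal, proposer_msg.proposal)
    else:
        wins = not gt_ballot(proposer_msg.proposal, reply.proposal)  # >=
    return reply if wins else proposer_msg

mine  = Msg(Ballot(1, 2), None)          # node 2's own empty proposal
reply = Msg(Ballot(1, 2), "txn-295533")  # node 1's accepted value, equal ballot

print(merge_reply(mine, reply, strict=True).value)   # None: accepted value ignored
print(merge_reply(mine, reply, strict=False).value)  # txn-295533: value kept
```

This only demonstrates the scenario described in the comment; whether relaxing the comparison is the correct fix in xcom is for the developers to decide.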