Bug #84900 Getting inconsistent result on different nodes
Submitted: 9 Feb 2017 2:39 Modified: 17 Feb 2017 18:38
Reporter: Vadim Tkachenko
Status: Not a Bug
Category: MySQL Server: Group Replication
Severity: S1 (Critical)
Version: 5.7.17
OS: Any
CPU Architecture: Any

[9 Feb 2017 2:39] Vadim Tkachenko
Description:
This is a follow-up to the bug
https://bugs.mysql.com/bug.php?id=82481

I am using the following tables:

mysql> show create table t1\G
*************************** 1. row ***************************
       Table: t1
Create Table: CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `balance` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=239464779 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show create table t2\G 
*************************** 1. row ***************************
       Table: t2
Create Table: CREATE TABLE `t2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `balance` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=239007820 DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

How to repeat:
The procedure to populate table t1 is as follows:

Insert 30+ million rows into table t1; all rows have balance=200.

Table t2 is populated with:

insert into t2 (balance) select balance from t1;
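
For illustration, here is a minimal sketch of one way to produce such a data set (a hypothetical seeding procedure, not necessarily the one used for this report): seed one row, double t1 by self-insert until it is large enough, then copy the balances into t2.

insert into t1 (balance) values (200);
-- each run doubles the table; ~25 repetitions give 2^25 > 33 million rows
insert into t1 (balance) select balance from t1;
insert into t2 (balance) select balance from t1;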

Now, on node 1, I execute the following transaction:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> update t1 set balance=1000 where id < 70000000;
Query OK, 9475864 rows affected (1 min 42.59 sec)
Rows matched: 9475864  Changed: 9475864  Warnings: 0

mysql> update t2 set balance=1000 where id < 70000000;
Query OK, 9475864 rows affected (1 min 42.28 sec)
Rows matched: 9475864  Changed: 9475864  Warnings: 0

mysql> commit;
Query OK, 0 rows affected (3 min 9.50 sec)

mysql> select count(*) from t1 where balance=1000;
+----------+
| count(*) |
+----------+
|  9475864 |
+----------+
1 row in set (9.41 sec)

But after the commit finished, and even a couple of minutes later, I execute on node 2:

mysql> select count(*) from t1 where balance=1000;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (13.25 sec)

There is an inconsistency between the results on node 1 and node 2.

Suggested fix:
Make all nodes consistent.
[9 Feb 2017 12:45] MySQL Verification Team
Hello Vadim,

Thank you for the report and feedback.

Thanks,
Umesh
[9 Feb 2017 12:46] MySQL Verification Team
test results

Attachment: 84900.results (application/octet-stream, text), 7.82 KiB.

[9 Feb 2017 16:10] Nuno Carvalho
Posted by developer:
 
Hi Vadim,

Thank you for evaluating Group Replication.

Can you please include the member state after "select count(*) from t1 where balance=1000;" on each member?
That is, execute:
SELECT * FROM performance_schema.replication_group_members;

Best regards,
Nuno Carvalho
[9 Feb 2017 17:02] Vadim Tkachenko
Node1:

mysql> select count(*) from t1 where balance=1000;
+----------+
| count(*) |
+----------+
|  9475864 |
+----------+
1 row in set (4.13 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | d144d406-eee7-11e6-a430-0242ac120002 | 133f838e4f91 |        3306 | ONLINE       |
| group_replication_applier | d77757d9-eee7-11e6-a3b2-0242ac120003 | b7fcf6e3a4a1 |        3306 | ONLINE       |
| group_replication_applier | dda0209b-eee7-11e6-a37f-0242ac120004 | 2a66e0cd885e |        3306 | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec)

Node2:
mysql> select count(*) from t1 where balance=1000;
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (4.87 sec)

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | d144d406-eee7-11e6-a430-0242ac120002 | 133f838e4f91 |        3306 | ONLINE       |
| group_replication_applier | d77757d9-eee7-11e6-a3b2-0242ac120003 | b7fcf6e3a4a1 |        3306 | ONLINE       |
| group_replication_applier | dda0209b-eee7-11e6-a37f-0242ac120004 | 2a66e0cd885e |        3306 | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec)
[17 Feb 2017 9:09] Nuno Carvalho
Hi Vadim,

Thank you for this bug report and for giving Group Replication a
try. We value all the feedback we can get.

Summary
=======

In this particular case, the system is operating properly. The fact
that you change 30,000,000 rows in a single transaction means that
these changes need to be *propagated*, and then *applied* on the
receiving end. The originating node only waits for the propagation to
take place, and this may take more time for transactions changing a
large number of rows than for those changing a small number, in GR or
in any other replicated system.

Yes, GR has provision for on-the-wire compression, which can help,
but then it is a trade-off of how many CPU cycles you spend on
compression instead of on propagating data. So it really depends on
the context (workload, network properties, ...).

Also, you may be running into a misconception or wrong assumption from
the start. GR is not a synchronous replication system. This means that
changes are propagated to each node in the group and eventually
applied to the database. Yes, there is a *logical* synchronization
point, where servers agree on the message delivery order, but the
entire process is asynchronous.

Scenario
========

Here are some specific details about the scenario reported in this
bug.

For a transaction to commit, the majority of the group has to agree
on the order of a given transaction in the global sequence of
transactions.  Deciding to commit or abort a transaction is done by
each server individually, but all servers make the same decision.

After the agreement on the order of a given transaction in the global
sequence of transactions, each server applies the transaction
asynchronously. That is, each server applies data at its own pace,
which means that data may be committed sooner on some members than on
others.

This is exactly what is happening in this scenario: after the order of
T1 is agreed on by all servers, server 1 only has to commit it to InnoDB
and the binary log, since all other processing is already done. Server 2,
on the other hand, still needs to process everything: read the
replication events, apply the rows, and then commit to InnoDB and the
binary log. Since this transaction changes more than 30,000,000 rows,
when you query server 2 it is still applying T1; you can check that with
SELECT @@GLOBAL.GTID_EXECUTED and the SHOW PROCESSLIST command.

Example:
[connection server1]
> SELECT @@GLOBAL.GTID_EXECUTED;
62682ba7-f114-11e6-a941-a6c3dd0a98de:1-6

[connection server2]
> SELECT @@GLOBAL.GTID_EXECUTED;
62682ba7-f114-11e6-a941-a6c3dd0a98de:1-5

T1 is being read from the relay log and applied by the applier thread.
[connection server2]
> SHOW PROCESSLIST;
Id  User  Host  db  Command Time  State Info
...
9 system user   NULL  Connect 62  executing NULL
12  system user   NULL  Connect 11  Reading event from the relay log

T1 was read completely and is being applied.
[connection server2]
> SHOW PROCESSLIST;
Id  User  Host  db  Command Time  State Info
...
9 system user   NULL  Connect 71  executing NULL
12  system user   NULL  Connect 20  Slave has read all relay log;
    waiting for more updates  NULL

We can also check the progress in the replication performance_schema
tables. T1 was written to the relay log: RECEIVED_TRANSACTION_SET
already contains T1 (62682ba7-f114-11e6-a941-a6c3dd0a98de:6).
[connection server2]
> SELECT * FROM performance_schema.replication_connection_status WHERE
  channel_name='group_replication_applier';
CHANNEL_NAME  group_replication_applier
GROUP_NAME  62682ba7-f114-11e6-a941-a6c3dd0a98de
SOURCE_UUID 62682ba7-f114-11e6-a941-a6c3dd0a98de
THREAD_ID NULL
SERVICE_STATE ON
COUNT_RECEIVED_HEARTBEATS 0
LAST_HEARTBEAT_TIMESTAMP  0000-00-00 00:00:00
RECEIVED_TRANSACTION_SET  62682ba7-f114-11e6-a941-a6c3dd0a98de:1-6

The applier has already seen T1.
[connection server2]
> SELECT * FROM performance_schema.replication_applier_status_by_worker
  WHERE channel_name='group_replication_applier';
CHANNEL_NAME  group_replication_applier
WORKER_ID 0
THREAD_ID 29
SERVICE_STATE ON
LAST_SEEN_TRANSACTION 62682ba7-f114-11e6-a941-a6c3dd0a98de:6

T1 is applied.
[connection server2]
> SELECT @@GLOBAL.GTID_EXECUTED;
62682ba7-f114-11e6-a941-a6c3dd0a98de:1-6

In the current version, Group Replication does not provide synchronous
replication; that is stated in the manual at:
https://dev.mysql.com/doc/refman/5.7/en/group-replication-frequently-asked-questions.html#...
https://dev.mysql.com/doc/refman/5.7/en/group-replication-flow-control.html
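
The second link covers flow control, which can throttle writers when appliers fall behind. As a minimal sketch (these are stock 5.7 system variables; the values shown are the defaults):

-- throttle writers based on applier queue size; 'DISABLED' turns it off
SET GLOBAL group_replication_flow_control_mode = 'QUOTA';
SET GLOBAL group_replication_flow_control_applier_threshold = 25000;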

We will state that more clearly at the beginning of the Group
Replication manual chapter.

For reference, T1 is:
  begin;
  update t1 set balance=1000 where id < 70000000;
  update t2 set balance=1000 where id < 70000000;
  commit;

If you feel that I've missed something, please let me know.

Best regards,
Nuno Carvalho
[17 Feb 2017 18:38] Vadim Tkachenko
OK, I see your point.
Then please consider this a feature request: provide a way for a read
transaction to read the actual data, not stale data.
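
For reference, a minimal sketch of a partial workaround already available in 5.7, assuming the reading session knows the writer's GTID set (WAIT_FOR_EXECUTED_GTID_SET is a stock 5.7 function; the GTID value below is taken from the example above):

-- On node 1, after COMMIT, capture the executed GTID set:
SELECT @@GLOBAL.GTID_EXECUTED;
-- On node 2, block until that set has been applied locally
-- (the second argument is a timeout in seconds), then read:
SELECT WAIT_FOR_EXECUTED_GTID_SET('62682ba7-f114-11e6-a941-a6c3dd0a98de:1-6', 60);
SELECT count(*) FROM t1 WHERE balance=1000;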