Bug #84901: Can't execute transaction on second node
Submitted: 9 Feb 2017 4:37 | Modified: 11 Aug 2017 2:42
Reporter: Vadim Tkachenko
Status: Not a Bug
Category: MySQL Server: Group Replication | Severity: S1 (Critical)
Version: 5.7.17 | OS: Any
Assigned to: Nuno Carvalho | CPU Architecture: Any
[9 Feb 2017 4:37]
Vadim Tkachenko
[9 Feb 2017 12:46]
MySQL Verification Team
Hello Vadim, Thank you for the report and feedback. Thanks, Umesh
[9 Feb 2017 12:47]
MySQL Verification Team
Seen this issue frequently while verifying Bug#84900
Attachment: 84900.results (application/octet-stream, text), 7.82 KiB.
[9 Feb 2017 16:12]
Nuno Carvalho
Posted by developer: Hi Vadim, Thank you for evaluating Group Replication. Can you please include the member state after "update t1 set balance=1000 where id < 70000000;" on each member? That is, execute: SELECT * FROM performance_schema.replication_group_members; Best regards, Nuno Carvalho
[9 Feb 2017 18:56]
Vadim Tkachenko
Node1:

mysql> update t1 set balance=1000 where id < 70000000;
Query OK, 8388608 rows affected (1 min 28.39 sec)
Rows matched: 8388608  Changed: 8388608  Warnings: 0

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | d144d406-eee7-11e6-a430-0242ac120002 | 133f838e4f91 |        3306 | ONLINE       |
| group_replication_applier | d77757d9-eee7-11e6-a3b2-0242ac120003 | b7fcf6e3a4a1 |        3306 | ONLINE       |
| group_replication_applier | dda0209b-eee7-11e6-a37f-0242ac120004 | 2a66e0cd885e |        3306 | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec)

Node2:

mysql> update t2 set balance=117 where id > 10000000 and id < 10001000;
ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.

mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | d144d406-eee7-11e6-a430-0242ac120002 | 133f838e4f91 |        3306 | ONLINE       |
| group_replication_applier | d77757d9-eee7-11e6-a3b2-0242ac120003 | b7fcf6e3a4a1 |        3306 | ONLINE       |
| group_replication_applier | dda0209b-eee7-11e6-a37f-0242ac120004 | 2a66e0cd885e |        3306 | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec)
[9 Feb 2017 19:10]
Vadim Tkachenko
In fact, when I execute ROLLBACK on Node1 after the update, I get the same error on Node2. It seems that Node2 ends up in a totally unusable state.
[9 Feb 2017 19:59]
Vadim Tkachenko
I think this is related to https://bugs.mysql.com/bug.php?id=83218
[24 Mar 2017 16:52]
Nuno Carvalho
Hi Vadim,

Thank you for evaluating Group Replication.

I'm trying to reproduce your scenario. Prepare a group with 3 members with the following tables:

CREATE TABLE test.t1 (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `balance` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=239464779 DEFAULT CHARSET=latin1;

CREATE TABLE test.t2 (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `balance` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=239464779 DEFAULT CHARSET=latin1;

Insert 30000000 rows into each table.

Scenario:

Member 1:
mysql> use test;
Database changed
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> update t1 set balance=1000 where id < 70000000;
Query OK, 30000000 rows affected (3 min 48.05 sec)
Rows matched: 30000000  Changed: 30000000  Warnings: 0

Please note that I'm updating 30000000 rows while you update 9475864 rows.

Member 2:
mysql> update t2 set balance=117 where id > 10000000 and id < 10001000;
Query OK, 999 rows affected (0.13 sec)
Rows matched: 999  Changed: 999  Warnings: 0

All is OK, no rollback happens on Member 2. I'm unable to reproduce your scenario in which the transaction on member 2 is rolled back.

A different scenario that I am able to reproduce, after the same group preparation, is:

Member 1:
mysql> use test;
Database changed
mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> update t1 set balance=1000 where id < 70000000;
Query OK, 30000000 rows affected (3 min 48.05 sec)
Rows matched: 30000000  Changed: 30000000  Warnings: 0
mysql> commit;
ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | b95881bd-10a2-11e7-b220-0010e0734796 | server1     |       45000 | ERROR        |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
1 row in set (0.01 sec)

That is, member 1 was considered faulty while sending this big transaction, 1.3GB of data, and was expelled from the group. Error log:

[ERROR] Plugin group_replication reported: 'Member was expelled from the group due to network failures, changing member status to ERROR.'

This issue is BUG#84785: "Prevent Large Transactions in Group Replication", on which we are working on several fronts.

Can you please add more details about your scenario to the bug report? For example: What is the group size? Is there background load? What are the network specifications? What is the servers' configuration?

Best regards,
Nuno Carvalho
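As an aside beyond this comment: the BUG#84785 work mentioned above later resulted in a group_replication_transaction_size_limit system variable (added in later 5.7 releases, 5.7.19 if I recall correctly) that rejects an oversized transaction on the member that originates it instead of letting the broadcast destabilize the group. A minimal sketch, with a purely illustrative 150 MB value:

-- Hedged sketch: cap the size of Group Replication transactions on each member.
-- The variable is not available in the 5.7.17/5.7.18 builds discussed in this
-- report; 150000000 bytes is only an example value.
SET GLOBAL group_replication_transaction_size_limit = 150000000;

-- Verify the setting; transactions larger than this now fail at the origin
-- rather than the member being expelled while sending them.
SHOW VARIABLES LIKE 'group_replication_transaction_size_limit';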
[24 Mar 2017 17:09]
Vadim Tkachenko
Nuno, as I understand it, Umesh was able to repeat and verify the issue. Maybe you can collaborate internally between the teams?
[24 Mar 2017 17:15]
Nuno Carvalho
Hi Vadim, Umesh reproduced the second scenario that I listed in my reply, which is BUG#84785: "Prevent Large Transactions in Group Replication". I need more details in order to reproduce your exact scenario, in which the statement on member 2 is rolled back but no member is expelled from the group. Best regards, Nuno Carvalho
[24 Mar 2017 17:18]
Vadim Tkachenko
I will provide more information next week. In the meantime, it may be related to the configuration I use. The config file, for reference, is https://github.com/Percona-Lab/percona-docker/blob/master/pgr-57/node.cnf, I use docker images from https://hub.docker.com/r/perconalab/pgr-57/, and I start nodes as in https://github.com/Percona-Lab/percona-docker/blob/master/pgr-57/start_node.sh
[24 Mar 2017 17:25]
Nuno Carvalho
Hi Vadim, Thank you, I will try to reproduce with this information. Best regards, Nuno Carvalho
[4 May 2017 14:49]
Nuno Carvalho
Hi Vadim, As we discussed face to face last week, can you please provide more details on how to reproduce this issue? So far, I'm unable to reproduce it. Actually, since member1 does not commit, member2 is not aware of what is happening on member1, hence there is no reason to roll back the transaction, unless there is some environment issue going on. Since you are apparently able to reproduce it easily in your environment, can you please share all the environment details? Best regards, Nuno Carvalho
[4 May 2017 17:34]
Vadim Tkachenko
I can still easily repeat this with 5.7.18. I use Docker images built by myself and my own config file. There could be something wrong with my setup, but I'd like to understand what.

My setup:

docker images: https://hub.docker.com/r/perconalab/mgr/
I start docker nodes as: https://github.com/vadimtk/mgr-bug84901/blob/master/start_node.sh
All my Dockerfile and related files are: https://github.com/vadimtk/mgr-bug84901

So, I start nodes with start_node.sh and then execute the following.

Node3:

create database test1;
use test1;

CREATE TABLE `t1` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `balance` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

CREATE TABLE `t2` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `balance` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

mysql> insert into t1 (balance) value (200);
mysql> insert into t1 (balance) value (200);
mysql> insert into t1 (balance) value (200);
mysql> insert into t1 (balance) value (200);
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
mysql> insert into t1 (balance) select balance from t1;
Query OK, 2097152 rows affected (26.77 sec)
Records: 2097152  Duplicates: 0  Warnings: 0

insert into t2 (balance) select balance from t1;

select max(id) from t1;
+----------+
| max(id)  |
+----------+
| 32112256 |
+----------+

That is the setup. Now the bug.

On Node3:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)
mysql> update t1 set balance=1000 where id < 70000000;
Query OK, 4194304 rows affected (46.85 sec)
Rows matched: 4194304  Changed: 4194304  Warnings: 0

After this, on Node2:

mysql> update t2 set balance=117 where id > 10000000 and id < 10001000;
ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.
mysql> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 60b4ae24-30ec-11e7-99a9-0242ac130002 | 346b40136f08 |        3306 | ONLINE       |
| group_replication_applier | 66cca673-30ec-11e7-9af8-0242ac130003 | 66f0f8abce8e |        3306 | ONLINE       |
| group_replication_applier | 6cd4045c-30ec-11e7-9b61-0242ac130004 | d9ad815d3076 |        3306 | ONLINE       |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
3 rows in set (0.00 sec)
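A diagnostic that may help narrow this down (my suggestion, not something requested in the thread): Group Replication exposes certification counters in performance_schema, so one can check on the member that throws ERROR 3101 whether the rollbacks are being counted as certification conflicts and how many transactions are queued for certification. A minimal sketch, assuming a 5.7 member with the plugin running:

-- Run on the member reporting ERROR 3101 (Node2 here); these columns exist in
-- performance_schema.replication_group_member_stats on 5.7 with Group Replication.
SELECT MEMBER_ID,
       COUNT_TRANSACTIONS_IN_QUEUE,
       COUNT_TRANSACTIONS_CHECKED,
       COUNT_CONFLICTS_DETECTED,
       TRANSACTIONS_COMMITTED_ALL_MEMBERS
FROM performance_schema.replication_group_member_stats;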
[19 Jul 2017 12:39]
MySQL Verification Team
Sorry for the confusion: I mistakenly marked this as Verified today and rolled it back to Open after confirming with Nuno. Regards, Umesh
[10 Aug 2017 15:40]
Anibal Pinto
Thanks for providing the configurations and scripts that allowed us to build a docker image. I was able to reproduce the problem that you mentioned, but this is a configuration problem, as explained in detail below.

In your configuration you use loose-group_replication_allow_local_disjoint_gtids_join=ON. This is a last-resort option to join a server to the group when that server has transactions that are unknown to the group. There is no reason to use this option when starting a new group where no member has any executed GTIDs.

Thanks to this report we reached the conclusion that the documentation about this variable should be improved, something we will work on.

(detailed explanation in the next comment)
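As a quick aside on top of this comment: before START GROUP_REPLICATION on a freshly provisioned member it is easy to confirm that the last-resort option is still at its default and that no stray local transactions were logged during provisioning. A minimal sketch, assuming a 5.7 server with the group_replication plugin installed:

-- Should normally report OFF; the option is only a last resort for disjoint GTID sets.
SHOW VARIABLES LIKE 'group_replication_allow_local_disjoint_gtids_join';

-- A brand-new member should have nothing here before joining the group; a
-- non-empty value means provisioning statements were written to the binlog.
SELECT @@GLOBAL.GTID_EXECUTED;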
[10 Aug 2017 15:41]
Anibal Pinto
Detailed configuration issue
----------------------------

After removing this variable, your setup will now fail as expected. When starting the group there isn't a problem with the local GTIDs, as the other members will receive the executed GTIDs through the recovery process. However, when a member joins an existing group, the other members won't be able to get its locally executed GTIDs, and START GROUP_REPLICATION triggers an error:

ERROR 3092 (HY000): The server is not configured properly to be an active member of the group. Please see more details on error log.

If you consult the error log, it has:

[ERROR] Plugin group_replication reported: 'This member has more executed transactions than those present in the group. Local transactions: 7145971d-78fd-11e7-a691-0242ac120003:1-5 > Group transactions: 7107c96e-78fd-11e7-a6d6-0242ac120002:1-5, aaaaaaaa-bbbb-cccc-dddd-eeeeeeeeeeee:1'

But where did these GTIDs come from? In your configuration of the users, you have:

...
SET SQL_LOG_BIN=1;
CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';
FLUSH PRIVILEGES ;
...

After re-enabling the binlog you execute a FLUSH PRIVILEGES [1], which by default is written to the binlog; this statement needs to be executed while logging to the binlog is disabled.

The other problem is the docker image: after loading the configuration it has logged transactions, as seen in the GTID_EXECUTED value, and it shouldn't. We are investigating that with our docker team. What we advise, until the problem is solved in the docker image, is to execute a 'RESET MASTER' before installing the Group Replication plugin, to remove all entries from the binlog.

Why did the server2 transaction roll back?
------------------------------------------

Splitting your scenario over a timeline (T<Num>) we have:

T1: server1 docker start. Transactions are created by the docker statements.
    Group/Server GTIDs:
      server1: server_uuid1:1-5

T2: server1 configures and installs the plugin:
      INSTALL PLUGIN group_replication SONAME 'group_replication.so';
      SET @@SESSION.SQL_LOG_BIN=0;
      CREATE USER 'root'@'%' IDENTIFIED BY '${MYSQL_ROOT_PASSWORD}' ;
      GRANT ALL ON *.* TO 'root'@'%' WITH GRANT OPTION ;
      ALTER USER 'root'@'localhost' IDENTIFIED BY '${MYSQL_ROOT_PASSWORD}';
      CREATE USER rpl_user@'%';
      GRANT REPLICATION SLAVE ON *.* TO rpl_user@'%' IDENTIFIED BY 'rpl_pass';
      FLUSH PRIVILEGES;
      SET SQL_LOG_BIN=1;
      CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass' FOR CHANNEL 'group_replication_recovery';
      FLUSH PRIVILEGES ;
    Group/Server GTIDs:
      server1: server1_uuid:1-6

T3: server1 after mysqld restart. This was triggered by INSTALL PLUGIN group_replication SONAME 'group_replication.so', which bootstrapped the group on server1.
    Group/Server GTIDs:
      server1: group_uuid:1, server1_uuid:1-6

T4: server2 starts; transactions are created by the docker statements.
    Group/Server GTIDs:
      server1: group_uuid:1, server1_uuid:1-6
      server2: server_uuid2:1-5

T5: server2 configures and installs the plugin:
      INSTALL PLUGIN group_replication SONAME 'group_replication.so';
      [...]
      FLUSH PRIVILEGES ;
    Group/Server GTIDs:
      server1: group_uuid:1, server1_uuid:1-6
      server2: server2_uuid:1-6

T6: server2 after mysqld restart. This was triggered by INSTALL PLUGIN group_replication SONAME 'group_replication.so', which joined the server to the group.
    Group/Server GTIDs:
      server1: group_uuid:1-2, server1_uuid:1-6
      server2: group_uuid:1-2, server1_uuid:1-6, server2_uuid:1-6

T7: server3 starts; transactions are created by the docker statements.
    Group/Server GTIDs:
      server1: group_uuid:1-2, server1_uuid:1-6
      server2: group_uuid:1-2, server1_uuid:1-6, server2_uuid:1-6
      server3: server_uuid3:1-5

T8: server3 configures and installs the plugin:
      INSTALL PLUGIN group_replication SONAME 'group_replication.so';
      [...]
      FLUSH PRIVILEGES ;
    Group/Server GTIDs:
      server1: group_uuid:1-2, server1_uuid:1-6
      server2: group_uuid:1-2, server1_uuid:1-6, server2_uuid:1-6
      server3: server3_uuid:1-6

T9: server3 after mysqld restart; on this run server3 chose server2 as a donor. This was triggered by INSTALL PLUGIN group_replication SONAME 'group_replication.so', which joined the server to the group.
    Group/Server GTIDs:
      server1: group_uuid:1-3, server1_uuid:1-6
      server2: group_uuid:1-3, server1_uuid:1-6, server2_uuid:1-6
      server3: group_uuid:1-3, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6

T10: At this point the group is working with all members online. We insert all the data on node3:
      create database test1;
      use test1;
      CREATE TABLE `t1` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `balance` int(11) DEFAULT NULL,
        PRIMARY KEY (`id`)
      ) ENGINE=InnoDB;
      CREATE TABLE `t2` (
        `id` int(11) NOT NULL AUTO_INCREMENT,
        `balance` int(11) DEFAULT NULL,
        PRIMARY KEY (`id`)
      ) ENGINE=InnoDB;
      insert into t1 (balance) value (200);              -- x 4 times
      insert into t1 (balance) select balance from t1;   -- x 20 times
    Group/Server GTIDs:
      server1: group_uuid:1-30, server1_uuid:1-6
      server2: group_uuid:1-30, server1_uuid:1-6, server2_uuid:1-6
      server3: group_uuid:1-30, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6

T11: server3 executes: insert into t2 (balance) select balance from t1;
     This sets the version "group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6" as the latest version of all rows of table t2 in the members' certification DB.
    Group/Server GTIDs:
      server1: group_uuid:1-31, server1_uuid:1-6
      server2: group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6
      server3: group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6

T12: on server3:
      begin;
      update t1 set balance=1000 where id < 70000000;
    Group/Server GTIDs:
      server1: group_uuid:1-31, server1_uuid:1-6
      server2: group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6
      server3: group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6

T13: on server2 (transaction TX):
      update t2 set balance=117 where id > 10000000 and id < 10001000;
      ERROR 3101 (HY000): Plugin instructed the server to rollback the current transaction.
    Group/Server GTIDs:
      server1: group_uuid:1-31, server1_uuid:1-6
      server2: group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6
      server3: group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6

At T13 we reach the problematic situation: a transaction executed on server2 tries to commit an update to table t2 rows with the version "group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6". However, the latest version of the t2 rows is "group_uuid:1-31, server1_uuid:1-6, server2_uuid:1-6, server3_uuid:1-6". Since server2 has not seen the transactions "server3_uuid:1-6", T13 is rolled back.

From this point on, every transaction upon row X executed on server1 or server2 will roll back, because its version (the server's current GTID_EXECUTED) does not include the version of row X in the certification DB. For all transactions on server2, their version of row X will always be server1_uuid:1-6,group_uuid:1-N, which does not include server3_uuid:1-6, so they will always roll back.

For a transaction to be considered non-conflicting, its version must include the version that is in the certification DB, meaning that the current transaction has seen the previous one.

Example:
  certification DB: row X: AAA:1-4
  transaction T1: version AAA:1-4: action: rollback
  transaction T2: version AAA:1-5: action: commit

References
----------
[1] - https://dev.mysql.com/doc/refman/5.7/en/flush.html
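Putting the advice above into one place: a minimal sketch of a corrected provisioning order for each member. This is my own consolidation of the recommendations in this comment, not a script shipped by the developers; the user name and password are the ones from the quoted configuration.

-- Hedged sketch: run on each member before Group Replication starts.
RESET MASTER;        -- advised above: clear GTIDs logged by the docker
                     -- provisioning steps before the plugin is installed

SET SQL_LOG_BIN=0;   -- keep the replication user setup out of the binary log
CREATE USER rpl_user@'%' IDENTIFIED BY 'rpl_pass';
GRANT REPLICATION SLAVE ON *.* TO rpl_user@'%';
FLUSH PRIVILEGES;    -- now harmless: binary logging is still disabled
SET SQL_LOG_BIN=1;

CHANGE MASTER TO MASTER_USER='rpl_user', MASTER_PASSWORD='rpl_pass'
  FOR CHANNEL 'group_replication_recovery';

INSTALL PLUGIN group_replication SONAME 'group_replication.so';
-- The member then joins with an empty GTID_EXECUTED, so
-- group_replication_allow_local_disjoint_gtids_join is not needed at all.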
[11 Aug 2017 2:42]
Vadim Tkachenko
Thank you for looking into this! I made changes based on your recommendations and I no longer have this problem. However, I hit a different bug, which I will report separately.