Description:
Hi,
This bug is a variation, for Group Replication, of Bug#86078 (Bad Write Set tracking with UNIQUE KEY on a DELETE followed by an INSERT). It causes Group Replication to report lines like the following in the error log:
2018-01-01T18:29:30.880298Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
Note: Group Replication auto-magically restarted/retried the failed transaction. This is another bug that I will open, and I will add the bug number in the comments.
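The retry can be observed on the secondary; a query along the following lines (a sketch, using the same performance_schema table queried further down in this report) shows the counter increasing after each retried transaction:
> SELECT CHANNEL_NAME, COUNT_TRANSACTIONS_RETRIES
-> FROM performance_schema.replication_applier_status
-> WHERE CHANNEL_NAME = 'group_replication_applier';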
See the blog post below for more details:
https://jfg-mysql.blogspot.com/2018/01/write-set-in-mysql-5-7-group-replication.html
Many thanks for looking into that,
JFG
How to repeat:
Deploy an instance of Group Replication.
Run the following on the primary member of the group:
> SELECT version();
+------------+
| version() |
+------------+
| 5.7.20-log |
+------------+
1 row in set (0.00 sec)
> CREATE DATABASE test_jfg_ws;
Query OK, 1 row affected (0.01 sec)
> CREATE TABLE test_jfg_ws.test_jfg_ws (
-> id int(10) unsigned NOT NULL AUTO_INCREMENT PRIMARY KEY,
-> str varchar(80) NOT NULL UNIQUE);
Query OK, 0 rows affected (0.01 sec)
> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('c');
Query OK, 1 row affected (0.00 sec)
Run the following on a secondary member of the group, using "STOP GROUP_REPLICATION; START GROUP_REPLICATION;" to rotate the relay logs of the applier (FLUSH RELAY LOGS does not work with Group Replication; I will also open a bug/feature request on that and add the bug number in the comments):
> SELECT version();
+------------+
| version() |
+------------+
| 5.7.20-log |
+------------+
1 row in set (0.00 sec)
> STOP GROUP_REPLICATION; START GROUP_REPLICATION;
Query OK, 0 rows affected (9.26 sec)
Query OK, 0 rows affected (3.16 sec)
Run the following on the primary:
> DELETE FROM test_jfg_ws.test_jfg_ws WHERE str = 'c';
Query OK, 1 row affected (0.00 sec)
> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('C');
Query OK, 1 row affected (0.00 sec)
The relay logs of the secondary now contain the following (notice the overlapping intervals for the DELETE and the INSERT; the comments at the end of the lines were added for clarity):
# mysqlbinlog -vvv my_relaylog-group_replication_applier.N | grep -e last_ |
sed -e 's/server id.*last/[...] last/' -e 's/.rbr_only.*/ [...]/'
#180101 15:06:45 [...] last_committed=0 sequence_number=0 [...]
#180101 15:06:45 [...] last_committed=1 sequence_number=2 [...] -- DELETE c
#180101 15:06:45 [...] last_committed=1 sequence_number=3 [...] -- INSERT C
With the above intervals, the DELETE and the INSERT can be run in parallel: both transactions have last_committed=1, so a LOGICAL_CLOCK applier only waits for sequence_number 1 to commit before scheduling either of them, and nothing forces the DELETE (sequence_number=2) to complete before the INSERT (sequence_number=3) starts. If the INSERT is applied before the row is actually deleted, there will be a duplicate key violation.
To trigger the duplicate key violation, you can create a table with many rows (one million in my case) by running the following commands in a Linux shell on the primary:
# mysql <<< "
> CREATE TABLE test_jfg_ws.test_jfg_ws2 (
> id INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
> str VARCHAR(80) NOT NULL)"
# sql="INSERT INTO test_jfg_ws.test_jfg_ws2 (str) VALUES (RAND());"
# for i in $(seq 100); do
> ( echo "BEGIN;"; yes "$sql" | head -n 10000; echo "COMMIT;"; ) | mysql;
> done
The following ALTER needs to take more than two seconds so we have a chance to trigger the duplicate key violation. If it does not take more than two seconds, add more rows to the table by running the above for loop again.
> ALTER TABLE test_jfg_ws.test_jfg_ws2 MODIFY COLUMN str VARCHAR(60);
Query OK, 1000000 rows affected (3.57 sec)
Records: 1000000 Duplicates: 0 Warnings: 0
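If the ALTER completes too quickly, a quick row count (a sketch; the table name matches the one created above) helps to decide whether the insert loop should be run again:
> SELECT COUNT(*) FROM test_jfg_ws.test_jfg_ws2;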
Enable parallel replication on a secondary member of the group and look at the status of the applier:
> STOP GROUP_REPLICATION;
Query OK, 0 rows affected (9.83 sec)
> SET GLOBAL slave_parallel_type = LOGICAL_CLOCK;
Query OK, 0 rows affected (0.00 sec)
> SET GLOBAL slave_parallel_workers = 8;
Query OK, 0 rows affected (0.00 sec)
> START GROUP_REPLICATION;
Query OK, 0 rows affected (3.13 sec)
> SELECT * FROM performance_schema.replication_applier_status
-> WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
SERVICE_STATE: ON
REMAINING_DELAY: NULL
COUNT_TRANSACTIONS_RETRIES: 0
1 row in set (0.00 sec)
> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
-> WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
THREAD_ID: 1262
SERVICE_STATE: ON
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
1 row in set (0.00 sec)
Try to trigger Bug#86078 on the secondary by running the following commands in a Linux shell on the primary:
# for i in $(seq 5); do
> mysql <<< "
> SET autocommit = ON;
> ALTER TABLE test_jfg_ws.test_jfg_ws2 MODIFY COLUMN str VARCHAR(80);
> DELETE FROM test_jfg_ws.test_jfg_ws WHERE str = 'c';
> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('C');
> ALTER TABLE test_jfg_ws.test_jfg_ws2 MODIFY COLUMN str VARCHAR(60);
> DELETE FROM test_jfg_ws.test_jfg_ws WHERE str = 'C';
> INSERT INTO test_jfg_ws.test_jfg_ws (str) VALUES ('c');"
> done
After running the above, I get the following on the secondary where I enabled parallel replication:
> SELECT * FROM performance_schema.replication_applier_status
-> WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
SERVICE_STATE: ON
REMAINING_DELAY: NULL
COUNT_TRANSACTIONS_RETRIES: 4
1 row in set (0.00 sec)
> SELECT * FROM performance_schema.replication_applier_status_by_coordinator
-> WHERE CHANNEL_NAME = 'group_replication_applier'\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
THREAD_ID: 4531
SERVICE_STATE: ON
LAST_ERROR_NUMBER: 1062
LAST_ERROR_MESSAGE: Coordinator stopped because there were error(s) in the worker(s).
The most recent failure being: Worker 2 failed executing transaction 'UUID:147' at
master log , end_log_pos 168. See error log and/or performance_schema.replication_applier_status_by_worker
table for more details about this failure or others, if any.
LAST_ERROR_TIMESTAMP: 2018-01-01 19:29:30
1 row in set (0.00 sec)
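The LAST_ERROR_MESSAGE above points at performance_schema.replication_applier_status_by_worker; a query along these lines (a sketch, not part of my original test) shows which worker hit the duplicate key error:
> SELECT WORKER_ID, LAST_ERROR_NUMBER, LAST_ERROR_MESSAGE, LAST_ERROR_TIMESTAMP
-> FROM performance_schema.replication_applier_status_by_worker
-> WHERE CHANNEL_NAME = 'group_replication_applier' AND LAST_ERROR_NUMBER <> 0\G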
I also got the following in the error log:
2018-01-01T18:29:04.847233Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:129' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-01T18:29:13.386985Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:135' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-01T18:29:25.979746Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:144' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'C' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
2018-01-01T18:29:30.880298Z 4499 [ERROR] Slave SQL for channel 'group_replication_applier': Worker 2 failed executing transaction 'UUID:147' at master log , end_log_pos 168; Could not execute Write_rows event on table test_jfg_ws.test_jfg_ws; Duplicate entry 'c' for key 'str', Error_code: 1062; handler error HA_ERR_FOUND_DUPP_KEY; the event's master log FIRST, end_log_pos 168, Error_code: 1062
So I got the error 4 times out of a potential 10 (5 iterations of the loop, each with 2 DELETE/INSERT pairs). In [1], I got the error 7 times out of a potential 20.
[1]: https://jfg-mysql.blogspot.com/2018/01/write-set-in-mysql-5-7-group-replication.html
Suggested fix:
Backport the fix for Bug#86078 to MySQL 5.7.
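Until such a backport is available, a possible interim mitigation (a sketch, not a fix; it simply undoes the parallel-applier settings enabled above) is to avoid LOGICAL_CLOCK parallelism on the secondaries:
> STOP GROUP_REPLICATION;
> SET GLOBAL slave_parallel_workers = 0;
> SET GLOBAL slave_parallel_type = DATABASE;
> START GROUP_REPLICATION;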