Bug #115741 Slave stops with HA_ERR_KEY_NOT_FOUND with HASH_SCAN and index used to search fo
Submitted: 1 Aug 2024 11:13 Modified: 7 Aug 2024 15:15
Reporter: Jan Lindström Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any

[1 Aug 2024 11:13] Jan Lindström
Description:
Changing the available MTR test:

{noformat}
diff --git a/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test b/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test
index 793e56e1a6a..4a89b0bdb29 100644
--- a/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test
+++ b/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test
@@ -37,7 +37,8 @@ CREATE FUNCTION f1 () RETURNS INT BEGIN
   UPDATE t1 SET i = 3 WHERE i = 2;
   UPDATE t1 SET i = 4 WHERE i = 3;
   UPDATE t1 SET i = 5 WHERE i = 4;
-  UPDATE t2 SET j = 2 WHERE j = 1;
+  UPDATE t2 SET j = 0 WHERE j = 1;
+  UPDATE t2 SET j = 2 WHERE j = 0;
   UPDATE t2 SET j = 3 WHERE j = 2;
   UPDATE t2 SET j = 4 WHERE j = 3;
   UPDATE t2 SET j = 5 WHERE j = 4;

leads to:

CURRENT_TEST: rpl.rpl_row_multi_update_of_same_row
mysqltest: At line 158: Error in sync_with_master.inc
In included file ./include/sync_slave_sql.inc: 159
included from ./include/sync_slave_sql_with_master.inc: 79
included from /xxx/suite/rpl/t/rpl_row_multi_update_of_same_row.test: 89

The result from queries just before the failure was:
**** mysql.gtid_executed on server_2 ****
SELECT * FROM mysql.gtid_executed;
source_uuid     interval_start  interval_end

rpl_topology=1->2
extra debug info if any: ''
connection default;
1 tests executed in this mtr thread: rpl.rpl_row_multi_update_of_same_row
last error log entry: SERVER_1:<none>; SERVER_2:<Thread: 13 Error: MY-010584: Replica SQL for channel '': Worker 1 failed executing transaction 'ANONYMOUS' at source log master-bin.000001, end
_log_pos 2080; Could not execute Update_rows event on table test.t2; Can't find record in 't2', Error_code: 1032; Can't find record in 't2', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUN
D; the event's source log master-bin.000001, end_log_pos 2080, Error_code: MY-001032>
rpl error summary: SERVER_2:(WORKERS:(CHANNEL:<> WORKER:1 ERROR:<Worker 1 failed executing transaction 'ANONYMOUS' at source log master-bin.000001, end_log_pos 2080; Could not execute Update_r
ows event on table test.t2; Can't find record in 't2', Error_code: 1032; Can't find record in 't2', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's source log master-bin.0000
01, end_log_pos 2080>) COORDINATORS:(CHANNEL:<> ERROR:<Coordinator stopped because there were error(s) in the worker(s). The most recent failure being: Worker 1 failed executing transaction 'A
NONYMOUS' at source log master-bin.000001, end_log_pos 2080. See error log and/or performance_schema.replication_applier_status_by_worker table for more details about this failure or others, i
f any.>))
connection slave;
ERROR: sync_slave_sql.inc failed on connection 'slave'
ERROR: use_gtids='0'
ERROR: _saved_gtids=''
ERROR: _saved_file='master-bin.000001'
ERROR: _saved_pos='2111'
ERROR: _saved_channel_name=
ERROR: timeout='300'
ERROR: result='-2'
ERROR: error type: The slave SQL thread was not started, the slave's master information was not initialized, the arguments were incorrect, or an error occurred.
safe_process[8745]: Child process: 8746, exit: 1
{noformat}

The same effect may be achieved when the table definition is as following (no-PK table with an index):
CREATE TABLE t1 (a INT, INDEX t1_i(a)) ENGINE=InnoDB;

The reason is that there are multiple updates to the same row, and one of the intermediate updates has a key that sorts before the original key value of the row.
This leads to such a non-existent key be searched first in the index, thus resulting in an error.

See m_distinct_keys set declaration (comparator) and its usage in log_event.cc

There is some other problems related to this algorithm of updating rows, they mostly come from the same scenario of updating a row multiple times. Looks like as I initially assumed the algorithm is not correct, and I think a radical fix is needed to make it work.

How to repeat:
{noformat}
diff --git a/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test b/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test
index 793e56e1a6a..4a89b0bdb29 100644
--- a/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test
+++ b/mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test
@@ -37,7 +37,8 @@ CREATE FUNCTION f1 () RETURNS INT BEGIN
   UPDATE t1 SET i = 3 WHERE i = 2;
   UPDATE t1 SET i = 4 WHERE i = 3;
   UPDATE t1 SET i = 5 WHERE i = 4;
-  UPDATE t2 SET j = 2 WHERE j = 1;
+  UPDATE t2 SET j = 0 WHERE j = 1;
+  UPDATE t2 SET j = 2 WHERE j = 0;
   UPDATE t2 SET j = 3 WHERE j = 2;
   UPDATE t2 SET j = 4 WHERE j = 3;
   UPDATE t2 SET j = 5 WHERE j = 4;
{noformat}

./mtr rpl.rpl_row_multi_update_of_same_row
[2 Aug 2024 15:46] Jean-François Gagné
This looks like a duplicate of Bug#105802.
[5 Aug 2024 12:33] MySQL Verification Team
Hi Jan, thanks for the report. As JFG said this does look like duplicate of #105802 and I cannot reproduce this on latest 8.0. What version did you reproduce this on?

Thanks
[6 Aug 2024 9:57] Jan Lindström
he issue reproduces on mysql-8.0.39 tag debug built from the official repository.

Here's the refined reproducer for this issue:

{noformat}
--source include/have_binlog_format_row.inc
--source include/set_privilege_checks_user_as_system_user.inc
--source include/master-slave.inc

--source include/rpl_connection_slave.inc

--source include/rpl_connection_master.inc
# Create a table, with no primary key and an index.
CREATE TABLE t1 (i INT, INDEX t1_i(i));

# Create a stored function so that only one Update_rows_log_event is generated.
--delimiter |
CREATE FUNCTION f1 () RETURNS INT BEGIN
  UPDATE t1 SET i = 0 WHERE i = 1;
  UPDATE t1 SET i = 2 WHERE i = 0;
  UPDATE t1 SET i = 3 WHERE i = 2;
  UPDATE t1 SET i = 4 WHERE i = 3;
  UPDATE t1 SET i = 5 WHERE i = 4;
  RETURN 0;
END|
--delimiter ;
--source include/sync_slave_sql_with_master.inc

--source include/rpl_connection_master.inc
INSERT INTO t1 VALUES (1);
SELECT f1();

--source include/sync_slave_sql_with_master.inc
--let $assert_text= 'There is only one row in table t1'
--let $assert_cond= [SELECT COUNT(*) FROM t1] = 1
--source include/assert.inc

--source include/rpl_connection_master.inc

# Verify that there is no difference between tables of master and slave.
--let diff_tables=master:test.t1, slave:test.t1
--source include/diff_tables.inc

# Cleanup
--source include/rpl_connection_master.inc
DROP FUNCTION f1;
DROP TABLE t1;

--source include/rpl_connection_slave.inc
--source include/rpl_end.inc
{noformat}

We also analyzed #105802 ticket.

Here's the reproducer for #105802 we used and that is described in the comment of the corresponding ticket:

{noformat}
--source include/have_binlog_format_row.inc
--source include/set_privilege_checks_user_as_system_user.inc
--source include/master-slave.inc

--source include/rpl_connection_slave.inc

--source include/rpl_connection_master.inc
# Create a table, with no primary key.
CREATE TABLE t1 (i INT);

# Create a stored function so that only one Update_rows_log_event is generated.
--delimiter |
CREATE FUNCTION f1 () RETURNS INT BEGIN
  UPDATE t1 SET i = 1 WHERE i = 4;
  UPDATE t1 SET i = 2 WHERE i = 1;
  UPDATE t1 SET i = 3 WHERE i = 2;
  UPDATE t1 SET i = 2 WHERE i = 3;
  UPDATE t1 SET i = 1 WHERE i = 2;
  UPDATE t1 SET i = 4 WHERE i = 1;
  RETURN 0;
END|
--delimiter ;
--source include/sync_slave_sql_with_master.inc

--source include/rpl_connection_master.inc
INSERT INTO t1 VALUES (1), (2), (3), (4);
SELECT f1();

--source include/sync_slave_sql_with_master.inc
--let $assert_text= 'There are only four rows in table t1'
--let $assert_cond= [SELECT COUNT(*) FROM t1] = 4
--source include/assert.inc

--source include/rpl_connection_master.inc

# Verify that there is no difference between tables of master and slave.
--let diff_tables=master:test.t1, slave:test.t1
--source include/diff_tables.inc

# Cleanup
--source include/rpl_connection_master.inc
DROP FUNCTION f1;
DROP TABLE t1;

--source include/rpl_connection_slave.inc
--source include/rpl_end.inc
{noformat}

Both reproducers fail, but they fail for two different reasons.

The problem with #105802 is that intermediate updates to the same row might not be applied (skipped), and updates that left result in the error that not all entries are applied from the log event.

The current issue is different and it's only reproduced when there's an index.
The problem is that one of the intermediate updates to the same row has a key that sorts before the original key value of the row and then it's searched first in the index, generating key not found error.

Two issues look similar because they have the same root cause: multiple updates to the same row within a single update rows log event.
But they are different, please, try the refined tests once again.
[6 Aug 2024 10:44] MySQL Verification Team
Hi Jan,

I'll retest with your new test case on debug version of .39. Thanks.
[7 Aug 2024 15:15] MySQL Verification Team
Verified as described
[22 Aug 2024 6:55] huahua xu
Hi, all:

It may be helpful to solving the problem: https://bugs.mysql.com/bug.php?id=106918
[1 Jan 16:34] Varun Nagaraju
When a replica's slave_rows_search_algorithms is set to HASH_SCAN, it prepares a unique key list for all the rows in a particular Row_event. The same unique key list will later be used to retrieve all tuples associated to each key in the list from storage engine. In the case of multiple updates targeted at the same row like how it is shown in the testcase, it may happen that this unique key list filled with entries which don't exist yet in the table. This is a problem when there is an intermediate update which changes the value of the index column to a lesser value than the original entry and that changed value is used in another update. It is an issue because the unique key list is a std::set which internally sorts it's entries. When this sorting happens, the first entry of the list could potentially be a value which doesn't exist in the table and the when it is searched in next_record_scan() method, it fails returning HA_ERR_KEY_NOT_FOUND error.  
So, Instead of using std::set to store the distinct keys, a combination of unordered_set and a list is used to preserve the original order of updates and avoid duplicates at the same time which prevents the side effects of sorting.
[1 Jan 16:34] Varun Nagaraju
adding the patch as contribution

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-PS-9391-Fixes-replication-break-error-because-of-HAS.patch (text/x-patch), 12.74 KiB.

[21 Jan 12:46] Varun Nagaraju
Contributing a refactored patch based on v8.0.40

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 0001-PS-9391-Fixes-replication-break-error-because-of-HAS.patch (text/x-patch), 13.52 KiB.