Bug #105802 Slave stops with HA_ERR_KEY_NOT_FOUND (INDEX_SCAN,HASH_SCAN)
Submitted: 5 Dec 2021 12:43    Modified: 21 Dec 2021 1:27
Reporter: zhijun long
Status: Verified               Impact on me: None
Category: MySQL Server: Replication    Severity: S2 (Serious)
Version: 5.7.*; 8.0.*          OS: Any
Assigned to:                   CPU Architecture: Any

[5 Dec 2021 12:43] zhijun long
Description:
Slave stops with HA_ERR_KEY_NOT_FOUND when slave_rows_search_algorithms is set to INDEX_SCAN,HASH_SCAN, which is the default setting in version 8.0.

**** SHOW SLAVE STATUS on slave ****
SHOW SLAVE STATUS;
Slave_IO_State	Waiting for master to send event
Master_Host	127.0.0.1
Master_User	root
Master_Port	13000
Connect_Retry	1
Master_Log_File	master-bin.000001
Read_Master_Log_Pos	1264
Relay_Log_File	slave-relay-bin.000002
Relay_Log_Pos	985
Relay_Master_Log_File	master-bin.000001
Slave_IO_Running	Yes
Slave_SQL_Running	No
Replicate_Do_DB	
Replicate_Ignore_DB	
Replicate_Do_Table	
Replicate_Ignore_Table	
Replicate_Wild_Do_Table	
Replicate_Wild_Ignore_Table	
Last_Errno	1032
Last_Error	Could not execute Update_rows event on table uniticket.t1; Can't find record in 't1', Error_code: 1032; Can't find record in 't1', Error_code: 1032; handler error HA_ERR_KEY_NOT_FOUND; the event's master log master-bin.000001, end_log_pos 1233

How to repeat:
MTR test:

--source include/master-slave.inc
--connection master
create database uniticket;
use uniticket;

CREATE TABLE `t1` (
  `a` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `b` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `c` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `d` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  UNIQUE KEY `k1` (`a`,`b`,`c`)
) ENGINE=InnoDB;

INSERT INTO `t1` (`d`,`a`,`c`,`b`) VALUES('2021-12-04 16:42:53','A','B','A')  ,('2021-12-04 16:42:54','A','B','A')  ,('2021-12-04 16:42:53','A','B','A')  ,('2021-12-04 16:42:55','A','B','A') ON DUPLICATE KEY UPDATE `d`=VALUES(`d`),`a`=VALUES(`a`),`c`=VALUES(`c`),`b`=VALUES(`b`);

--source include/sync_slave_sql_with_master.inc
--connection master
DROP TABLE t1;
--source include/rpl_end.inc
[21 Dec 2021 1:31] MySQL Verification Team
Thanks for the test case. Verified.
[29 Mar 2022 22:40] Venkatesh Duggirala
This is a serious issue (given that the slave_rows_search_algorithms default is INDEX_SCAN,HASH_SCAN in 8.0) and should be an S2 issue (not S3).
[8 Nov 2023 16:06] Aaditya Dubey
Hi Team,

Any updates on the fix?
[13 Dec 2023 10:08] Varun Nagaraju
This is a unique corner case in which multiple updates target the same row in a single rows log event and the slave_rows_search_algorithms system variable is set to HASH_SCAN (by default it is INDEX_SCAN,HASH_SCAN).
So, the test case can be simplified as:

--source include/master-slave.inc
--connection slave
set @prior_slave_rows_search_algorithms=@@global.slave_rows_search_algorithms;
set @@global.slave_rows_search_algorithms= 'HASH_SCAN';
--connection master
create database uniticket;
use uniticket;
CREATE TABLE `t1` (
  `a` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `b` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `c` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  `d` varchar(60) COLLATE utf8mb4_general_ci DEFAULT NULL,
  UNIQUE KEY `k1` (`a`,`b`,`c`)
) ENGINE=InnoDB;
INSERT INTO 
  `t1` (`a`,`b`,`c`,`d`) 
VALUES
   ('A','A','B','2021-12-04 16:42:53'),
   ('A','A','B','2021-12-04 16:42:54'),
   ('A','A','B','2021-12-04 16:42:53'),
   ('A','A','B','2021-12-04 16:42:55')
ON DUPLICATE KEY UPDATE 
   `a`=VALUES(`a`),
   `b`=VALUES(`b`),
   `c`=VALUES(`c`),
   `d`=VALUES(`d`);

--echo # t1 contents on master
SELECT * FROM uniticket.t1;

--source include/sync_slave_sql_with_master.inc
--connection slave
--echo # t1 contents on slave
SELECT * FROM uniticket.t1;

--connection master
drop database uniticket;
--connection slave
set @@global.slave_rows_search_algorithms= @prior_slave_rows_search_algorithms;
--source include/rpl_end.inc

Hash scan generally works in the following way:
Whenever HASH_SCAN is included in slave_rows_search_algorithms, the hash table is first populated with the information from the write sets in the rows log event, with
keys : Hash_slave_rows::make_hash_key(TABLE *table, MY_BITMAP *cols), dependent on the table object and the bitmap of columns from the master.
values : a smart pointer to the positions of the before image of a write set.
Only after this hash table is generated from the write sets are the changes applied, in Rows_log_event::do_scan_and_update(Relay_log_info const *rli).
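As a rough illustration of that key/value layout, here is a simplified sketch with hypothetical types (not the server's actual definitions); the field names only mirror the entry->preamble->hash_value and entry->positions->bi_start/bi_ends accesses used in the debug patch further below:

#include <cstdint>
#include <memory>
#include <unordered_map>

// Simplified sketch of one hash entry, not the server's actual definitions.
struct Row_positions {
  const unsigned char *bi_start;  // where the before image starts in the event buffer
  const unsigned char *bi_ends;   // where it ends (the after image, if any, follows)
};

// One entry per write set, keyed by a hash computed from the before-image
// columns; write sets with identical before images share the same key.
using Slave_rows_hash =
    std::unordered_multimap<std::uint32_t, std::unique_ptr<Row_positions>>;

Both phases are driven by Rows_log_event::do_hash_scan_and_update():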

int Rows_log_event::do_hash_scan_and_update(Relay_log_info const *rli) {
  DBUG_TRACE;
  assert(m_table && m_table->in_use != nullptr);
  // HASHING PART
  /* unpack the BI (and AI, if it exists) and add it to the hash map. */
  if (int error = this->do_hash_row(rli)) return error;
  /* We have not yet hashed all rows in the buffer. Do not proceed to the SCAN
   * part. */
  if (m_curr_row_end < m_rows_end) return 0;
  DBUG_PRINT("info", ("Hash was populated with %d records!", m_hash.size()));
  assert(m_curr_row_end == m_rows_end);
  // SCANNING & UPDATE PART
  return this->do_scan_and_update(rli);
}

do_scan_and_update(rli) performs a single table scan. For each row in the scan, the hash table is looked up and the table record from the storage engine is compared against the before image stored in the matching hash entry.
If there is a match, that particular hash entry is deleted and the write set to which the before image belongs is applied. This process is repeated until there are no more entries in the hash.
(Error scenario: if the table scan completes and there are still entries left in the hash, this indicates an inconsistency between source and replica and the apply fails.)
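
To make those two phases concrete, here is a minimal, self-contained toy model of the HASH_SCAN apply, using plain integers instead of row images and std::unordered_multimap directly instead of Hash_slave_rows (all names below are illustrative, not the server's API):

#include <cstdio>
#include <unordered_map>
#include <vector>

// Toy write set: before image (BI) and after image (AI), both plain ints.
struct WriteSet { int bi, ai; };

int main() {
  std::vector<int> table = {10, 20};                   // rows on the replica
  std::vector<WriteSet> event = {{10, 11}, {20, 21}};  // updates in one rows event

  // HASHING PART: key derived from the before image, value = the write set.
  std::unordered_multimap<int, WriteSet> hash;
  for (const WriteSet &ws : event) hash.emplace(ws.bi, ws);

  // SCANNING & UPDATE PART: a single table scan; for every row, probe the
  // hash, compare the row against the stored BI, apply the AI and erase
  // exactly that one entry.
  for (int &row : table) {
    if (hash.empty()) break;
    auto range = hash.equal_range(row);
    for (auto it = range.first; it != range.second; ++it) {
      if (it->second.bi == row) {  // redundant here, mirrors the collision check
        row = it->second.ai;       // apply the after image
        hash.erase(it);            // erase this single entry, not all with the key
        break;
      }
    }
  }

  // Error scenario: the scan is done but entries remain in the hash =>
  // source/replica inconsistency (HA_ERR_KEY_NOT_FOUND in the real code).
  if (!hash.empty())
    std::printf("leftover entries: source/replica inconsistency\n");
  else
    std::printf("applied cleanly: rows are now %d and %d\n", table[0], table[1]);
  return 0;
}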

Coming to the corner case of multiple updates targeting the same row in a single rows log event, some work has already been done in that area, for example:

    https://github.com/mysql/mysql-server/commit/dfd8d7d89bfb7cbde7c0e45827bcb611dcb8f36e
    https://github.com/mysql/mysql-server/commit/af5bfba0e28cd6839ce4e902ca89b023244efca0
    https://github.com/mysql/mysql-server/commit/1d2ad39cf2e7d978ebdf162f6b31cd3a13cfcb6c

But they do not cover this exact case, where there are multiple entries in the hash with the same key (which means the before images of more than one write set match the content of the table record).
The relay log will contain 1 insert and 3 update write sets, as expected.

/*!*/;
### INSERT INTO `uniticket`.`t1`
### SET
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:53'
### UPDATE `uniticket`.`t1`
### WHERE
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:53'
### SET
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:54'
### UPDATE `uniticket`.`t1`
### WHERE
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:54'
### SET
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:53'
### UPDATE `uniticket`.`t1`
### WHERE
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:53'
### SET
###   @1='A'
###   @2='A'
###   @3='B'
###   @4='2021-12-04 16:42:55'

The internal implementation of the hash being used here is an unordered_multimap.

    The delete method makes sure to delete a single entry at a time instead of the default behaviour of deleting all the entries with the given key.
    But the get method of the hash class Hash_slave_rows does not guarantee that entries are fetched in the order they were inserted.

So, when the table scan starts, the table's record for the first (and only) row in the table matches the before image of the entry fetched from the hash. But since the hash table has more than one entry with the same before image, and therefore the same key, the wrong entry (corresponding to a different write set) can be fetched from the hash table. It is then deleted from the hash table and its changes are applied, which causes the remaining updates in the hash to NOT apply, because the before images of the remaining entries no longer match any row on the slave table.
This is the reason a failure with HA_ERR_KEY_NOT_FOUND is seen.
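
This order dependence can be reproduced outside the server with the same kind of toy model (again illustrative names and plain integers, not the real row format): the hash holds the three updates from the relay log dump above, two of them sharing a key because their before images are identical, and only one of the two possible consumption orders applies cleanly.

#include <cstdio>
#include <unordered_map>

// Same toy model: updates keyed by their before image (BI).  The replica
// table holds a single row with value 53, standing in for the row whose
// column d is '...16:42:53' in the relay log dump above.
static bool apply_all(bool take_largest_ai) {
  std::unordered_multimap<int, int> hash;  // key = BI, value = AI
  hash.emplace(53, 54);                    // update #1: 53 -> 54
  hash.emplace(54, 53);                    // update #2: 54 -> 53
  hash.emplace(53, 55);                    // update #3: 53 -> 55 (same BI/key as #1)
  int row = 53;

  while (!hash.empty()) {
    auto range = hash.equal_range(row);
    if (range.first == range.second)       // the row matches no remaining BI:
      return false;                        // entries are left over, the apply fails
    // Several entries share this key; which one the multimap hands out first
    // is unspecified, so force one choice or the other explicitly.
    auto pick = range.first;
    for (auto it = range.first; it != range.second; ++it)
      if (take_largest_ai ? it->second > pick->second : it->second < pick->second)
        pick = it;
    row = pick->second;                    // apply the after image
    hash.erase(pick);                      // erase exactly this one entry
  }
  return true;                             // every write set was applied
}

int main() {
  // Consuming update #1 first replays the source order and succeeds; consuming
  // update #3 first strands updates #1 and #2 in the hash, which is the
  // HA_ERR_KEY_NOT_FOUND failure seen on the replica.
  std::printf("pick 53->54 first: %s\n", apply_all(false) ? "ok" : "leftover entries");
  std::printf("pick 53->55 first: %s\n", apply_all(true) ? "ok" : "leftover entries");
  return 0;
}

Running this, the first call applies all three updates while the second reports leftover entries, mirroring the replica error above.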

The above behaviour can be verified using the following debug patch

diff --git a/sql/log_event.cc b/sql/log_event.cc
index 4a0322dd131..95f12bd628c 100644
--- a/sql/log_event.cc
+++ b/sql/log_event.cc
@@ -9337,6 +9337,7 @@ int Rows_log_event::do_hash_row(Relay_log_info const *rli) {
     key based on record[0] contents (including BLOB fields).
    */
   m_hash.put(m_table, &this->m_local_cols, entry);
+  rli->report(INFORMATION_LEVEL, 0, "debug: Added record to hash with key= %u values: bi_start = %p bi_ends = %p", entry->preamble->hash_value, entry->positions->bi_start, entry->positions->bi_ends);
 
   if (m_key_index < MAX_KEY) add_key_to_distinct_keyset();
 
@@ -9441,7 +9442,7 @@ int Rows_log_event::do_scan_and_update(Relay_log_info const *rli) {
         */
         do {
           store_record(table, record[1]);
-
+          rli->report(INFORMATION_LEVEL, 0, "debug: Fetched entry key = %u values: bi_start = %p bi_ends = %p", entry->preamble->hash_value, entry->positions->bi_start, entry->positions->bi_ends);
           /**
              If there are collisions we need to be sure that this is
              indeed the record we want.  Loop through all records for
[18 Dec 2023 13:39] Varun Nagaraju
A possible way to implement a solution for this issue is to transform/optimize the write-set information into less ambiguous write sets before storing them in the hash table, so that they can be applied cleanly without any conflicts/errors.
For example, consider the test case mysql-test/suite/rpl/t/rpl_row_multi_update_of_same_row.test, which was added as part of dfd8d7d89bfb7cbde7c0e45827bcb611dcb8f36e:

CREATE TABLE t1 (i INT); INSERT INTO `t1` VALUES (1), (2), (3), (4);

and suppose the stored procedure issues the following updates in a single rows log event:

UPDATE t1 SET i = 1 WHERE i = 4;
UPDATE t1 SET i = 2 WHERE i = 1;
UPDATE t1 SET i = 3 WHERE i = 2;
UPDATE t1 SET i = 2 WHERE i = 3;
UPDATE t1 SET i = 1 WHERE i = 2;
UPDATE t1 SET i = 4 WHERE i = 1;

The write sets in the rows log event would be something like this, and so would the hash table:

Write set (BI AI)
4 1
1 2
2 3
3 2
2 1
1 4

Here, the write sets can be optimized.
We start from the first element:
4 1 <- here
1 2
2 3
3 2
2 1
1 4

Then we move to the second element (1 2) and extract its Before Image (1 in this case)
4 1
1 2 <- here  (BI = 1)
2 3
3 2
2 1
1 4

and see that before <here> there was already an element whose AI is identical to the BI at <here> (1 in this case), which means that previous element (4 1) can be replaced with (4 2).
After that replacement we move to the next element:
4 2
1 2
2 3 <- here  (BI = 2)
3 2
2 1
1 4

If we continue the procedure, (4 2) will become (4 3) and (1 2) will become (1 3):
4 3
1 3
2 3
3 2 <- here  (BI = 3)
2 1
1 4

Then
4 2
1 2
2 2
3 2
2 1 <- here  (BI = 2)
1 4

(notice here that there is an element with identical BI and AI, which is OK for an intermediate step)
After that
4 1
1 1
2 1
3 1
2 1
1 4 <- here  (BI = 1)

And finally
4 4
1 4
2 4
3 4
2 4
1 4
done <-here 

We need to do a few more steps here
1. Eliminate elements with identical BI and AI -  (4 4) in this case
   1 4
   2 4
   3 4
   2 4
   1 4

2. Eliminate elements with identical (BI AI) pairs, keeping only one of each. Notice that after this transformation there should be no pairs with identical BI but different AI.
   1 4
   2 4
   3 4
and this is the final result.
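
A rough, self-contained sketch of this transformation on plain integer pairs (illustrative only, not the server's data structures), which reproduces the walkthrough above:

#include <cstdio>
#include <cstddef>
#include <vector>

struct WriteSet { int bi, ai; };  // before image, after image

int main() {
  // The write sets from the walkthrough above, in event order.
  std::vector<WriteSet> ws = {{4, 1}, {1, 2}, {2, 3}, {3, 2}, {2, 1}, {1, 4}};

  // Walk the list front to back.  For the element at <here>, every earlier
  // element whose AI equals this element's BI has that AI replaced by this
  // element's AI, i.e. the earlier write set is chained through the later update.
  for (std::size_t here = 1; here < ws.size(); ++here)
    for (std::size_t j = 0; j < here; ++j)
      if (ws[j].ai == ws[here].bi) ws[j].ai = ws[here].ai;

  // Step 1: drop elements whose BI and AI ended up identical (no-ops).
  // Step 2: drop duplicate (BI AI) pairs, keeping the earliest occurrence.
  std::vector<WriteSet> out;
  for (const WriteSet &w : ws) {
    if (w.bi == w.ai) continue;
    bool seen = false;
    for (const WriteSet &o : out)
      if (o.bi == w.bi && o.ai == w.ai) { seen = true; break; }
    if (!seen) out.push_back(w);
  }

  // Prints the final result from the walkthrough: 1 4 / 2 4 / 3 4.
  for (const WriteSet &w : out) std::printf("%d %d\n", w.bi, w.ai);
  return 0;
}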

One more thing to do while transforming the write sets is to mark the elements in the write set with an additional flag (original / modified). At the beginning everything is original. On each step, if we optimize an element, we mark it as modified. This is useful for cases like the following.

If we take the example from this bug report and simplify it a little bit, as follows

CREATE TABLE t1 (i INT); INSERT INTO `t1` VALUES (1);

and suppose the stored procedure issues the following updates in a single rows log event:

UPDATE t1 SET i = 2 WHERE i = 1;
UPDATE t1 SET i = 1 WHERE i = 2;
UPDATE t1 SET i = 3 WHERE i = 1;

The write sets in the rows log event would be something like this, and so would the hash table:
Write set (BI AI)
1 2
2 1
1 3

Here, before inserting entries into the hash table, the same transformation is applied:
1 2 (original) <-
2 1 (original)
1 3 (original)

1 1 (modified)
2 1 (original) <-
1 3 (original)

1 3 (modified)
2 3 (modified)
1 3 (original) <-

While removing duplicates, we now need to check the original / modified flag as well.
Basically, for all the records with an identical (BI AI) image, we need to keep only the earliest one and mark it as original only if all the other identical elements were also original. Otherwise (if at least one of them was modified), the remaining element needs to be marked modified.

In our example we remove the third element, 1 3 (original), and leave the first one, 1 3 (modified), marking it modified (which it already is).
1 3 (modified)
2 3 (modified)

Now, when applying this optimized write set, it should be considered OK (it should not trigger an error) to encounter an element whose BI does not correspond to any row on the replica, as long as that element is marked modified.
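
Putting the pieces together, here is a rough sketch of the flag-aware variant on the simplified example above (again plain integer pairs and illustrative names, not the actual server code): duplicates collapse to the earliest element, the survivor stays original only if every duplicate was original, and the apply step skips, rather than fails on, a modified element whose BI matches no row.

#include <cstdio>
#include <cstddef>
#include <vector>

struct WriteSet { int bi, ai; bool modified; };  // before image, after image, flag

int main() {
  // The simplified example above: the table holds a single row with value 1,
  // and the event carries 1 -> 2, 2 -> 1, 1 -> 3 (all original to begin with).
  std::vector<WriteSet> ws = {{1, 2, false}, {2, 1, false}, {1, 3, false}};

  // Chain earlier AIs through later updates; every element we rewrite
  // becomes "modified".
  for (std::size_t here = 1; here < ws.size(); ++here)
    for (std::size_t j = 0; j < here; ++j)
      if (ws[j].ai == ws[here].bi) { ws[j].ai = ws[here].ai; ws[j].modified = true; }

  // Collapse no-ops and duplicates: keep the earliest (BI AI) pair; the
  // survivor stays original only if every identical element was original.
  std::vector<WriteSet> out;
  for (const WriteSet &w : ws) {
    if (w.bi == w.ai) continue;
    bool seen = false;
    for (WriteSet &o : out)
      if (o.bi == w.bi && o.ai == w.ai) {
        o.modified = o.modified || w.modified;
        seen = true;
        break;
      }
    if (!seen) out.push_back(w);
  }

  // Apply against the single-row "table": a modified element whose BI matches
  // no row is skipped instead of raising HA_ERR_KEY_NOT_FOUND.
  int row = 1;
  for (const WriteSet &w : out) {
    if (row == w.bi)
      row = w.ai;
    else if (!w.modified) {
      std::printf("error: no row matches original BI %d\n", w.bi);
      return 1;
    }
  }

  for (const WriteSet &w : out)
    std::printf("%d %d (%s)\n", w.bi, w.ai, w.modified ? "modified" : "original");
  std::printf("row on the replica = %d\n", row);  // 3, same as on the source
  return 0;
}

In a real implementation the flag would have to travel with each hash entry, so that do_scan_and_update can tell an acceptable miss (modified) from a genuine source/replica inconsistency (original).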