Bug #114248 Incomplete query results when backward scan and concurrent page merge
Submitted: 6 Mar 16:22 Modified: 7 Mar 13:57
Reporter: Dmitry Lenev (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine  Severity: S3 (Non-critical)
Version: 8.0.36, 5.7.44  OS: Any
CPU Architecture: Any
Tags: Contribution

[6 Mar 16:22] Dmitry Lenev
A query over an InnoDB table that uses a backward scan over an index
occasionally returns incorrect/incomplete results when changes to the table
(for example, DELETEs in another or even the same connection, followed by
asynchronous purge) cause concurrent B-tree page merges.

Unfortunately, I was not able to build a fully deterministic test case for this, even using DEBUG_SYNC and the debug-only features of MySQL/InnoDB. However, I have a non-deterministic test case which fails for me every time.

The high-level idea behind the test case is that we have an InnoDB table with a PK whose records are constantly modified, causing page splits/merges. Concurrently we run a query that uses a backward scan on the PK and sometimes returns incomplete results.

A more detailed description follows.

We have a table with the following definition:

CREATE TABLE testTable (
  `key` INT UNSIGNED,
  version BIGINT UNSIGNED,
  rowmarker CHAR(3) NOT NULL DEFAULT 'aaa',
  value MEDIUMBLOB,
  PRIMARY KEY (`key`, version)
) ENGINE=InnoDB;

And two groups of records in this table:

1) Contains one record with key == 0 and one with key == 2, which are constantly modified in two concurrent connections by the stored procedures lft() and rght(), which only change the contents of the value field to a random value.
2) Contains several records with key == 1 and different 'version' values (so this group sits in the B-tree between the records of the 1st group).
We have a separate connection running the mdl() procedure, which constantly inserts records into this group with an ever-decreasing version value. Occasionally (once in 20 inserts) it deletes all records from the group except the last one inserted (which causes a page merge).
3) We have another connection which runs the ck() procedure, which checks the consistency of the second group. Basically, it runs a SELECT which uses a backward scan (and not a PK lookup!) on the PK to retrieve records from the 2nd group, and records which versions have been seen already. The idea is that if this SELECT returns a record with a certain version value at some point, then subsequent SELECTs should see at least the records with this or smaller version values.
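The check in 3) can be sketched roughly as follows. This is a hypothetical reconstruction, not the attached test: the `stopper` table's role is inferred from the `DELETE FROM stopper` in the result diff, and the loop condition, expressions, and the way the backward scan is obtained are assumptions.

```sql
-- Hypothetical sketch of the ck() consistency check (not the attached test).
CREATE PROCEDURE ck()
BEGIN
  DECLARE global_max_seen BIGINT UNSIGNED DEFAULT 0;
  DECLARE local_max_seen BIGINT UNSIGNED;
  -- Keep checking while the stopper table is non-empty (assumption).
  WHILE (SELECT COUNT(*) FROM stopper) > 0 DO
    -- The derived table scans the PK backwards (no point lookup);
    -- stored versions are mapped back to their increasing counterparts.
    SELECT IFNULL(MAX(18446744073709551615 - version + 1), 0)
      INTO local_max_seen
      FROM (SELECT version FROM testTable
             WHERE `key` = 1
             ORDER BY `key` DESC, version DESC) AS grp;
    IF local_max_seen < global_max_seen THEN
      SELECT 'ERROR! local_max_seen < global_max_seen!' AS msg,
             local_max_seen, global_max_seen;
    END IF;
    SET global_max_seen = GREATEST(global_max_seen, local_max_seen);
  END WHILE;
END
```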

However, this last assumption doesn't hold in 8.0.36 and 5.7.44: occasionally the SELECT returns only records with version values bigger than those we have already seen, or even an empty result set.

How to repeat:
Run the attached test case under the MTR framework and observe something like the following:

./mtr --mem innodb_backward_scan_conc_merge
Logging: .../mysql-test/mysql-test-run.pl  --mem innodb_backward_scan_conc_merge
MySQL Version 8.0.36
Checking supported features
 - Binaries are debug compiled
Using 'all' suites
Collecting tests
Checking leftover processes
Removing old var directory
Creating var directory '.../mysql-test/var'
 - symlinking 'var' to '/dev/shm/var_auto_4lYa'
Installing system database
Using parallel: 1

                  TEST NAME                       RESULT  TIME (ms) COMMENT
[ 50%] main.innodb_backward_scan_conc_merge      [ fail ]
        Test ended at 2024-03-06 17:11:28

CURRENT_TEST: main.innodb_backward_scan_conc_merge
--- .../mysql-test/r/innodb_backward_scan_conc_merge.result	2024-03-06 18:27:03.633176272 +0300
+++ .../mysql-test/var/log/innodb_backward_scan_conc_merge.reject	2024-03-06 19:11:27.848594206 +0300
@@ -67,6 +67,8 @@
 # Check procedure is not supposed to return ERROR.
 CALL ck();
+msg	local_max_seen	global_max_seen
+ERROR! local_max_seen < global_max_seen!	64	65
 # Stop activity in other connections
 DELETE FROM stopper;

Note that internally the test generates version numbers as increasing small integers and then converts them to always-decreasing versions by computing "2^64 - increasing_version", hence the error message.
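This mapping can be illustrated with a couple of lines (plain arithmetic only; the helper name is made up for illustration):

```python
# The test maps an increasing counter i to a decreasing 64-bit version
# as 2^64 - i, so later inserts get strictly smaller stored versions
# and sort first under the ascending (key, version) PK order.
TWO_POW_64 = 2 ** 64

def to_decreasing_version(i: int) -> int:  # hypothetical helper
    return TWO_POW_64 - i

# Later inserts (larger i) produce smaller stored version values:
assert to_decreasing_version(65) < to_decreasing_version(64)
# Recovering the increasing counter from a stored version:
assert TWO_POW_64 - to_decreasing_version(64) == 64
```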

Suggested fix:
Investigation shows that the problem occurs when the persistent cursor used
to scan the index in the backward direction stops on the infimum record of the
page it currently points to and releases all latches it holds before moving to
the previous page.
At this point a merge from the previous page into the cursor's current one can
happen (because the cursor doesn't hold a latch on the current or previous
page). During this merge, records from the previous page are moved over the
infimum record and placed before any old user records in the current page.
When our persistent cursor later resumes its iteration, it might use the
optimistic approach to cursor restoration, which doesn't detect this kind of
page update, and resume the iteration right from the infimum record,
effectively skipping the moved records.

One possible approach to solving this problem is to increase the modify_clock counter of the page being merged into in this case, ensuring that the pessimistic approach to cursor restoration, which is not affected by this problem, is used instead.
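Sketched in pseudocode, the idea could look like this. This is a sketch of the suggested direction, not the submitted patch; `buf_block_modify_clock_inc()` is InnoDB's existing helper for bumping a block's modify_clock, but the merge and restore paths are heavily simplified here:

```
merge_from_prev_page_into(cursor_page):
    move user records of prev_page over the infimum record of cursor_page,
        placing them before cursor_page's old user records
    buf_block_modify_clock_inc(cursor_page)  # invalidate optimistic restores
    free prev_page

restore_persistent_cursor(pcur):
    if pcur.saved_modify_clock == modify_clock of pcur's page:
        resume optimistically at the saved position (the infimum)
        # with the bump above, this branch is no longer taken after a merge
    else:
        restore pessimistically: re-search the tree for the last seen
        user record, so the moved records are not skipped
```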
[6 Mar 16:23] Dmitry Lenev
Non-deterministic test case showing the problem.

Attachment: innodb_backward_scan_conc_merge.test (application/octet-stream, text), 3.13 KiB.

[6 Mar 16:24] Dmitry Lenev
Result file for test case.

Attachment: innodb_backward_scan_conc_merge.result (application/octet-stream, text), 2.57 KiB.

[6 Mar 16:29] MySQL Verification Team
Hello Dmitry,

Thank you for the report and testcase.

[7 Mar 13:55] Dmitry Lenev
Minimal tentative fix for the problem.

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: ps-9092-bug114248-contrib.patch (text/x-patch), 11.30 KiB.

[7 Mar 13:57] Dmitry Lenev

I have submitted a patch implementing the idea from the "Suggested fix" section for the Oracle MySQL team's consideration. The patch is fairly minimal and only tries to address the issue we have observed, so it is probably worth investigating whether there are other similar issues or alternative approaches to the fix.
[7 Mar 14:05] MySQL Verification Team
Thank you, Dmitry, for the contribution.