Description:
Query over InnoDB table that uses backward scan over the index occasionally
might return incorrect/incomplete results when changes to table (for example,
DELETEs in other or even the same connection followed by asynchronous purge)
cause concurrent B-tree page merges.
Unfortunately, I was not successful with building fully deterministic test case for this even using DEBUG_SYNC and debug-only features of MySQL/InnoDB. However, I have nondeterministic test case which fails for me all the time.
The high level idea behind the test case is that we have InnoDB table with PK which records are constantly modified causing page splits/merges. Concurrently we run query which uses backward scan on the PK and sometimes returns incomplete results.
A bit more detailed description.
--------------------------------
We have table with following definition:
CREATE TABLE testTable (`key` INT UNSIGNED, version BIGINT UNSIGNED, rowmarker CHAR(3) NOT NULL DEFAULT 'aaa', value MEDIUMBLOB, PRIMARY KEY (`key`, version)) ENGINE=InnoDB;
And two groups of records in this table.
1) Contains one record with key == 0 and one key == 2, which are constantly modified in two concurrent connections by stored procedures lft() and rght(), which only change contents of the value field to random value.
2) Contains several records with key == 1 with different 'version' values (so this group sits in B-tree between record in the 1)st group).
We have separate connection running mdl() procedure which constantly inserts records into this group with always decreasing version value. Occasionally (once in 20 inserts) it deletes all records from the group except the last one inserted (which causes page merge).
3) We have another connection which runs ck() procedure which checks consistency of this second group. Basically it runs SELECT which uses backward scan (and not PK lookup!) on PK to retrieve records from 2nd group and records what versions we have seen already. The idea is that if this SELECT returns record with certain value of version at some point, then next SELECTs should at least see records with this or smaller versions values.
However, this last assumption doesn't hold in 8.0.36 and 5.7.44, occasionally SELECT returns only records with versions values bigger than we have seen already or even empty result set.
How to repeat:
Run the attached test case for MTR framework and observe something like:
./mtr --mem innodb_backward_scan_conc_merge
Logging: .../mysql-test/mysql-test-run.pl --mem innodb_backward_scan_conc_merge
MySQL Version 8.0.36
Checking supported features
- Binaries are debug compiled
Using 'all' suites
Collecting tests
Checking leftover processes
Removing old var directory
Creating var directory '.../mysql-test/var'
- symlinking 'var' to '/dev/shm/var_auto_4lYa'
Installing system database
Using parallel: 1
==============================================================================
TEST NAME RESULT TIME (ms) COMMENT
------------------------------------------------------------------------------
[ 50%] main.innodb_backward_scan_conc_merge [ fail ]
Test ended at 2024-03-06 17:11:28
CURRENT_TEST: main.innodb_backward_scan_conc_merge
--- .../mysql-test/r/innodb_backward_scan_conc_merge.result 2024-03-06 18:27:03.633176272 +0300
+++ .../mysql-test/var/log/innodb_backward_scan_conc_merge.reject 2024-03-06 19:11:27.848594206 +0300
@@ -67,6 +67,8 @@
#
# Check procedure is not supposed to return ERROR.
CALL ck();
+msg local_max_seen global_max_seen
+ERROR! local_max_seen < global_max_seen! 64 65
#
# Stop activity in other connections
DELETE FROM stopper;
Note that internally test generates version numbers as increasing small integers and then converts them to always decreasing version by doing "2^64 - increasing_version", hence the error message.
Suggested fix:
Investigation shows that the problem occurs when persistent cursor which is used
to scan over index in backwards direction stops on infimum record of the page to which it points currently and releases all latches it has, before moving to the previous page.
At this point merge from the previous page to cursor's current one can happen
(because cursor doesn't hold latch on current or previous page). During this
merge records from the previous page are moved over infimum record and placed
before any old user records in the current page. When later our persistent
cursor resumes its iteration it might use optimistic approach to cursor
restoration which won't detect this kind of page update and resumes the
iteration right from infimum record, effectively skipping the moved records.
One possible approach to solve this problem is to increase modify_clock counter for the page being merged into in this case to ensure that pessimistic approach to cursor restoration, which is not affected by this problem, is used in this case.