MySQL Bugs: #117218: Read a value that should be deleted In RR

Bug #117218	Read a value that should be deleted In RR
Submitted:	16 Jan 10:26	Modified:	28 Jan 10:29
Reporter:	Huicong Xu	Email Updates:
Status:	Won't fix	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	9.1.0	OS:	Ubuntu (22.04)
Assigned to:		CPU Architecture:	Any
Tags:	transaction

Description:
When I tested Repeatable Read isolation levels on Mariadb, I found a like lost update exception.
In a transaction, I used a table-wide delete statement, but was still able to read data from the table, which should have been deleted. If I use this result to modify the data, it will result in a serious error.
A simplified test case with result is as follows.
In the last query of session 2, I should see an empty set.Therefore, it seems to me that this is a logical bug for the isolation level.

--- session 0 ---
CREATE TABLE t0(c0 INT primary key);
INSERT INTO t0 VALUES (1);
INSERT INTO t0 VALUES (-1);
 
--- session 1 ---
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT * FROM t0;
+----+
| c0 |
+----+
| -1 |
|  1 |
+----+
2 rows in set (0.00 sec)
--- session 2 ---
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN; 
SELECT * FROM t0;
+----+
| c0 |
+----+
| -1 |
|  1 |
+----+
2 rows in set (0.00 sec)
--- session 1 ---
UPDATE t0 SET c0 = 2 where c0 = 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
SELECT * FROM t0;
+----+
| c0 |
+----+
| -1 |
|  2 |
+----+
2 rows in set (0.00 sec)
COMMIT;
--- session 2 ---
DELETE FROM t0;
Query OK, 2 rows affected (0.00 sec)
SELECT * FROM t0;
+----+
| c0 |
+----+
|  1 |
+----+
1 row in set (0.00 sec)
COMMIT;

Expect to see:
..
--- session 2 ---
DELETE FROM t0;
Query OK, 2 rows affected (0.00 sec)
SELECT * FROM t0;
Empty set (0.00 sec)

How to repeat:
--- session 0 ---
CREATE TABLE t0(c0 INT primary key);
INSERT INTO t0 VALUES (1);
INSERT INTO t0 VALUES (-1);
--- session 1 ---
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN;
SELECT * FROM t0;
--- session 2 ---
SET SESSION TRANSACTION ISOLATION LEVEL REPEATABLE READ;
BEGIN; 
SELECT * FROM t0;
--- session 1 ---
UPDATE t0 SET c0 = 2 where c0 = 1;
Query OK, 1 row affected (0.00 sec)
Rows matched: 1  Changed: 1  Warnings: 0
SELECT * FROM t0;
COMMIT;
--- session 2 ---
DELETE FROM t0;
Query OK, 2 rows affected (0.00 sec)
SELECT * FROM t0;
COMMIT;

Suggested fix:
This seems to be a problem caused by the index.

Thank you for the report.

Posted by developer:
 
I know it hurts, and I know it is counter-intuitive behaviour, but it is "not a bug", it works as intended.
Here's why.

First, notice that UPDATE of primary key value is mostly equivalent to a pair of DELETE and INSERT, or at least that's what happening at lowest level - we have to delete the row from one place in the B-tree and insert another one in another place. To support MVCC and rollback, we do not physically remove the old one right away, instead we delete-mark, and install a rollbackptr pointing to the Undo Log Record explaining that it was delete-marked, so if you want to see the older version of the db you have to "undeletemark it".

Second, notice that REPEATABLE READ is defined to behave differently for 
- locking operations (such as DELETE FROM t0) where it has to behave like SERIALIZABLE, i.e. take locks on records and gaps, and present the most recently committed version of the db
- non-locking operations (such as SELECT * FROM t0) where it has to present the state of the db as of the moment when the read-view was created PLUS its own changes

...and the scenario exercises all of that at once.

So, what happens is:
1. Trx2 establishes a read-view. This read view doesn't see effects of Trx1. But must see effects of Trx2 which owns the view. At this moment the state of the db is {-1, 1}
2. Trx1 is delete-marking the row with c0==1, and inserting a new one with c0==2.
3. Trx1 commits.
Note: if Trx2 executed SELECT * FROM t0 now, it would still see {-1,1} thanks to the undo log chain hanging off the delete-marked record c0==1 which says how to undelete it, and undo log chain hanging off c0==2 which says how to "uninsert" it.
Note: if Trx2 executed SELECT * FROM t0 FOR SHARE, it would be a locking select which uses SERIALIZABLE-like behaviour, so would show {-1,2} instead, which is the most recent state of db, as Trx2 has already committed. This access method doesn't care about undo logs, and instead tries to lock each row and gap in the scanned range, and report those which match the query and are not delete-marked.
4. Trx2 does DELETE FROM t0 which deletes two rows: c0==-1 and c0==2.
Note: it does not delete the one with c0==1 because it was already deleted by Trx1.
Note: it has no choice than to delete c0==-1 and c0==2, because these are the rows which are (conceptually) in the most recent state of the db, and DELETE must operate on what is there, not on some imaginary "read-view from the past". One can't change the past.
Note: But, this means the db physical state right now is something like this:
c0==-1: delete-marked by Trx2
c0==1 : delete-marked by Trx1
c0==2 : delete-marked by Trx2
5. Trx2 does SELECT * FROM t0. And this is the funny case where it has to combine "state of db at read-view creation" (which was {-1,1}) with "changes made by Trx2 itself" (which was to delete {-1,2}).
So, the only logical result is: show just the row c0==1, as {-1,1} \ {-1,2} == {1}.

Is it surprising to users? Sure.
But this is what it is. And it is somewhat documented in https://dev.mysql.com/doc/refman/9.2/en/innodb-transaction-isolation-levels.html :
>  It is not recommended to mix locking statements (UPDATE, INSERT, DELETE, or SELECT ... FOR ...) with non-locking SELECT statements in a single REPEATABLE READ transaction, because typically in such cases you want SERIALIZABLE. This is because a non-locking SELECT statement presents the state of the database from a read view which consists of transactions committed before the read view was created, and before the current transaction's own writes, while the locking statements use the most recent state of the database to use locking. In general, these two different table states are inconsistent with each other and difficult to parse.

(arguably it could be improved/corrected, as "and before the current transaction's own writes" fragment is wrong/misleading, as the read-view is defined to include trx's own writes. Still the advice is correct)
and in https://dev.mysql.com/doc/refman/9.2/en/innodb-consistent-read.html
> A consistent read means that InnoDB uses multi-versioning to present to a query a snapshot of the database at a point in time. The query sees the changes made by transactions that committed before that point in time, and no changes made by later or uncommitted transactions. The exception to this rule is that the query sees the changes made by earlier statements within the same transaction. This exception causes the following anomaly: If you update some rows in a table, a SELECT sees the latest version of the updated rows, but it might also see older versions of any rows. If other sessions simultaneously update the same table, the anomaly means that you might see the table in a state that never existed in the database. 

My suggestion to everyone using InnoDB:
If you plan to mix reads and writes in one transaction, and expect the results to be interpretable by human as a result of serial execution of transactions, then use SERIALIZABLE!
If you don't mix the two, then REPEATABLE READ is great.