MySQL Bugs: #113468: The lock for unmatched record should be released but not in READ-COMMITTED

Bug #113468	The lock for unmatched record should be released but not in READ-COMMITTED
Submitted:	20 Dec 2023 4:07	Modified:	20 Dec 2023 13:52
Reporter:	shengchun cao	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	lock

Description:
Transaction isolation: READ-COMMITTED.

A transaction (called trx 1) want to get lock on a record (called R1), but another transaction (trx 2) is holding the lock.
After trx 2 released the lock, trx 1 got the lock.
When flow of execution returned to server layer, the record R1 was filtered by where conditions.
And then, the lock on record R1 should be released, but it's not released.

How to repeat:
step 1, Create table:
CREATE TABLE `t2` (
  `id` int unsigned NOT NULL AUTO_INCREMENT,
  `i1` int DEFAULT '0',
  `i2` int DEFAULT '0',
  PRIMARY KEY (`id`) USING BTREE,
  KEY `idx_i1` (`i1`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb3;

Insert records:
INSERT INTO `t2` (`id`, `i1`, `i2`) VALUES (1, 11, 21),(2, 12, 22),(3, 13, 23),(4, 14, 24),(5, 15, 25),(6, 16, 26);

step 2, Login to MySQL:
After first login, we got a session, which called [session 1].
After second login, we got another session, which called [session 2].

step 3, Execute SQLs in [session 1]:
begin;
select * from t2 where i1 = 15 for update

step 4, Execute SQLs in [session 2]:
begin;
select * from t2 where i1 = 15 and i2 = 20 for share;

After execute select SQL, [session 2] has blocked to wait for lock on record i1 = 15.

step 5, Rollback in [session 1]:
rollback

After rollback, [session 2] has got lock on record i1 = 15.

step 6, We resue [session 1] to get locks in MySQL, execute SQL:
select * from performance_schema.data_locks;

The result is:
+--------+-------------------------------+-----------------------+-----------+----------+---------------+-------------+----------------+-------------------+------------+-----------------------+-----------+---------------+-------------+-----------+
| ENGINE | ENGINE_LOCK_ID                | ENGINE_TRANSACTION_ID | THREAD_ID | EVENT_ID | OBJECT_SCHEMA | OBJECT_NAME | PARTITION_NAME | SUBPARTITION_NAME | INDEX_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_MODE     | LOCK_STATUS | LOCK_DATA |
+--------+-------------------------------+-----------------------+-----------+----------+---------------+-------------+----------------+-------------------+------------+-----------------------+-----------+---------------+-------------+-----------+
| INNODB | 5284467592:1340:4568673472    | 281480261178248       | 52        | 53       | test          | t2          | <null>         | <null>            | <null>     | 4568673472            | TABLE     | IS            | GRANTED     | <null>    |
| INNODB | 5284467592:166:5:6:5142269984 | 281480261178248       | 52        | 53       | test          | t2          | <null>         | <null>            | idx_i1     | 5142269984            | RECORD    | S,REC_NOT_GAP | GRANTED     | 15, 5     |
+--------+-------------------------------+-----------------------+-----------+----------+---------------+-------------+----------------+-------------------+------------+-----------------------+-----------+---------------+-------------+-----------+

Tips, the records of t2 as follows:
+----+----+----+
| id | i1 | i2 |
+----+----+----+
| 1  | 11 | 21 |
| 2  | 12 | 22 |
| 3  | 13 | 23 |
| 4  | 14 | 24 |
| 5  | 15 | 25 |
| 6  | 16 | 26 |
+----+----+----+

We can see that: 
There isn't a record for i1 = 15 and i2 = 20.
InnoDB locks the record i1 = 15, server layer will filter the record the lock.
However trx of [session 2] holds the lock on record i1 = 15 still.

As a contrast, we will execute the same SQLs with [session 2] in [session 3].

step 7, Login to MySQL:
After login, we got the third session which called [session 3].

step 8, Execute SQLs in [session 3]:
begin;
select * from t2 where i1 = 15 and i2 = 20 for share;

The trx of [session 3] gets the lock immediately.

step 9, We resue [session 1] to get locks in MySQL, execute SQL:
select * from performance_schema.data_locks;

The result is:
+--------+-------------------------------+-----------------------+-----------+----------+---------------+-------------+----------------+-------------------+------------+-----------------------+-----------+---------------+-------------+-----------+
| ENGINE | ENGINE_LOCK_ID                | ENGINE_TRANSACTION_ID | THREAD_ID | EVENT_ID | OBJECT_SCHEMA | OBJECT_NAME | PARTITION_NAME | SUBPARTITION_NAME | INDEX_NAME | OBJECT_INSTANCE_BEGIN | LOCK_TYPE | LOCK_MODE     | LOCK_STATUS | LOCK_DATA |
+--------+-------------------------------+-----------------------+-----------+----------+---------------+-------------+----------------+-------------------+------------+-----------------------+-----------+---------------+-------------+-----------+
| INNODB | 5284467592:1340:4568673472    | 281480261178248       | 52        | 53       | test          | t2          | <null>         | <null>            | <null>     | 4568673472            | TABLE     | IS            | GRANTED     | <null>    |
| INNODB | 5284467592:166:5:6:5142269984 | 281480261178248       | 52        | 53       | test          | t2          | <null>         | <null>            | idx_i1     | 5142269984            | RECORD    | S,REC_NOT_GAP | GRANTED     | 15, 5     |
| INNODB | 5284468584:1340:4568675520    | 281480261179240       | 58        | 6        | test          | t2          | <null>         | <null>            | <null>     | 4568675520            | TABLE     | IS            | GRANTED     | <null>    |
+--------+-------------------------------+-----------------------+-----------+----------+---------------+-------------+----------------+-------------------+------------+-----------------------+-----------+---------------+-------------+-----------+

We can see that: the lock on record i1 = 15 for trx of [session 3] has been released.

Suggested fix:
Code of row_search_mvcc(), line 5199 ~ 5212 as follows:
    case DB_SUCCESS_LOCKED_REC:
        if (trx->releases_non_matching_rows()) {
          /* Note that a record of
          prebuilt->index was locked. */
          ut_ad(!prebuilt->new_rec_lock[row_prebuilt_t::LOCK_PCUR]);
          prebuilt->new_rec_lock[row_prebuilt_t::LOCK_PCUR] = true;
        }
        err = DB_SUCCESS;
        [[fallthrough]];
      case DB_SUCCESS:
        if (row_to_range_relation.row_must_be_at_end) {
          prebuilt->m_stop_tuple_found = true;
        }
        break;

Modified like this:
    case DB_SUCCESS_LOCKED_REC:
        err = DB_SUCCESS;
        [[fallthrough]];
      case DB_SUCCESS:
        if (trx->releases_non_matching_rows()) {
          /* Note that a record of
          prebuilt->index was locked. */
          ut_ad(!prebuilt->new_rec_lock[row_prebuilt_t::LOCK_PCUR]);
          prebuilt->new_rec_lock[row_prebuilt_t::LOCK_PCUR] = true;
        }
        if (row_to_range_relation.row_must_be_at_end) {
          prebuilt->m_stop_tuple_found = true;
        }
        break;

Hi Mr. cao,

Thank you very much for your bug report.

We were able to repeat it with 8.0.35.

This is now a verified bug in 8.0 and 8.2.

Hi, nice analysis!
While I see how the patch you proposed solves your particular use case, it is unfortunately introducing a severe correctness bug.
The reason why we distinguish DB_SUCCESS_LOCKED_REC from DB_SUCCESS is because we need to be sure that a new lock request was granted to satisfy this particular read of the row from this particular scan as opposed to a situation where the current transaction was already granted this lock request earlier, say because of some earlier statement.
Consider a transaction which is similar to yours, but additionally does a select for *existing* row before asking for *non-exiting* row:
```
  set transaction isolation level read committed;
  BEGIN;
  select * from t2 where i1 = 15 and i2 = 25 for share;
  select * from t2 where i1 = 15 and i2 = 20 for share;
```
In such case, the first SELECT might obtain the S lock on <i1=15,id=5>, and then the later SELECT will release it!
That would be a bug, because the transaction should keep the S lock on the row until it commits, as it was needed for the first SELECT.

For this fix to be correct, I think we would have to somehow ensure that it only applies to the case where previously we had DB_LOCK_WAIT.
So, one tempting thing to do would be to add something like:
```
      case DB_LOCK_WAIT:
        if (trx->releases_non_matching_rows()) {
          prebuilt->new_rec_lock[row_prebuilt_t::LOCK_PCUR] = true;
        }
```
however, please note that this case, as of today, has a comment warning devs to not do that:
```
      case DB_LOCK_WAIT:
        /* Lock wait for R-tree should already
        be handled in sel_set_rtr_rec_lock() */
        ut_ad(!dict_index_is_spatial(index));
        /* Never unlock rows that were part of a conflict. */
        prebuilt->new_rec_lock.reset();
        ut_a(!use_semi_consistent);
        goto lock_wait_or_error;
```
It even tries to clear prebuilt->new_rec_lock.

This comment was added 13 years ago in a fix for 
"Bug#53674: InnoDB: Error: unlock row could not find a 4 mode lock on the record"
I don't know what was the justification for this particular rule.
I can't find any discussion of it neither in the bug:
https://bugs.mysql.com/bug.php?id=53674
nor in the review board:
https://rb.mysql.oraclecorp.com/rb/r/368/
What I do know, though, is that the code in this area was buggy, and confused. And for DB_LOCK_WAIT it used to check for ROW_READ_TRY_SEMI_CONSISTENT flag and in such case it released the lock manually, in particular this is how the code looked like in 2022 before commit d59ea3342252:
```
      case DB_LOCK_WAIT:
        /* Lock wait for R-tree should already
        be handled in sel_set_rtr_rec_lock() */
        ut_ad(!dict_index_is_spatial(index));
        /* Never unlock rows that were part of a conflict. */
        prebuilt->new_rec_lock.reset();

        if (UNIV_LIKELY(prebuilt->row_read_type !=
                        ROW_READ_TRY_SEMI_CONSISTENT) ||
            unique_search || index != clust_index) {
          goto lock_wait_or_error;
        }
        ut_a(trx->allow_semi_consistent());
        /* The following call returns 'offsets' associated with 'old_vers' */
        row_sel_build_committed_vers_for_mysql(
            clust_index, prebuilt, rec, &offsets, &heap, &old_vers,
            need_vrow ? &vrow : nullptr, &mtr);

        /* Check whether it was a deadlock or not, if not
        a deadlock and the transaction had to wait then
        release the lock it is waiting on. */

        DEBUG_SYNC_C("semi_consistent_read_would_wait");
        err = lock_trx_handle_wait(trx);

        switch (err) {
          case DB_SUCCESS:
            /* The lock was granted while we were
            searching for the last committed version.
            Do a normal locking read. */

            offsets = rec_get_offsets(rec, index, offsets, ULINT_UNDEFINED,
                                      UT_LOCATION_HERE, &heap);
            goto locks_ok;
          case DB_DEADLOCK:
            goto lock_wait_or_error;
          case DB_LOCK_WAIT:
            ut_ad(!dict_index_is_spatial(index));
            err = DB_SUCCESS;
            break;
          default:
            ut_error;
        }

        if (old_vers == nullptr) {
          /* The row was not yet committed */

          goto next_rec;
        }

        did_semi_consistent_read = true;
        rec = old_vers;
        prev_rec = rec;
        ut_d(prev_rec_debug = row_search_debug_copy_rec_order_prefix(
                 pcur, index, prev_rec, &prev_rec_debug_n_fields,
                 &prev_rec_debug_buf, &prev_rec_debug_buf_size));
        break;
```
In particular note the call to lock_trx_handle_wait(trx) which basically cancels the waiting lock.
So, perhaps we cleared the new_rec_lock array, because we knew we will remove the lock by using lock_trx_handle_wait(trx)?
But then in d59ea3342252 I've simplified this, to simply use SELECT_SKIP_LOCKED, so that the wait never really happens for semi-consistent reads.
So, perhaps, today, we could remove the:
-        /* Never unlock rows that were part of a conflict. */
-        prebuilt->new_rec_lock.reset();
?
I don't know. 
This would require a quite long investigation. The area of semi-consistent read, and this related (but separate) mechanism of releasing locks is a very complicated and it is very easy to introduce a bug here.
Also, note that in your scenario the use_semi_consistent evaluates to false (because row_read_type is not ROW_READ_TRY_SEMI_CONSISTENT, and also because index is notclust_index).
For example, note this cautionary comment:
```
    if (!same_user_rec && trx->releases_non_matching_rows()) {
      /* Since we were not able to restore the cursor
      on the same user record, we cannot use
      row_prebuilt_t::try_unlock() to unlock any records, and
      we must thus reset the new rec lock info. Since
      in lock0lock.cc we have blocked the inheriting of gap
      X-locks, we actually do not have any new record locks
      set in this case.

      Note that if we were able to restore on the 'same'
      user record, it is still possible that we were actually
      waiting on a delete-marked record, and meanwhile
      it was removed by purge and inserted again by some
      other user. But that is no problem, because in
      rec_loop we will again try to set a lock, and
      new_rec_lock_info in trx will be right at the end. */

      prebuilt->new_rec_lock.reset();
    }
```
This suggests that one has to take care of issues like "lock inheritance", delete-marked records, etc.
The way it is problematic is that by the time the transaction thread wakes up, it might turn out that the lock it got acquired is actually on a delete-marked record, or moved to gap, or removed by purge, or something (I am not saying that any of this options are *really* possible, only that in general they need to be considered and tideously proven not to be a problem)

All I can say now, is that your patch is wrong, and figuring out a correct patch is very difficult.

Thank you, Jacub .......