Description:
In our recent tests, we found that when a large number of undo records pile up in the system, write performance drops significantly while queries traverse those history records through an old read view (or while the purge system checks whether delete-marked secondary index records are still needed).
An overly long undo chain understandably has a direct impact on query performance, but there is no reason it should also have a significant impact on writes.
I understand that one of the core functions of the MVCC mechanism is to resolve read-write conflicts. In this scenario, however, "reading too many history records" blocks writes to the entire page for an extended period.
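To make the blocking concrete, here is a minimal toy model (plain C++17, nothing to do with the InnoDB sources): a reader holds a shared "page latch" while it walks a long version chain, and a writer that needs the exclusive latch has to wait the whole time, even though under MVCC the read and the write do not conflict logically.
```
#include <chrono>
#include <iostream>
#include <shared_mutex>
#include <thread>

int main() {
  std::shared_mutex page_latch;  // stands in for the clustered index leaf latch

  std::thread reader([&] {
    std::shared_lock<std::shared_mutex> s(page_latch);
    // Simulate traversing thousands of undo records while the latch is held.
    std::this_thread::sleep_for(std::chrono::seconds(2));
  });

  // Give the reader time to acquire the latch first.
  std::this_thread::sleep_for(std::chrono::milliseconds(100));

  auto t0 = std::chrono::steady_clock::now();
  {
    std::unique_lock<std::shared_mutex> x(page_latch);  // the blocked "write"
  }
  auto waited_ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                       std::chrono::steady_clock::now() - t0)
                       .count();
  std::cout << "writer waited " << waited_ms << " ms\n";  // roughly 1900 ms

  reader.join();
  return 0;
}
```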
How to repeat:
Run the following write load (execute the command first with prepare, then with run):
```
/usr/bin/sysbench oltp_update_non_index --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-db=test --tables=1 --table-size=5 --report-interval=1 --percentile=99 --max-time=3000000 --threads=32 --rand-type=uniform prepare/run
```
After that, run the following read load, driven by the custom script long-undo-chain-scan.lua:
```
#!/usr/bin/env sysbench
-- Copyright (C) 2006-2017 Alexey Kopytov <akopytov@gmail.com>
-- This program is free software; you can redistribute it and/or modify
-- it under the terms of the GNU General Public License as published by
-- the Free Software Foundation; either version 2 of the License, or
-- (at your option) any later version.
-- This program is distributed in the hope that it will be useful,
-- but WITHOUT ANY WARRANTY; without even the implied warranty of
-- MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
-- GNU General Public License for more details.
-- You should have received a copy of the GNU General Public License
-- along with this program; if not, write to the Free Software
-- Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA
-- ----------------------------------------------------------------------
-- Long undo chain scan benchmark (derived from the insert-only OLTP script)
-- ----------------------------------------------------------------------
require("oltp_common")
sysbench.cmdline.commands.prepare = {
   function ()
      if (not sysbench.opt.auto_inc) then
         -- Create empty tables on prepare when --auto-inc is off, since IDs
         -- generated on prepare may collide later with values generated by
         -- sysbench.rand.unique()
         sysbench.opt.table_size=0
      end
      cmd_prepare()
   end,
   sysbench.cmdline.PARALLEL_COMMAND
}

function prepare_statements()
   -- We do not use prepared statements here, but oltp_common.sh expects this
   -- function to be defined
end

function event()
   local table_name = "sbtest" .. sysbench.rand.uniform(1, sysbench.opt.tables)
   con:query("SET TRANSACTION ISOLATION LEVEL REPEATABLE READ;")
   con:query("begin")
   con:query(string.format("select sum(id) from %s force index(primary)", table_name))
   con:query("select sleep(30)") -- a long-running query that pins the read view
   for i = 1, 100 do
      con:query(string.format("select sum(id) from %s force index(primary)", table_name))
   end
   con:query("commit")
end
```
```
/usr/bin/sysbench long-undo-chain-scan --mysql-socket=/tmp/mysql.sock --mysql-user=root --mysql-db=test --tables=1 --table-size=1000 --report-interval=1 --percentile=99 --max-time=3000000 --threads=1 --range_size=1000 --rand-type=uniform run
```
long-undo-chain-scan.lua simulates a slow query that lasts 30 seconds, so that undo records accumulate. When the subsequent statements then scan the records through the read view opened 30 seconds earlier, write performance collapses:
```
[ 263s ] thds: 32 tps: 12965.81 qps: 12965.81 (r/w/o: 0.00/12965.81/0.00) lat (ms,99%): 9.73 err/s: 0.00 reconn/s: 0.00
[ 264s ] thds: 32 tps: 13027.91 qps: 13027.91 (r/w/o: 0.00/13027.91/0.00) lat (ms,99%): 9.56 err/s: 0.00 reconn/s: 0.00
[ 265s ] thds: 32 tps: 13153.21 qps: 13153.21 (r/w/o: 0.00/13153.21/0.00) lat (ms,99%): 9.06 err/s: 0.00 reconn/s: 0.00
[ 266s ] thds: 32 tps: 13153.60 qps: 13153.60 (r/w/o: 0.00/13153.60/0.00) lat (ms,99%): 9.39 err/s: 0.00 reconn/s: 0.00
[ 267s ] thds: 32 tps: 13206.86 qps: 13206.86 (r/w/o: 0.00/13206.86/0.00) lat (ms,99%): 8.90 err/s: 0.00 reconn/s: 0.00
[ 268s ] thds: 32 tps: 12636.26 qps: 12636.26 (r/w/o: 0.00/12636.26/0.00) lat (ms,99%): 10.09 err/s: 0.00 reconn/s: 0.00
[ 269s ] thds: 32 tps: 5750.25 qps: 5750.25 (r/w/o: 0.00/5750.25/0.00) lat (ms,99%): 10.09 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 32 tps: 15.00 qps: 15.00 (r/w/o: 0.00/15.00/0.00) lat (ms,99%): 1427.08 err/s: 0.00 reconn/s: 0.00
[ 271s ] thds: 32 tps: 10.00 qps: 10.00 (r/w/o: 0.00/10.00/0.00) lat (ms,99%): 2405.65 err/s: 0.00 reconn/s: 0.00
[ 272s ] thds: 32 tps: 9.00 qps: 9.00 (r/w/o: 0.00/9.00/0.00) lat (ms,99%): 3448.53 err/s: 0.00 reconn/s: 0.00
[ 273s ] thds: 32 tps: 13.00 qps: 13.00 (r/w/o: 0.00/13.00/0.00) lat (ms,99%): 4517.90 err/s: 0.00 reconn/s: 0.00
[ 274s ] thds: 32 tps: 8.00 qps: 8.00 (r/w/o: 0.00/8.00/0.00) lat (ms,99%): 4437.27 err/s: 0.00 reconn/s: 0.00
[ 275s ] thds: 32 tps: 14.00 qps: 14.00 (r/w/o: 0.00/14.00/0.00) lat (ms,99%): 4943.53 err/s: 0.00 reconn/s: 0.00
[ 276s ] thds: 32 tps: 12.00 qps: 12.00 (r/w/o: 0.00/12.00/0.00) lat (ms,99%): 4683.57 err/s: 0.00 reconn/s: 0.00
[ 277s ] thds: 32 tps: 6.00 qps: 6.00 (r/w/o: 0.00/6.00/0.00) lat (ms,99%): 4055.23 err/s: 0.00 reconn/s: 0.00
[ 278s ] thds: 32 tps: 12.00 qps: 12.00 (r/w/o: 0.00/12.00/0.00) lat (ms,99%): 4768.67 err/s: 0.00 reconn/s: 0.00
```
Suggested fix:
When MySQL traverses an excessively long undo chain, the clustered index leaf page and the secondary index leaf page (if one is involved) stay latched for a long time.
Writes are blocked because they cannot promptly obtain the page latch.
MySQL traverses the undo chain on several code paths. Take "accessing history records through a secondary index" as an example:
Secondary index leaf -> Clustered index leaf -> repeat[1 .. n] undo page
As the path shows, MySQL holds the latches on both the secondary index leaf and the clustered index leaf for the whole undo-chain traversal.
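For reference, in the 8.0 sources this corresponds roughly to the following call chain (the first three function names appear in the patch below; trx_undo_prev_version_build() is the per-version step, and the outline is approximate rather than exact):
```
row_search_mvcc()                              // row0sel.cc: leaf latches held via the mtr
  row_sel_build_prev_vers_for_mysql()
    row_vers_build_for_consistent_read()       // row0vers.cc
      loop: trx_undo_prev_version_build()      // one undo record per iteration
```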
In my opinion, it is unnecessary to hold these latches the whole time.
For the clustered index leaf: once we step to an undo record generated by a committed transaction, the structure of the undo chain from that record down to the visible version can no longer change, so we can release the clustered index leaf latch and continue traversing the chain safely.
For the secondary index leaf, there also appears to be no concurrency issue in releasing its latch after traversing too many undo records. The prerequisite is that we call pcur->store_position() before the mtr.commit() that releases the latches, and that we restore the cursor once the undo-chain traversal finishes.
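For illustration, the save/release/restore pattern would look roughly like this sketch. store_position() and mtr.commit() are the calls actually used in the patch below; the restore_position() arguments shown here are assumptions (the exact signature differs between 8.0 versions):
```
/* Sketch only: remember the cursor position by key, then let the mtr
   commit release every page latch the mini-transaction holds. */
pcur->store_position(&mtr);
mtr.commit();

/* ... traverse the rest of the undo chain with no leaf latches held ... */

/* Later, reposition on the saved record under a fresh mini-transaction
   (assumed arguments; the real signature also takes a source location). */
mtr.start();
pcur->restore_position(BTR_SEARCH_LEAF, &mtr, UT_LOCATION_HERE);
```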
I wrote a patch to verify the write-performance improvement when the latches are released early in this scenario. With the patch applied, traversing the undo chain no longer has a significant impact on writes.
```
diff --git a/storage/innobase/include/row0vers.h b/storage/innobase/include/row0vers.h
index 5ec2e7d0ac9..d0510fef330 100644
--- a/storage/innobase/include/row0vers.h
+++ b/storage/innobase/include/row0vers.h
@@ -47,6 +47,19 @@ this program; if not, write to the Free Software Foundation, Inc.,
// Forward declaration
class ReadView;
+struct row_vers_traverse_strategy_t {
+ enum class strategy_type : uint8_t {
+ kLatchStrategy,
+ kNoLatchStrategy,
+ };
+ strategy_type type{strategy_type::kLatchStrategy};
+};
+
+struct row_vers_no_latch_strategy_t : row_vers_traverse_strategy_t {
+ row_vers_no_latch_strategy_t() { type = strategy_type::kNoLatchStrategy; }
+ std::function<void()> release_latch{nullptr};
+};
+
/** Finds out if an active transaction has inserted or modified a secondary
index record.
@param[in] rec record in a secondary index
@@ -115,7 +128,8 @@ bool row_vers_old_has_index_entry(
dberr_t row_vers_build_for_consistent_read(
const rec_t *rec, mtr_t *mtr, dict_index_t *index, ulint **offsets,
ReadView *view, mem_heap_t **offset_heap, mem_heap_t *in_heap,
- rec_t **old_vers, const dtuple_t **vrow, lob::undo_vers_t *lob_undo);
+ rec_t **old_vers, const dtuple_t **vrow, lob::undo_vers_t *lob_undo,
+ row_vers_traverse_strategy_t *traverse_strategy = nullptr);
/** Constructs the last committed version of a clustered index record,
which should be seen by a semi-consistent read.
diff --git a/storage/innobase/row/row0sel.cc b/storage/innobase/row/row0sel.cc
index 0c4c7fc86c2..c7ea5cc6c3c 100644
--- a/storage/innobase/row/row0sel.cc
+++ b/storage/innobase/row/row0sel.cc
@@ -3072,7 +3072,8 @@ bool row_sel_store_mysql_rec(byte *mysql_rec, row_prebuilt_t *prebuilt,
ReadView *read_view, dict_index_t *clust_index, row_prebuilt_t *prebuilt,
const rec_t *rec, ulint **offsets, mem_heap_t **offset_heap,
rec_t **old_vers, const dtuple_t **vrow, mtr_t *mtr,
- lob::undo_vers_t *lob_undo) {
+ lob::undo_vers_t *lob_undo,
+ row_vers_traverse_strategy_t *traverse_strategy = nullptr) {
DBUG_TRACE;
dberr_t err;
@@ -3085,7 +3086,7 @@ bool row_sel_store_mysql_rec(byte *mysql_rec, row_prebuilt_t *prebuilt,
err = row_vers_build_for_consistent_read(
rec, mtr, clust_index, offsets, read_view, offset_heap,
- prebuilt->old_vers_heap, old_vers, vrow, lob_undo);
+ prebuilt->old_vers_heap, old_vers, vrow, lob_undo, traverse_strategy);
return err;
}
@@ -4454,6 +4455,7 @@ dberr_t row_search_mvcc(byte *buf, page_cur_mode_t mode,
dberr_t err = DB_SUCCESS;
bool unique_search = false;
bool mtr_has_extra_clust_latch = false;
+ bool pcur_need_restore_position = false;
bool moves_up = false;
bool set_also_gap_locks = true;
/* if the query is a plain locking SELECT, and the isolation level
@@ -5331,11 +5333,23 @@ rec_loop:
!lock_clust_rec_cons_read_sees(rec, index, offsets,
trx_get_read_view(trx))) {
rec_t *old_vers;
+ row_vers_no_latch_strategy_t traverse_strategy{};
+ traverse_strategy.release_latch = [&] {
+ pcur_need_restore_position = true;
+ pcur->store_position(&mtr);
+ mtr.commit();
+ ut_a(pcur->m_rel_pos == BTR_PCUR_ON);
+ };
/* The following call returns 'offsets' associated with 'old_vers' */
err = row_sel_build_prev_vers_for_mysql(
trx->read_view, clust_index, prebuilt, rec, &offsets, &heap,
&old_vers, need_vrow ? &vrow : nullptr, &mtr,
- prebuilt->get_lob_undo());
+ prebuilt->get_lob_undo(), &traverse_strategy);
+
+ if (pcur_need_restore_position) {
+ ut_a(!mtr.is_active());
+ traverse_strategy.release_latch = nullptr;
+ }
if (err != DB_SUCCESS) {
goto lock_wait_or_error;
@@ -5825,7 +5839,8 @@ next_rec:
For R-tree spatial search, we also commit the mini-transaction
each time */
- if (mtr_has_extra_clust_latch || spatial_search) {
+ if (mtr_has_extra_clust_latch || spatial_search ||
+ pcur_need_restore_position) {
/* If we have extra cluster latch, we must commit
mtr if we are moving to the next non-clustered
index record, because we could break the latching
@@ -5835,12 +5850,15 @@ next_rec:
bool is_pcur_rec = (pcur->get_rec() == prev_rec);
/* No need to do store restore for R-tree */
- if (!spatial_search) {
+ if (!spatial_search && !pcur_need_restore_position) {
pcur->store_position(&mtr);
}
- mtr_commit(&mtr);
+ if (!pcur_need_restore_position) {
+ mtr_commit(&mtr);
+ }
mtr_has_extra_clust_latch = false;
+ pcur_need_restore_position = false;
DEBUG_SYNC_C("row_search_before_mtr_restart_for_extra_clust");
@@ -5933,6 +5951,7 @@ lock_wait_or_error:
lock_table_wait:
mtr_commit(&mtr);
mtr_has_extra_clust_latch = false;
+ pcur_need_restore_position = false;
trx->error_state = err;
diff --git a/storage/innobase/row/row0vers.cc b/storage/innobase/row/row0vers.cc
index 1573601d488..ba8c480bd82 100644
--- a/storage/innobase/row/row0vers.cc
+++ b/storage/innobase/row/row0vers.cc
@@ -1250,7 +1250,8 @@ bool row_vers_old_has_index_entry(
dberr_t row_vers_build_for_consistent_read(
const rec_t *rec, mtr_t *mtr, dict_index_t *index, ulint **offsets,
ReadView *view, mem_heap_t **offset_heap, mem_heap_t *in_heap,
- rec_t **old_vers, const dtuple_t **vrow, lob::undo_vers_t *lob_undo) {
+ rec_t **old_vers, const dtuple_t **vrow, lob::undo_vers_t *lob_undo,
+ row_vers_traverse_strategy_t *traverse_strategy) {
DBUG_TRACE;
const rec_t *version;
rec_t *prev_version;
@@ -1279,6 +1280,8 @@ dberr_t row_vers_build_for_consistent_read(
version = rec;
+ uint64_t n_rec_traversed = 0;
+
for (;;) {
mem_heap_t *prev_heap = heap;
@@ -1335,6 +1338,15 @@ dberr_t row_vers_build_for_consistent_read(
}
version = prev_version;
+
+ n_rec_traversed++;
+ if (n_rec_traversed == 10 && traverse_strategy != nullptr &&
+ traverse_strategy->type ==
+ row_vers_traverse_strategy_t::strategy_type::kNoLatchStrategy) {
+ auto const no_latch_strategy =
+ (row_vers_no_latch_strategy_t *)traverse_strategy;
+ no_latch_strategy->release_latch();
+ }
}
mem_heap_free(heap);
```