MySQL Bugs: #117553: Crash caused by race condition between MVCC::view

Bug #117553	Crash caused by race condition between MVCC::view_open and purge view
Submitted:	24 Feb 9:15	Modified:	11 Mar 8:25
Reporter:	Yin Peng (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	8.4.4	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Contribution

Description:
A critical crash was observed in our environment. The failure occurred during transaction rollback operations, triggered by an assertion failure in ut_dbg_assertion_failed with the condition next_undo_offset > undo_offset. Call stack:

ut_dbg_assertion_failed (expr=expr@entry=0x1a29ce8 "next_undo_offset > undo_offset")
trx_undo_rec_copy 
trx_undo_get_undo_rec_low 
trx_undo_get_undo_rec out
trx_undo_prev_version_build
row_vers_old_has_index_entry
row_undo_mod_del_mark_or_remove_sec_low
row_undo_mod_del_mark_or_remove_sec
row_undo_mod_upd_exist_sec
row_undo_mod(undo_node_t*, que_thr_t*)
row_undo
row_undo_step(que_thr_t*)
que_thr_step (thr=<optimized out>)
que_run_threads_low 
que_run_threads(que_thr_t*)
trx_rollback_to_savepoint_low(trx_t*, trx_savept_t*)
trx_rollback_for_mysql_low (trx=0x7f34a194d270)
trx_rollback_low(trx_t*)
trx_rollback_for_mysql(trx_t*)
innobase_rollback(handlerton*, THD*, bool)
ha_rollback_low

After analysing the coredump I found that the low_limit_no of purge_sys->view is less than the trx_no of purge_sys->iter. This indicates that some rollback operations may read the purged undo logs! After diving into the code, I think the root cause is that there is a race condition between MVCC::view_open and MVCC::clone_oldest_view:

thd1                                   thd2                           purge
simple select
readview:
{ 
  m_low_limit_id = 2475083946,
  m_up_limit_id = 2475083946,
  m_low_limit_no = 2475083946
}

                                      dml and commit

                                                                    purge all undo logs
                                                                    purge_sys->view:
                                                                    { 
                                                                      m_low_limit_id = 2475254583,
                                                                      m_up_limit_id = 2475254583,
                                                                      m_low_limit_no = 2475254583
                                                                    }
                                                                    purge_sys->iter.trx_no: 2475254583
                                              

simple select
check if readview can be reused:
{ 
  m_low_limit_id = 2475083946,
  m_up_limit_id = 2475083946,
  m_low_limit_no = 2475083946
}
ReadView::m_closed = true

                                              
                                                                    MVCC::clone_oldest_view
                                                                    purge_sys->view:
                                                                    { 
                                                                      m_low_limit_id = 2475083946,
                                                                      m_up_limit_id = 2475083946,
                                                                      m_low_limit_no = 2475083946
                                                                    }

                                      dml and rollback       
                                    may read the stale undo

How to repeat:
First we need to add some debug points:

diff --git a/storage/innobase/read/read0read.cc b/storage/innobase/read/read0read.cc
index a918b11b06f..6ee7ad8b1d1 100644
--- a/storage/innobase/read/read0read.cc
+++ b/storage/innobase/read/read0read.cc
@@ -36,6 +36,8 @@ this program; if not, write to the Free Software Foundation, Inc.,
 
 #include "srv0srv.h"
 #include "trx0sys.h"
+#include "current_thd.h"
+#include <debug_sync.h>
 
 /*
 -------------------------------------------------------------------------------
@@ -520,6 +522,11 @@ void MVCC::view_open(ReadView *&view, trx_t *trx) {
 
     if (trx_is_autocommit_non_locking(trx) && view->empty()) {
       view->m_closed = false;
+#ifdef UNIV_DEBUG
+      if (current_thd) {
+        DEBUG_SYNC(current_thd, "simulate_stop_after_setting_m_closed");
+      }
+#endif
 
       if (view->m_low_limit_id == trx_sys_get_next_trx_id_or_no()) {
         return;
diff --git a/storage/innobase/trx/trx0purge.cc b/storage/innobase/trx/trx0purge.cc
index cbbac9eb59f..15fc8250ce7 100644
--- a/storage/innobase/trx/trx0purge.cc
+++ b/storage/innobase/trx/trx0purge.cc
@@ -2407,6 +2407,15 @@ ulint trx_purge(ulint n_purge_threads, /*!< in: number of purge tasks
 
   rw_lock_x_unlock(&purge_sys->latch);
 
+  DBUG_EXECUTE_IF("simulate_stop_after_clone_oldest_view", {
+    bool end = false;
+    while (!end) {
+      std::this_thread::sleep_for(std::chrono::seconds(1));
+      ib::info() << "[TXSQL] purge thread stopped after clone oldest view";
+      DBUG_EXECUTE_IF("continue_to_purge", { end = true; });
+    }
+  });
+
 #ifdef UNIV_DEBUG
   if (srv_purge_view_update_only_debug) {
     return (0);

Then run the following test script and you will see the server has gone away:

source include/have_debug.inc;
source include/have_debug_sync.inc;

--connect(con1,localhost,root)

create table t1(id int not null primary key auto_increment, c1 int, c2 int, key k_c1(c1));

insert into t1(c1,c2) values(1,1),(2,2);
sleep 2;
connection con1;
select * from t1;

connection default;
update t1 set c2=c2+1;
update t1 set c1=c1+2;

connection default;
set global innodb_purge_stop_now=ON;
set global innodb_purge_run_now=ON;
--source include/wait_innodb_all_purged.inc

CREATE UNDO TABLESPACE undo_test_001 ADD DATAFILE 'undo_test_001.ibu';
ALTER UNDO TABLESPACE innodb_undo_001 SET INACTIVE;
--let $grep_file = $MYSQLTEST_VARDIR/log/mysqld.1.err
--let $grep_pattern = .*Completed truncate of undo tablespace innodb_undo_001.*
--source include/wait_for_pattern_in_file.inc
ALTER UNDO TABLESPACE innodb_undo_001 SET ACTIVE;

ALTER UNDO TABLESPACE innodb_undo_002 SET INACTIVE;
--let $grep_file = $MYSQLTEST_VARDIR/log/mysqld.1.err
--let $grep_pattern = .*Completed truncate of undo tablespace innodb_undo_002.*
--source include/wait_for_pattern_in_file.inc
ALTER UNDO TABLESPACE innodb_undo_002 SET ACTIVE;

ALTER UNDO TABLESPACE undo_test_001 SET INACTIVE;
--let $grep_file = $MYSQLTEST_VARDIR/log/mysqld.1.err
--let $grep_pattern = .*Completed truncate of undo tablespace undo_test_001.*
--source include/wait_for_pattern_in_file.inc
DROP UNDO TABLESPACE undo_test_001;

connection con1;
set debug_sync="simulate_stop_after_setting_m_closed signal clone_view_stopped wait_for clone_view_continue";
send select * from t1;

connection default;
set debug_sync="now wait_for clone_view_stopped";
set global debug='+d, simulate_stop_after_clone_oldest_view';
update t1 set c2=c2 +1;
--let $grep_file = $MYSQLTEST_VARDIR/log/mysqld.1.err
--let $grep_pattern = .*purge thread stopped after clone oldest.*
--source include/wait_for_pattern_in_file.inc

begin;
update t1 set c1=c1+2;
update t1 set c2 = c2 + 2;
rollback;

set debug_sync='now signal clone_view_continue';

connection con1;
reap;
connection default;
disconnect con1;

set global debug='-d, simulate_stop_after_clone_oldest_view';
set global debug='+d,continue_to_purge';
sleep 5;
set global debug='-d,continue_to_purge';
drop table t1;

Suggested fix:
The new readview of purge_sys should not be older than the old one.

The new readview of purge_sys should not be older than the old one. (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: patch_of_purge_view.txt (text/plain), 2.67 KiB.

Oh, sorry, in the previous description the state of 'ReadView::m_closed' was incorrectly described. The accurate line should be:
'ReadView::m_closed = false'

Hello yin peng,

Thank you for the report and test case.
Verified as described.

regards,
Umesh

Thank you for your report!

I am afraid this issue (and some other subtle variants of it due to missing memory barrier and `m_closed` not being atomic) is known at least since
Bug#32679024 PURGE CAN REMOVE TOO MUCH BECAUSE OF MISSING FENCE
https://mybug.mysql.oraclecorp.com/orabugs/site/bug.php?id=32679024
(reported in 2021).
Actually, the code comment in MVCC::view_open right before the place you modify it acknowledges the problem:
```
void MVCC::view_open(ReadView *&view, trx_t *trx) {
  ut_ad(!srv_read_only_mode);

  /** If no new RW transaction has been started since the last view
  was created then reuse the the existing view. */
  if (view != nullptr) {
    uintptr_t p = reinterpret_cast<uintptr_t>(view);

    view = reinterpret_cast<ReadView *>(p & ~1);

    ut_ad(view->m_closed);

    /* NOTE: This can be optimised further, for now we only
    reuse the view if there are no active RW transactions.

    There is an inherent race here between purge and this
    thread. Purge will skip views that are marked as closed.
    Therefore we must set the low limit id after we reset the
    closed status after the check. */

    if (trx_is_autocommit_non_locking(trx) && view->empty()) {
      view->m_closed = false;

      if (view->m_low_limit_id == trx_sys_get_next_trx_id_or_no()) {
        return;
      } else {
        view->m_closed = true;
      }
    }
  }
```
That code comment is even more ancient, as it was added in 2013.

Your proposal, which boils down to simply refusing to let the purge_view travel back in time, is an interesting one.
It looks like it should help in those (hopefully) unlikely rare-conditions, without incurring any cost in usual case, nice!
What I have to think through, though, is:
1. do we also have to worry about the case where the read-view has the same low_limit_no(), but differs in the active transaction set (m_ids)? There are places in our code in which the purge_sys->view is used to determine changes_visible(..) like in trx_undo_get_undo_rec. If we are going to fix one aspect of "montonicity", maybe we should fix the other too? One way to approach it, would be to change `<` to `<=` in your patch, so that we only use a single read-view for any specific low_limit_no(), this way guaranteeing monotonicity, without spending too much time on comparison logic. But could the copying have performance impact? Could it somehow prevent the purge thread from catching up to clean state in some edgecases and thus blocking a clean shutdown?
2. is there a way to solve it "properly", so that instead of detecting the race-condition and reacting to it, we "simply" eliminate the race-condition in the first place?
3. how to combine this with the ongoing WL#15895 InnoDB: MVCC and ReadView interfaces?
4. how to combine this with the ongoing WL#15991 InnoDB: new MVCC with shared ReadViews?

Anyway, looks like the codebase with your patch is better than without it, so I think we can tackle these problems one by one later.
Thanks again for the patch!

Hi Jakub,

Thank you for your thoughtful feedback. Here are my thoughts on your questions:

1. I fully agree with changing < to <= in the patch. Regarding performance concerns, the most significant overhead likely stems from memory allocation during ReadView copying. To mitigate this, we could consider reusing a global ReadView object to store the last valid purge view, thereby avoiding repeated allocations.

2. Eliminating race conditions without locks or synchronization primitives is inherently challenging. A promising approach might be adopting the SCN mechanism proposed by Zhai Weixiang in https://bugs.mysql.com/bug.php?id=109599 .

3. I checked the MySQL Worklog but found no public details about WL#15895 and WL#15991.

Hello Yin Peng,
Sorry for linking to unpublished WLs. I am afraid they have to stay private, until ready. OTOH I had to mention them for the benefit of our devs reading this discussion, and also to let you know there might be some delay in accepting your patch not because it is wrong in itself, but rather due to conceptual issues of merging it into a bigger narrative of what is currently going on.

For the context, it's important to notice few odd facts about current implementation of MVCC:
- not every transaction is assigned a trx->no. For example a INSERT-only transaction will be assigned trx->id on start, but no trx->no on commit
- assignment of trx->no and trx->id happens from the same atomic, but under two different mutexes (trx_sys->serialization_mutex and trx_sys->mutex respectively), so holding only one of them does not prevent other threads from bumping it
- we assing trx->no and add trx to "serialization list" first under trx_sys->serialization_mutex, then (under trx_sys->mutex) remove it from trx_sys->rw_trx_ids, and then (again under trx_sys->serialization_mutex) remove it from serialization list. 
- one implication of this strange dance is that you can have several read-views with the very same low_limit_no() (which is computed by looking at "serialization list" state), while differing in the list of active transactions, as there's nothing preventing one from starting a new transaction, or even from completely committing INSERT-only transaction, or removing a transaction from rw_trx_ids and so on
- another implication of this strange dance is that you can have two read-views which conceptually represent same state of DB, i.e. same state of trxs is considered "already committed" by them, yet they differ in their low_limit_no(), because the removal from serialization list happens late in the commit cycle
- while for the most part the most relevant aspect of "purge view" is its low_limit_no(), there are places, such as row_vers_must_preserve_del_marked or trx_undo_get_undo_rec which really care about changes_visible(trx_id,..) queries - these places are quite tricky to analyse and prove correctness, and I will not try to provide full analysis here. Let me just note that row_vers_must_preserve_del_marked needs to reliably determine if at the moment of the call it looked like purge is allowed to remove undo log records created by trx_id or not, and the caller (perhaps counter-intuitively) should remove it if and only if purge could remove it to. I am afraid that non-monotonicity of active trx set part of the purge read-view (even if low_limit_no() stays constant) could lead to situation where ROLLBACK thread physically removes a record, which will become needed in a moment for a re-opened read-view

This last point is the worry which lead me to propose replacing < with <= in your patch to force monotonicity of active trx set too.
The way it forces it, is a bit non-trivial: low_limit_no() is determined by trx_sys->serialisation_min_trx_no.load(), which in turn is guaranteed to be monotone over time, and that means enforcing < relation on low_limit_no() enforces < relation on time of creation, which in turn means the active trx set will also be monotonic.
My main worry about replacing < with <= was that it could somehow make the purge get "stuck" on an old read-view, while there is a newer one, and thus prevent it from purging "everything".
But that was silly. What matters for "purging everything" is the low_limit_no(), so if we are trying to perform a clean shutdown and the purge_sys->view.low_limit_no() = X, and X is indeed already equal to trx_sys_get_next_trx_id_or_no(), then that should be enough to purge everything.
Even if this purge view had some "active trx ids" in it, this doesn't matter, and shouldn't matter, because one can prove that it can only happen if all of them are INSERT-only, so there's nothing to purge.

Your patch, if we change < to <= solves 99% of the problem, I think.
There's a remaining 1%, which is the following.
Imagine we have a read-view RV1 with m_low_limit_no=17,  m_low_limit_id=20, m_up_limit_id=20, m_ids={}. This can happen, for example, if transaction with trx->no==17 was removed from trx_sys->rw_trx_ids, but not yet removed from serialization list.
Imagine there's another read-view RV2 with m_low_limit_no=20,  m_low_limit_id=20, m_up_limit_id=20, m_ids={}. This can happen a bit later, when trx->no==17 finally is removed from serialization list.
Assume that there's no other activity in the system from now on, just SELECTs.
You can then open and close RV1 and RV2 multiple times, and AFAICT it will succeed each time, because:
view->empty() holds for both of them
view->m_low_limit_id  == trx_sys_get_next_trx_id_or_no() holds for both of them
Yet, one of them has low_limit_no() == 17 and the other has 20.
Let's say RV1 was closed and RV2 was open when we were updating purge_sys->view, and thus it became assigned a copy o RV2, which has low_limit_no()==20.
There's nothing preventing reopening RV1, which, on paper, has low_limit_no()==17.
I am not sure if this can lead to any *real* problem - one can argue that RV1 represents the same state of DB as RV2, and it "doesn't really need" the undo logs of 17, 18 or 19 (no matter if these are ids or nos).
Yet, can it lead to some assertion failure, that we have an open RV1, with low_limit_no() < purge_sys->view.low_limit_no() somewhere?
I don't know.
But also, this is possible already in mysql-trunk with or without your patch, so it is a separate issue, and perhaps purely academic.

Meanwhile, I wonder if your patch could be improved, to avoid copying at all, in case the read view is too old.
Something like:
```
void MVCC::clone_oldest_view(ReadView *view) {
  trx_sys_mutex_enter();

- ReadView *oldest_view = get_oldest_view();
+ ReadView * oldest_view;
+ for (oldest_view = UT_LIST_GET_LAST(m_views); 
+      oldest_view != nullptr;
+      oldest_view = UT_LIST_GET_PREV(m_view_list, oldest_view)) {
+   if (!view->is_closed()) {
+     if(view->low_limit_no() <= view->low_limit_no()){
+        // do not update the view
+        trx_sys_mutex_exit();
+        return;
+     }
+     break;
+   }
+ }
  if (oldest_view == nullptr) {
    view->prepare(0);

    trx_sys_mutex_exit();

  } else {
    view->copy_prepare(*oldest_view);

    trx_sys_mutex_exit();

    view->copy_complete();
  }
  /* Update view to block purging transaction till GTID is persisted. */
  auto &gtid_persistor = clone_sys->get_gtid_persistor();
  auto gtid_oldest_trxno = gtid_persistor.get_oldest_trx_no();
  view->reduce_low_limit(gtid_oldest_trxno);
}
```
Note: don't worry about gtid_persistor thingy - this should be, and will be handled separately.

sorry, I've meant
-     if(view->low_limit_no() <= view->low_limit_no()){
+     if(oldest_view->low_limit_no() <= view->low_limit_no()){

Hello  Jakub,

Thank you for the detailed explanation and the enhanced patch.