Bug #117553 | Crash caused by race condition between MVCC::view_open and purge view | ||
---|---|---|---|
Submitted: | 24 Feb 9:15 | Modified: | 11 Mar 8:25 |
Reporter: | Yin Peng (OCA) | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
Version: | 8.4.4 | OS: | Any |
Assigned to: | CPU Architecture: | Any | |
Tags: | Contribution |
[24 Feb 9:15]
Yin Peng
[24 Feb 9:16]
Yin Peng
The new readview of purge_sys should not be older than the old one. (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.
Contribution: patch_of_purge_view.txt (text/plain), 2.67 KiB.
[24 Feb 9:22]
Yin Peng
Oh, sorry, in the previous description the state of 'ReadView::m_closed' was incorrectly described. The accurate line should be: 'ReadView::m_closed = false'
[24 Feb 13:10]
MySQL Verification Team
Hello yin peng, Thank you for the report and test case. Verified as described. regards, Umesh
[3 Mar 10:42]
Jakub Lopuszanski
Thank you for your report! I am afraid this issue (and some other subtle variants of it due to missing memory barrier and `m_closed` not being atomic) is known at least since Bug#32679024 PURGE CAN REMOVE TOO MUCH BECAUSE OF MISSING FENCE https://mybug.mysql.oraclecorp.com/orabugs/site/bug.php?id=32679024 (reported in 2021). Actually, the code comment in MVCC::view_open right before the place you modify it acknowledges the problem: ``` void MVCC::view_open(ReadView *&view, trx_t *trx) { ut_ad(!srv_read_only_mode); /** If no new RW transaction has been started since the last view was created then reuse the the existing view. */ if (view != nullptr) { uintptr_t p = reinterpret_cast<uintptr_t>(view); view = reinterpret_cast<ReadView *>(p & ~1); ut_ad(view->m_closed); /* NOTE: This can be optimised further, for now we only reuse the view if there are no active RW transactions. There is an inherent race here between purge and this thread. Purge will skip views that are marked as closed. Therefore we must set the low limit id after we reset the closed status after the check. */ if (trx_is_autocommit_non_locking(trx) && view->empty()) { view->m_closed = false; if (view->m_low_limit_id == trx_sys_get_next_trx_id_or_no()) { return; } else { view->m_closed = true; } } } ``` That code comment is even more ancient, as it was added in 2013. Your proposal, which boils down to simply refusing to let the purge_view travel back in time, is an interesting one. It looks like it should help in those (hopefully) unlikely rare-conditions, without incurring any cost in usual case, nice! What I have to think through, though, is: 1. do we also have to worry about the case where the read-view has the same low_limit_no(), but differs in the active transaction set (m_ids)? There are places in our code in which the purge_sys->view is used to determine changes_visible(..) like in trx_undo_get_undo_rec. If we are going to fix one aspect of "montonicity", maybe we should fix the other too? One way to approach it, would be to change `<` to `<=` in your patch, so that we only use a single read-view for any specific low_limit_no(), this way guaranteeing monotonicity, without spending too much time on comparison logic. But could the copying have performance impact? Could it somehow prevent the purge thread from catching up to clean state in some edgecases and thus blocking a clean shutdown? 2. is there a way to solve it "properly", so that instead of detecting the race-condition and reacting to it, we "simply" eliminate the race-condition in the first place? 3. how to combine this with the ongoing WL#15895 InnoDB: MVCC and ReadView interfaces? 4. how to combine this with the ongoing WL#15991 InnoDB: new MVCC with shared ReadViews? Anyway, looks like the codebase with your patch is better than without it, so I think we can tackle these problems one by one later. Thanks again for the patch!
[4 Mar 2:42]
Yin Peng
Hi Jakub, Thank you for your thoughtful feedback. Here are my thoughts on your questions: 1. I fully agree with changing < to <= in the patch. Regarding performance concerns, the most significant overhead likely stems from memory allocation during ReadView copying. To mitigate this, we could consider reusing a global ReadView object to store the last valid purge view, thereby avoiding repeated allocations. 2. Eliminating race conditions without locks or synchronization primitives is inherently challenging. A promising approach might be adopting the SCN mechanism proposed by Zhai Weixiang in https://bugs.mysql.com/bug.php?id=109599 . 3. I checked the MySQL Worklog but found no public details about WL#15895 and WL#15991.
[5 Mar 16:52]
Jakub Lopuszanski
Hello Yin Peng, Sorry for linking to unpublished WLs. I am afraid they have to stay private, until ready. OTOH I had to mention them for the benefit of our devs reading this discussion, and also to let you know there might be some delay in accepting your patch not because it is wrong in itself, but rather due to conceptual issues of merging it into a bigger narrative of what is currently going on. For the context, it's important to notice few odd facts about current implementation of MVCC: - not every transaction is assigned a trx->no. For example a INSERT-only transaction will be assigned trx->id on start, but no trx->no on commit - assignment of trx->no and trx->id happens from the same atomic, but under two different mutexes (trx_sys->serialization_mutex and trx_sys->mutex respectively), so holding only one of them does not prevent other threads from bumping it - we assing trx->no and add trx to "serialization list" first under trx_sys->serialization_mutex, then (under trx_sys->mutex) remove it from trx_sys->rw_trx_ids, and then (again under trx_sys->serialization_mutex) remove it from serialization list. - one implication of this strange dance is that you can have several read-views with the very same low_limit_no() (which is computed by looking at "serialization list" state), while differing in the list of active transactions, as there's nothing preventing one from starting a new transaction, or even from completely committing INSERT-only transaction, or removing a transaction from rw_trx_ids and so on - another implication of this strange dance is that you can have two read-views which conceptually represent same state of DB, i.e. same state of trxs is considered "already committed" by them, yet they differ in their low_limit_no(), because the removal from serialization list happens late in the commit cycle - while for the most part the most relevant aspect of "purge view" is its low_limit_no(), there are places, such as row_vers_must_preserve_del_marked or trx_undo_get_undo_rec which really care about changes_visible(trx_id,..) queries - these places are quite tricky to analyse and prove correctness, and I will not try to provide full analysis here. Let me just note that row_vers_must_preserve_del_marked needs to reliably determine if at the moment of the call it looked like purge is allowed to remove undo log records created by trx_id or not, and the caller (perhaps counter-intuitively) should remove it if and only if purge could remove it to. I am afraid that non-monotonicity of active trx set part of the purge read-view (even if low_limit_no() stays constant) could lead to situation where ROLLBACK thread physically removes a record, which will become needed in a moment for a re-opened read-view This last point is the worry which lead me to propose replacing < with <= in your patch to force monotonicity of active trx set too. The way it forces it, is a bit non-trivial: low_limit_no() is determined by trx_sys->serialisation_min_trx_no.load(), which in turn is guaranteed to be monotone over time, and that means enforcing < relation on low_limit_no() enforces < relation on time of creation, which in turn means the active trx set will also be monotonic. My main worry about replacing < with <= was that it could somehow make the purge get "stuck" on an old read-view, while there is a newer one, and thus prevent it from purging "everything". But that was silly. What matters for "purging everything" is the low_limit_no(), so if we are trying to perform a clean shutdown and the purge_sys->view.low_limit_no() = X, and X is indeed already equal to trx_sys_get_next_trx_id_or_no(), then that should be enough to purge everything. Even if this purge view had some "active trx ids" in it, this doesn't matter, and shouldn't matter, because one can prove that it can only happen if all of them are INSERT-only, so there's nothing to purge. Your patch, if we change < to <= solves 99% of the problem, I think. There's a remaining 1%, which is the following. Imagine we have a read-view RV1 with m_low_limit_no=17, m_low_limit_id=20, m_up_limit_id=20, m_ids={}. This can happen, for example, if transaction with trx->no==17 was removed from trx_sys->rw_trx_ids, but not yet removed from serialization list. Imagine there's another read-view RV2 with m_low_limit_no=20, m_low_limit_id=20, m_up_limit_id=20, m_ids={}. This can happen a bit later, when trx->no==17 finally is removed from serialization list. Assume that there's no other activity in the system from now on, just SELECTs. You can then open and close RV1 and RV2 multiple times, and AFAICT it will succeed each time, because: view->empty() holds for both of them view->m_low_limit_id == trx_sys_get_next_trx_id_or_no() holds for both of them Yet, one of them has low_limit_no() == 17 and the other has 20. Let's say RV1 was closed and RV2 was open when we were updating purge_sys->view, and thus it became assigned a copy o RV2, which has low_limit_no()==20. There's nothing preventing reopening RV1, which, on paper, has low_limit_no()==17. I am not sure if this can lead to any *real* problem - one can argue that RV1 represents the same state of DB as RV2, and it "doesn't really need" the undo logs of 17, 18 or 19 (no matter if these are ids or nos). Yet, can it lead to some assertion failure, that we have an open RV1, with low_limit_no() < purge_sys->view.low_limit_no() somewhere? I don't know. But also, this is possible already in mysql-trunk with or without your patch, so it is a separate issue, and perhaps purely academic. Meanwhile, I wonder if your patch could be improved, to avoid copying at all, in case the read view is too old. Something like: ``` void MVCC::clone_oldest_view(ReadView *view) { trx_sys_mutex_enter(); - ReadView *oldest_view = get_oldest_view(); + ReadView * oldest_view; + for (oldest_view = UT_LIST_GET_LAST(m_views); + oldest_view != nullptr; + oldest_view = UT_LIST_GET_PREV(m_view_list, oldest_view)) { + if (!view->is_closed()) { + if(view->low_limit_no() <= view->low_limit_no()){ + // do not update the view + trx_sys_mutex_exit(); + return; + } + break; + } + } if (oldest_view == nullptr) { view->prepare(0); trx_sys_mutex_exit(); } else { view->copy_prepare(*oldest_view); trx_sys_mutex_exit(); view->copy_complete(); } /* Update view to block purging transaction till GTID is persisted. */ auto >id_persistor = clone_sys->get_gtid_persistor(); auto gtid_oldest_trxno = gtid_persistor.get_oldest_trx_no(); view->reduce_low_limit(gtid_oldest_trxno); } ``` Note: don't worry about gtid_persistor thingy - this should be, and will be handled separately.
[5 Mar 16:54]
Jakub Lopuszanski
sorry, I've meant - if(view->low_limit_no() <= view->low_limit_no()){ + if(oldest_view->low_limit_no() <= view->low_limit_no()){
[11 Mar 8:25]
Yin Peng
Hello Jakub, Thank you for the detailed explanation and the enhanced patch.