Bug #115107 Optimization in row_purge_poss_sec Function for Undo Purge Process
Submitted: 23 May 2024 18:54 Modified: 17 Oct 2024 10:01
Reporter: Zongzhi Chen (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine  Severity: S5 (Performance)
Version: 5.7, 8.0  OS: Any
Assigned to:  CPU Architecture: Any
Tags: innodb, performance

[23 May 2024 18:54] Zongzhi Chen
Description:
During the undo purge process, in the function `row_purge_poss_sec()`, as per the comment:

> "Determines if it is possible to remove a secondary index entry.  
> Removal is possible if the secondary index entry does not refer to any not delete marked version of a clustered index record where DB_TRX_ID is newer than the purge view."

If the purge view is newer than the secondary index page's `max_trx_id`, there is no need to build the previous version: any `DB_TRX_ID` that produced an entry on the page is at most the page's `max_trx_id`, so a purge view newer than the page's `max_trx_id` is necessarily newer than the `DB_TRX_ID` of every entry on the page.

This logic is similar to what `lock_sec_rec_cons_read_sees()` does in `row_search_mvcc()`.
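
To make the ordering argument explicit, here is a minimal sketch (illustrative names only; `purge_view_low_limit` stands in for the view's low limit, this is not the actual InnoDB API):

```cpp
using trx_id_t = unsigned long;

// Assumed invariant: every entry on a secondary index page was last modified
// by some transaction with DB_TRX_ID <= page_max_trx_id (when the field is
// properly maintained).
//
// If the purge view sees everything up to and including page_max_trx_id, then
// by the invariant it also sees the modification that produced any given
// entry, so no previous version of the clustered index record needs to be
// built: page_max_trx_id < purge_view_low_limit implies
// entry_trx_id < purge_view_low_limit for every entry on the page.
bool purge_view_covers_page(trx_id_t page_max_trx_id,
                            trx_id_t purge_view_low_limit /* sees all ids < this */) {
  return page_max_trx_id < purge_view_low_limit;
}
```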

**Code Diff:**

```diff
-  if (row_purge_poss_sec(node, index, entry)) {
+  if (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
+        index->table->name) || row_purge_poss_sec(node, index, entry)) {
     /* Remove the index record, which should have been
     marked for deletion. */
     if (!rec_get_deleted_flag(btr_cur_get_rec(btr_cur),
@@ -499,7 +500,8 @@ if possible.
     case ROW_FOUND:
       /* Before attempting to purge a record, check
       if it is safe to do so. */
-      if (row_purge_poss_sec(node, index, entry)) {
+      if (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
+            index->table->name) || row_purge_poss_sec(node, index, entry)) {
         btr_cur_t *btr_cur = pcur.get_btr_cur();

         /* Only delete-marked records should be purged. */
```

**Test Case Results:**

In a simple `oltp_update_index` test case:

- **Before Optimization:**
  - `Innodb_trx_history_list_len`: 34,873,334
  - Time taken: 1,311 seconds
  - Average: 26,600 undo logs per second

- **After Optimization:**
  - `Innodb_trx_history_list_len`: 32,273,298
  - Time taken: 1,030 seconds
  - Average: 31,333 undo logs per second

This is roughly an 18% improvement in purge throughput: (31,333 - 26,600) / 26,600 ≈ 17.8%.

I believe that when a long-running transaction holds back the purge view, the optimization would be even more significant.

How to repeat:
1. Use a transaction (trx) to block the undo purge.
2. Run the `sysbench oltp_update_index` test case.
3. Measure the purge performance before applying the optimization.
4. Apply the code changes as shown in the diff.
5. Run the `sysbench oltp_update_index` test case again.
6. Measure the purge performance after applying the optimization.
7. Compare the results to see the improvement.

Suggested fix:
The code:
```
-  if (row_purge_poss_sec(node, index, entry)) {
+  if (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
+        index->table->name) || row_purge_poss_sec(node, index, entry)) {
     /* Remove the index record, which should have been
     marked for deletion. */
     if (!rec_get_deleted_flag(btr_cur_get_rec(btr_cur),
@@ -499,7 +500,8 @@ if possible.
     case ROW_FOUND:
       /* Before attempting to purge a record, check
       if it is safe to do so. */
-      if (row_purge_poss_sec(node, index, entry)) {
+      if (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
+            index->table->name) || row_purge_poss_sec(node, index, entry)) {
         btr_cur_t *btr_cur = pcur.get_btr_cur();

         /* Only delete-marked records should be purged. */

```
[24 May 2024 9:55] MySQL Verification Team
Hi Mr. Chen,

Thank you for your bug report.

This is now a verified performance improvement bug in 8.0 and higher.
[24 May 2024 21:01] Zongzhi Chen
Yet another optimization can be added to the code.

```
  switch (search_result) {
    case ROW_FOUND:
      /* Before attempting to purge a record, check
      if it is safe to do so. */
      if (row_purge_poss_sec(node, index, entry)) {
        btr_cur_t *btr_cur = pcur.get_btr_cur();

        /* Only delete-marked records should be purged. */
        if (!rec_get_deleted_flag(btr_cur_get_rec(btr_cur),
                                  dict_table_is_comp(index->table))) {
          ib::error(ER_IB_MSG_1008)
              << "tried to purge non-delete-marked"
                 " record"
                 " in index "
              << index->name << " of table " << index->table->name
              << ": tuple: " << *entry << ", record: "
              << rec_index_print(btr_cur_get_rec(btr_cur), index);
          pcur.close();
          ut_d(ut_error);
          ut_o(goto func_exit_no_pcur);
        }
```

As the comment says, only delete-marked records should be purged. However, the record may be non-delete-marked, as in this case:

```
insert into t values(1, 2);
update t set c2 = 3 where c1 = 1;
update t set c2 = 4 where c1 = 1;
update t set c2 = 3 where c1 = 1;
```

Then the secondary index records look like this:
(2, 1, delete-marked), (3, 1, non-delete-marked), (4, 1, delete-marked)

When purging the update undo log record of the statement `update t set c2 = 4 where c1 = 1;`, the secondary index record (3, 1) is non-delete-marked.

If the secondary record is non-delete-marked, the code runs into the ib::error(ER_IB_MSG_1008) whenever row_purge_poss_sec() returns true. But row_purge_poss_sec() itself never checks whether the record is delete-marked, so how can row_purge_poss_sec() know it should return false when the secondary record is non-delete-marked?

Truth be told, the code is really hard to understand.
Here is my best understanding of what it is doing.
If the secondary record is delete-marked, then row_purge_poss_sec() => row_vers_old_has_index_entry() needs to build previous versions to check whether the secondary index entry refers to any not-delete-marked version of a clustered index record whose DB_TRX_ID is newer than the purge view; if no such version exists, removal is possible.
If the secondary record is non-delete-marked, then inside row_vers_old_has_index_entry() the secondary index record naturally matches the current clustered index record, so the function returns true.

The net result is that if the secondary record is delete-marked but still referred to by a non-delete-marked version of the clustered index record, or if the secondary record is non-delete-marked, row_vers_old_has_index_entry() returns true and row_purge_poss_sec() returns false.
In either case the code does not go into the error path.
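
A simplified sketch of this case analysis (hypothetical helper names, not the actual InnoDB functions):

```cpp
// Models the decision made via row_vers_old_has_index_entry(): returns true
// when the secondary index entry is still needed and must NOT be purged.
// Both parameters are illustrative stand-ins for work the real code performs.
bool entry_still_needed(bool sec_rec_delete_marked,
                        bool some_newer_clust_version_matches_entry) {
  if (!sec_rec_delete_marked) {
    // A non-delete-marked secondary record matches the current clustered
    // index record by construction, so the entry is needed.
    return true;
  }
  // Delete-marked: walk the clustered record's version chain and keep the
  // entry if a not-delete-marked version newer than the purge view still
  // produces it.
  return some_newer_clust_version_matches_entry;
}

// row_purge_poss_sec() then effectively answers "can we purge?" as the
// negation: row_purge_poss_sec(...) == !entry_still_needed(...), which is
// why it returns false for non-delete-marked records.
```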

However, calling row_purge_poss_sec() is fairly expensive. It has to call row_purge_reposition_pcur() to find the clustered index record corresponding to this secondary record, which traverses the B-tree via btr_cur_search_to_nth_level(), and then it calls row_vers_old_has_index_entry() to check whether any version of the clustered index record still requires the secondary index entry.

So adding a simple check up front can avoid the cost of calling row_purge_poss_sec():
rec_get_deleted_flag(btr_cur_get_rec(pcur.get_btr_cur()), dict_table_is_comp(index->table))

Combined with the purge-view filter optimization above, the code looks like:

```
diff --git a/storage/innobase/row/row0purge.cc b/storage/innobase/row/row0purge.cc
index 1d22e6b6328..379d07fdb1d 100644
--- a/storage/innobase/row/row0purge.cc
+++ b/storage/innobase/row/row0purge.cc
@@ -371,7 +371,9 @@ index tree.  Does not try to buffer the delete.
   which cannot be purged yet, requires its existence. If some requires,
   we should do nothing. */

-  if (row_purge_poss_sec(node, index, entry)) {
+  if (rec_get_deleted_flag(btr_cur_get_rec(pcur.get_btr_cur()), dict_table_is_comp(index->table)) &&
+      (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
+        index->table->name) || row_purge_poss_sec(node, index, entry))) {
     /* Remove the index record, which should have been
     marked for deletion. */
     if (!rec_get_deleted_flag(btr_cur_get_rec(btr_cur),
@@ -499,7 +501,10 @@ if possible.
     case ROW_FOUND:
       /* Before attempting to purge a record, check
       if it is safe to do so. */
-      if (row_purge_poss_sec(node, index, entry)) {
+      if (rec_get_deleted_flag(btr_cur_get_rec(pcur.get_btr_cur()),
+            dict_table_is_comp(index->table)) &&
+          (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
+            index->table->name) || row_purge_poss_sec(node, index, entry))) {
         btr_cur_t *btr_cur = pcur.get_btr_cur();

         /* Only delete-marked records should be purged. */
```

In my test case, oltp_read_write with 100 tables, the QPS improved from 224,969 to 247,828, and the purge speed also increased.
[27 May 2024 10:11] MySQL Verification Team
Thank you Mr. Chen, for your new contribution.
[28 May 2024 16:21] Jakub Lopuszanski
Impressive analysis!
I think the later version of the patch in which you first check if the secondary index record is delete marked at all is probably correct.
This made me wonder, why wasn't it written in the way you propose from the beginning?
After all, as you've said, it's much cheaper to test for delete-mark first.

I can see that before 2014 the code first called row_purge_poss_sec, and if it said we should remove the record, then
- we had a debug assert that the record is delete-marked
- but in release mode we would gladly remove a non-delete-marked record.
The fix for Bug#18631496 : MULTIPLE PURGE THREADS - CORRUPTION - ERROR IN SEC INDEX ENTRY UPDATE
changed that to use an if:
```
        if (row_purge_poss_sec(node, index, entry)) {
                /* Remove the index record, which should have been
                marked for deletion. */
-               ut_ad(REC_INFO_DELETED_FLAG
-                     & rec_get_info_bits(btr_cur_get_rec(btr_cur),
-                                         dict_table_is_comp(index->table)));
+               if (!rec_get_deleted_flag(btr_cur_get_rec(btr_cur),
+                                         dict_table_is_comp(index->table))) {
+                       fputs("InnoDB: tried to purge sec index entry not"
+                             " marked for deletion in\n"
+                             "InnoDB: ", stderr);
+                       dict_index_name_print(stderr, NULL, index);
+                       fputs("\n"
+                             "InnoDB: tuple ", stderr);
+                       dtuple_print(stderr, entry);
+                       fputs("\n"
+                             "InnoDB: record ", stderr);
+                       rec_print(stderr, btr_cur_get_rec(btr_cur), index);
+                       putc('\n', stderr);
+
+                       ut_ad(0);
+
+                       goto func_exit;
+               }

                btr_cur_pessimistic_delete(&err, FALSE, btr_cur, 0,
                                           RB_NONE, &mtr);
```
so it is no longer causing a crash, nor corruption.
The commit offers no test case, and there's none in the RB. The bug report has some non-deterministic test cases.
But, it seems that nobody ever understood clearly how exactly it can happen that `row_purge_poss_sec` returns `true` for a record which is not delete marked.
Perhaps this is the reason we've decided to add an error message, so that if it happens again we can try to investigate?
So, if we apply your proposal, then it will *hide* the problem, which we have never solved, which sounds a bit bad.
Thus, I think, we should first try to use some of the non-deterministic tests that are listed in
 Bug#18631496 : MULTIPLE PURGE THREADS - CORRUPTION - ERROR IN SEC INDEX ENTRY UPDATE
 Bug 19138298 - RECORD IN INDEX WAS NOT FOUND ON ROLLBACK, TRYING TO INSERT
 Bug#18652315 	ABORT IN PURGE, ERROR IN SEC INDEX ENTRY DEL UNDO, FK 
and see if any of them can produce the error message.
Then we could try to find the root cause.
Then fix it.
Then I think we can use your patch.

Alternative approach would be to do something like:
```
if (rec_get_deleted_flag(btr_cur_get_rec(pcur.get_btr_cur()), dict_table_is_comp(index->table))) {
  if (purge_sys->view.changes_visible(page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur()))),
                                      index->table->name) ||
      row_purge_poss_sec(node, index, entry)) {
    // ... delete the row ...
  }
} else {
  // in debug mode assert that we wouldn't try to purge
  ut_ad(!row_purge_poss_sec(node, index, entry));
}
```
so that we get the best performance in release mode, while also keeping an eye on the problem in debug mode, so that if it appears in our CI we will know.

One more thing to verify in your patch is the issue of purge_sys->latch which needs to be acquired at least in S mode, to access purge_sys->view.
If it is not already acquired by this thread (I think it is not) then we should make sure we acquire it for the shortest possible duration and don't violate latching order (`./mtr --mysqld=--innodb_sync_debug=1`)
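
For illustration, the check could be wrapped like the sketch below (mirroring the rw_lock_s_lock / rw_lock_s_unlock pattern used for purge_sys->latch elsewhere in InnoDB; the exact placement is an assumption):

```cpp
// Compute the page's max trx id first, then hold the S-latch on
// purge_sys->latch only for the duration of the view check itself.
const trx_id_t page_max_trx_id =
    page_get_max_trx_id(page_align(btr_cur_get_rec(pcur.get_btr_cur())));
rw_lock_s_lock(&purge_sys->latch, UT_LOCATION_HERE);
const bool can_skip_slow_check =
    purge_sys->view.changes_visible(page_max_trx_id, index->table->name);
rw_lock_s_unlock(&purge_sys->latch);
```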

Anyway, great idea, thanks!
[29 May 2024 10:22] MySQL Verification Team
Thank you, Jakub.
[3 Oct 2024 8:25] Jakub Lopuszanski
Posted by developer:
 
Even though the patch for this bug, inspired by this contribution, got approved in code review, the final testing revealed one big problem with this approach.

It turns out that in some situations InnoDB does not update the page_max_trx_id of a secondary index page, even though the page *is* modified.
It is difficult to identify the whole scope of this problem, i.e. to compactly describe under what circumstances it can happen - this is because the logic of bumping page_max_trx_id is not tightly coupled with the logic of modifying secondary index pages, so one has to tediously look at all the places which modify a secondary page and try to find a call to the method which updates page_max_trx_id somewhere around it.
Needless to say, even if one did such an analysis, it would be quite error prone, and not future proof - there's nothing preventing a developer from adding new code which modifies a secondary index page, or removing code which bumps page_max_trx_id, as the two are pretty much independent actions.
"The reason" why they are independent, seems to be that "conceptually" page_max_trx_id is a part of locking system, as opposed to a part of "B-tree page modification system". In other words, it was meant as a heuristic to quickly determine if an implicit lock is being held on a secondary index record without having to perform an expensive dive to primary key and undo log chain.
This in turn means, that places which do not care about locking at all in our code base, often do not care about page_max_trx_id neither.

One such place, thankfully caught in testing, is the mysql.innodb_ddl_log table, which tracks "a todo list of cleanup steps needed if a DDL operation fails".
This table *is* undo logged, so we do run purge for it. However, modifications to this table are done using the BTR_NO_LOCKING_FLAG flag (see DDL_Log_Table::insert and DDL_Log_Table::remove). When this flag is in use, the locking system is not involved at all and/or some of the lock system calls return immediately without bumping page_max_trx_id.

There are many other places which use BTR_NO_LOCKING_FLAG.

One may wonder how this can be correct? After all page_max_trx_id seems to be used not only for determining implicit locks, but also as a heuristic in MVCC scans, to avoid dives to primary key.
I think the answer must be that it is because we use BTR_NO_LOCKING_FLAG only for tables which are never accessed in parallel from more than one transaction - which is perhaps proved case-by-case by referring to various properties ("temporary tables are seen just by 1 session") or mutexes ("persistent table metadata is being modified under protection of a global mutex"). Honestly, I don't know the full list of use cases, nor have I seen a comprehensive proof of correctness.

Anyway, the bottom line is that the part of the heuristic which relies on page_max_trx_id to infer that no (undo-logged) modifications happened after a given transaction id might err on the unsafe side.

This contribution also contains another heuristic: looking at the delete-mark flag. I think this part is fine. Perhaps we will push just this part.

One way to take advantage of `page_max_trx_id` heuristic, would be to force any operation on secondary index page to bump it. I think this shouldn't be a big problem for performance - the page in question is already in BP and X-latched, and the redo log record to describe it is just a few bytes. However this looks like a rather big change in code, and thus will have to wait.
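
For concreteness, a forced bump at a single modification site might look like the sketch below (page_get_max_trx_id and page_update_max_trx_id are the existing helpers quoted later in this thread; wiring such a bump into every secondary-index modification path is the actual open problem):

```cpp
// Sketch: after a mini-transaction modifies a secondary index page on behalf
// of transaction trx_id, unconditionally maintain the upper-bound invariant.
if (page_get_max_trx_id(buf_block_get_frame(block)) < trx_id) {
  page_update_max_trx_id(block, buf_block_get_page_zip(block), trx_id, &mtr);
}
```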
[3 Oct 2024 9:18] MySQL Verification Team
Thank you Jakub for this detailed analysis.
[17 Oct 2024 10:01] Zongzhi Chen
First of all, I acknowledge that the delete-mark-flag optimization should indeed provide some performance improvement, and I believe the other aspects should also work as expected. If there is an mtr test failure, we can investigate which test failed to pinpoint the issue.

However, I have a question. Currently, in row_search_mvcc(), the function lock_sec_rec_cons_read_sees() uses max_trx_id for visibility filtering, which is similar to the proposed logic here. If this logic has an issue, then row_search_mvcc() would encounter the same problem.

So, I investigated why row_search_mvcc() can use max_trx_id for visibility filtering without encountering issues.

After grepping through the code, I summarized that the use of BTR_NO_LOCKING_FLAG mainly appears in the following four scenarios:

1. In DDL operations: since the pages involved in the DDL process are not accessed by other transactions, BTR_NO_LOCKING_FLAG can be used when inserting data.
2. During transaction rollback: since the corresponding record lock was already acquired during the normal execution phase, there is no need to reacquire it when rolling back the insertion, so BTR_NO_LOCKING_FLAG can be used.
3. In B-tree update operations, specifically in the function row_upd_sec_index_entry_low(): since the record lock is already acquired during the search phase, there is no need to acquire it again during the update phase, so BTR_NO_LOCKING_FLAG can be used for the actual update.
4. For temporary tables: since only the current session accesses the table, no locking is needed, and BTR_NO_LOCKING_FLAG can be used.

I understand that in these scenarios, BTR_NO_LOCKING_FLAG is used mainly because the page won’t be accessed by other transactions, or because the record lock has already been acquired earlier, so there’s no need to acquire it again.

In theory, row_search_mvcc() could hit the same issues, but all these scenarios prevent concurrent access to the table, so it does not. If that is the case, I believe undo purge could follow similar logic. If there are concerns, we could assign a very large default value to the max_trx_id field when generating a new page in the B-tree, which should resolve the issue.
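
To illustrate that last idea as a sketch (hypothetical; current_max_trx_id() is a placeholder for reading the system-wide maximum transaction id, not an existing InnoDB function):

```cpp
// Hypothetical: when a fresh secondary index leaf page is created, seed its
// PAGE_MAX_TRX_ID pessimistically instead of leaving it at zero, so that a
// later modification which forgets to bump it cannot leave the stored value
// below the id of a transaction whose change is already on the page.
page_update_max_trx_id(new_block, new_page_zip, current_max_trx_id(), &mtr);
```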
[17 Oct 2024 10:42] MySQL Verification Team
Hi Mr. Chen,

Our Development team is the only one who can approve or disapprove a fix.
[16 May 8:05] Jakub Lopuszanski
Posted by developer:
 
Today, I've realized one more problem, when reading this fragment from ibuf_build_entry_from_ibuf_rec_func(..) in ibuf0ibuf.cc:
```
 /* Prevent an ut_ad() failure in page_zip_write_rec() by
 adding system columns to the dummy table pointed to by the
 dummy secondary index.  The insert buffer is only used for
 secondary indexes, whose records never contain any system
 columns, such as DB_TRX_ID. */
```
The InnoDB change buffer (aka insert buffer or IBUF) is a mechanism which lets us delay operations on secondary indexes by stashing the todo items into a single global B-tree (the ibuf tree in the system tablespace). These todo items are executed lazily, and they do not bother to remember the trx id associated with them, so when they update a secondary index page, they cannot bump the page_max_trx_id even if they wanted to.
[16 May 8:12] Jakub Lopuszanski
Posted by developer:
 
Looks like I've missed another mechanism at play in ibuf_merge_or_delete_for_page(..):
```
      max_trx_id = page_get_max_trx_id(page_align(rec));
      page_update_max_trx_id(block, page_zip, max_trx_id, &mtr);
```

where rec is a record from ibuf, and block is a page from the secondary index which we have just read in from disc, and need to apply the change described in rec.
As we can see, we *do* bump the page_max_trx_id, but instead of using the real id of the transaction which made the original change, we "err on the safe side", by using maximum from the whole ibuf page which contained the rec.
[19 May 9:36] Yichang SONG
Hello everyone,

I have some thoughts on why using page_max_trx_id in the row_search_mvcc path is safe.

My understanding is that, as has been discussed, the page_max_trx_id stored on a secondary index page is not always updated to reflect the maximum transaction ID of all records currently residing on that page (for example, in scenarios involving the BTR_NO_LOCKING_FLAG or updates applied via the Change Buffer). However, it is also true that the stored page_max_trx_id is a lower bound, meaning stored_page_max_trx_id <= real_page_max_trx_id.

In the context of row_search_mvcc, the function lock_sec_rec_cons_read_sees() is used as a heuristic. All the callers follow the pattern:
```
if (lock_sec_rec_cons_read_sees() == false){
  // fast path
} else {
  // slow path
  // access clustered record to get an accurate mvcc state.
}
```

'if (!lock_sec_rec_cons_read_sees)' is equivalent to 'if (read_view.up_limit_id <= stored_page_max_trx_id)'. Given that stored_page_max_trx_id <= real_page_max_trx_id, this also implies 'read_view.up_limit_id <= real_page_max_trx_id',
so the logic 'if (!lock_sec_rec_cons_read_sees) { do something fast }' is fine.

In contrast, logic like this:
```
if (lock_sec_rec_cons_read_sees == true){ // equals to if (read_view.up_limit_id > stored_page_max_trx_id)
  // fast path
} else {
  // slow path
}
```
might be incorrect. This is because, as I stated, there is only a less-than-or-equal-to partial order relationship between stored_page_max_trx_id and real_page_max_trx_id, and based only on the condition read_view.up_limit_id > stored_page_max_trx_id, it's impossible to definitively determine the relationship between read_view.up_limit_id and real_page_max_trx_id. A decision made in the "fast path" of this structure (the if block) might be based on a misleadingly low stored_page_max_trx_id, potentially leading to an incorrect action if there is no subsequent full check to correct it.
[19 May 9:50] Jakub Lopuszanski
Hello, Yichang SONG
> In the context of row_search_mvcc, the function lock_sec_rec_cons_read_sees() is used as a heuristic. All the callers follow the pattern:
> ```
> if (lock_sec_rec_cons_read_sees() == false){
>   // fast path
> } else {
>   // slow path
>   // access clustered record to get an accurate mvcc state.
> }
> ```

I am surprised by this, as I thought the pattern is the opposite.
For example, this is how I interpret this fragment of code:

```
    } else if (!srv_read_only_mode &&
               !lock_sec_rec_cons_read_sees(rec, index, node->read_view)) {
      cons_read_requires_clust_rec = true;
    }
```

The lock_sec_rec_cons_read_sees(rec, index, view) checks if the read view can see the effects of all the transactions with id <= [PAGE_MAX_TRX_ID].
The idea is that if it can, and [PAGE_MAX_TRX_ID] is larger than or equal to any trx id which modified a row on the page, then we will never have to construct an older version of the row, so we don't need to consult the clustered index.
AFAICT, this mechanism only works correctly if [PAGE_MAX_TRX_ID] is indeed larger than any trx id which modified the page.
Consider a case where [PAGE_MAX_TRX_ID]==10, and trx id 20 has modified the secondary page but hasn't bumped [PAGE_MAX_TRX_ID], and then another transaction whose read view is meant to see "everything up to and including trx id 12, and also 17" processes the record generated by trx id 20:
- lock_sec_rec_cons_read_sees() would read [PAGE_MAX_TRX_ID]==10 from the page
- lock_sec_rec_cons_read_sees() would read 12 from the view as the highest number up to which everything is seen
- it would return true, as 10 <= 12, i.e. it would report that this view should see everything mentioned on the secondary page as is
But that would be wrong, right?
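
To restate the counterexample as a self-contained model (not InnoDB code; sees_all_on_page() is an illustrative stand-in for the heuristic):

```cpp
#include <cassert>

using trx_id_t = unsigned long;

// Simplified model: the heuristic claims the view sees everything on the page
// iff it sees all trx ids up to the *stored* PAGE_MAX_TRX_ID.
bool sees_all_on_page(trx_id_t stored_page_max_trx_id, trx_id_t up_limit_id) {
  return stored_page_max_trx_id < up_limit_id;
}

int main() {
  const trx_id_t stored_page_max_trx_id = 10;  // stale: trx 20 modified the page without bumping it
  const trx_id_t record_trx_id = 20;           // actual writer of the record
  const trx_id_t up_limit_id = 12;             // the view sees all ids below 12 (plus 17)
  // The heuristic reports that the view sees everything on the page as is...
  assert(sees_all_on_page(stored_page_max_trx_id, up_limit_id));
  // ...yet the record's writer is not visible to the view: the conclusion is
  // wrong whenever the stored value is not a true upper bound.
  assert(!(record_trx_id < up_limit_id));
  return 0;
}
```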

Am I missing something?

(Btw, note my later correction to the Change Buffer topic - it looks like it works correctly, although the value it stores in [PAGE_MAX_TRX_ID] might be larger than needed)
[19 May 11:25] Yichang SONG
Hi Jakub,
Yes, you are right, I was totally wrong, my bad :(. And yes, neither real_max_trx_id >= stored_max_trx_id nor real_max_trx_id <= stored_max_trx_id can be guaranteed.
[22 May 8:01] Jakub Lopuszanski
Thanks for confirming my understanding - that's very useful for me to have independent validation of my reasoning, as this topic is very complicated.

Also, it's nice to see there are still people with the courage to admit a mistake on the internet.

As for this bug report, I think IBUF/Change Buffer is not a problem, because while being a bit pessimistic (using a value larger than needed, erring on the safe side) it still makes sure the page's max trx id is a correct upper bound, which is sufficient for this contribution.

The problem, then, is all the places which simply don't bump it at all, because they do not use row-level locking and do not expect anyone else to use MVCC, believing for one reason or another that access to the table is exclusive.

The code-reviewed variant of the patch looks like this:
```
static bool row_purge_poss_sec_fast(btr_cur_t *btr_cur, purge_node_t *node,
                                    dict_index_t *index,
                                    const dtuple_t *entry) {
  /* We know from a separate reasoning that row_purge_poss_sec() should only
  return true for delete-marked records. It is much cheaper to check this flag
  on the record in hand, than it is to perform a dive into clustered index and
  undo log chain traversal. So, we start with this simple check: */
  auto rec = btr_cur_get_rec(btr_cur);
  if (!rec_get_deleted_flag(rec, dict_table_is_comp(index->table))) {
    /* The fast path returns here, without checking what the slow path would do.
    In the past we had various bugs in the logic of slow path which resulted in
    deciding to delete the record which wasn't even delete marked. To guard
    against this kind of a regression, in debug mode we still check that
    row_purge_poss_sec()==true implies rec_get_deleted_flag()==true, i.e.
    rec_get_deleted_flag()==false implies row_purge_poss_sec()==false. */
    ut_ad(!row_purge_poss_sec(node, index, entry));
    return false;
  }
  /* We know from a separate reasoning, that a secondary index record is no
  longer needed, if there is no active read view which still sees this record as
  not deleted. Also, the record is currently delete-marked, which means that if
  it was ever non-delete-marked, it had to be before the latest modification to
  this page. We also know, that purge_sys->view is constructed as a copy of the
  oldest active read view. If this view can see effects of a given transaction
  id, then all other (fresher) read-views will also see the effect of it.
  Further, we know that the highest trx id which did any changes to this
  secondary index page is page_get_max_trx_id(page_align(rec)). Combining all
  this together we get following observation: if the purge_sys->view can see
  effects of all transactions with ids smaller or equal to
  page_get_max_trx_id(page_align(rec)), then this record can be removed. */
  const trx_id_t page_max_trx_id = page_get_max_trx_id(page_align(rec));
  rw_lock_s_lock(&purge_sys->latch, UT_LOCATION_HERE);
  const bool can_remove = purge_sys->view.sees(page_max_trx_id);
  rw_lock_s_unlock(&purge_sys->latch);
  if (can_remove) {
    /* We've already checked the row is delete marked, which eliminates some
    class of bugs. But, we could err in the other direction: perhaps the fast
    path is too eager to clean up rows, which the slow path would say are still
    needed? Thus, we check the implication that
    row_purge_poss_sec()==false implies we don't remove the record, i.e.
    we remove the record implies row_purge_poss_sec()==true.
    There is a subtle edge case though: when InnoDB INSERTs a row to a table,
    it first inserts it to the primary index, and only then to secondary
    indexes, where each of these is a separate mtr and thus releases latches on
    pages in between. So, there exists an LSN at which the physical state of
    the table is such that the primary index already has the non-delete-marked
    record, but the secondary index was not yet updated, so might still contain
    delete-marked record and PAGE_MAX_TRX_ID lower than the active transaction
    performing the INSERT. At this moment, row_purge_poss_sec(..) would declare
    the secondary index record as needed, because it would already see the
    matching record in the primary index. Yet, our heuristic, would see
    PAGE_MAX_TRX_ID to be still low enough for the purge_sys->view to see all
    changes up to and including it. It is still safe to physically remove the
    secondary record in this case, because the transaction performing INSERT
    will re-insert the secondary index record as soon as it gets its chance to
    latch the page after us. */
#ifdef UNIV_DEBUG
    if (!row_purge_poss_sec(node, index, entry)) {
      /* See explanation above - this is only possible if the latest version of
      primary index record was inserted by a currently active transaction, which
      is in the middle of INSERT. To confirm that explanation, we check that the
      most recent version of the record is non-delete-marked and was modified by
      a transaction which is currently active, and that the PAGE_MAX_TRX_ID is
      less than its id, as it hasn't yet touched the secondary index. Finally,
      we assert that none of the earlier versions of the primary index record
      matches the secondary index record, by passing also_curr=false to
      row_vers_old_has_index_entry(). */
      mtr_t mtr;
      mtr_start(&mtr);
      const bool found_clust =
          row_purge_reposition_pcur(BTR_SEARCH_LEAF, node, &mtr);
      ut_a(found_clust);
      const rec_t *clust_rec = node->pcur.get_rec();
      const bool comp = page_rec_is_comp(clust_rec);
      ut_a(!rec_get_deleted_flag(clust_rec, comp));
      const dict_index_t *clust_index = node->pcur.index();
      ut_a(clust_index->is_clustered());
      Rec_offsets rec_offsets{};
      const trx_id_t trx_id = row_get_rec_trx_id(
          clust_rec, clust_index, rec_offsets.compute(clust_rec, clust_index));
      ut_a_lt(page_max_trx_id, trx_id);
      ut_a(trx_rw_is_active(trx_id, false));
      ut_a(!row_vers_old_has_index_entry(false, clust_rec, &mtr, index, entry,
                                         node->roll_ptr, node->trx_id));
      node->pcur.commit_specify_mtr(&mtr);
    }
#endif
    return true;
  }
  return row_purge_poss_sec(node, index, entry);
}
```
[22 May 8:01] Jakub Lopuszanski
I see the following ways forward with this patch:

1. Trim it down to only the part which doesn't look at the page max trx id, and focus on the other heuristic which only looks at the delete mark, i.e. if the record is not delete-marked then don't bother checking whether it can be purged, because of course it can't be. I am not sure how much gain this offers, as I don't see how this scenario could arise frequently enough in a real-world system to be a problem worth optimizing.

What would help is if someone can demonstrate a natural-looking example of such workload, AND that a patch like that helps to move the needle on some important metric like Transactions Per Second, or Undo Log History Length/Purge Speed.

Without a motivating example, there's no point in complicating the code further.

2. Keep both heuristics, and augment all the places which "forgot" to bump the page max trx id, to actually bump it. 

What would help here is a contribution which actually does that in an elegant way, because, for now, I have no good idea how to do that.
The main difficulty is that, as of now, this bumping is done by the Lock System and thought of as part of the Lock System (to speed up checks for implicit locks), but it is also used in MVCC (to speed up visibility checks), and yet the responsibility to bump it at all is somehow in the hands of upper layers, which can simply decide not to do it. Quite messy.
So changing this might require some change in vision/design/contract - perhaps we should simply enforce the rule that the page max trx id of a secondary index page *must* be an upper bound on the ids of the transactions which modified the page, but it is still not clear where exactly in the code we could enforce it.

3. Keep both heuristics, and do not fix any other place; instead add some more heuristics to handle tables which are "not consistently candid" about page max trx id.

What would help here is if someone could propose a patch which specifies the condition that correctly captures all the cases where we use BTR_NO_LOCKING_FLAG or any other technique to avoid bumping page max trx id, together with a proof (ideally expressed as asserts in code) that this is a sufficient rule.