MySQL Bugs: #109599: Support creating snapshot based on SCN

Bug #109599	Support creating snapshot based on SCN
Submitted:	12 Jan 2023 8:48	Modified:	7 Feb 2:30
Reporter:	zhai weixiang (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	Contribution

Description:
As we all know, InnoDB takes a snapshot by acuqring trx_sys mutex and then copying active transaction ids array. The implementation is simple, but
it also leads to mutex contention between read workload and write load, especially when there's many read-write transactions.

SCN, aka system commit number, can be an opition of taking snapshot. It avoids contention from read-write transaction,
though it has a problem which calls record cleanout(writing scn into records).  But It can be done in a lazy way(such as writing back scn by background thread)

I'll proposal a proof-of-concept patch so experts from oracle can evaluate feasibility of the feature request 

How to repeat:
N/A

Suggested fix:
I'll attach a patch later

a proof-of-concept patch, not fully tested (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_snapshot.diff (application/octet-stream, text), 126.79 KiB.

commit log:

What's SCN
=====
SCN is a sequence number which was generated while committing
transaction, and it should always be increased. With SCN, the
snapshot for MVCC can only use a version number to compare with
SCN.

To make code simple, we use trx->no as SCN of transaction. During
startup, the current trx id and no is initialized by max id stored
int trx_sys page, while max_trx_id is multiple of 2 and max_scn is not,
so we can distinguish transaction id and SCN stored in record field.
For example, if id on startup is 101, then max_scn is 101 and max_trx_id
is 102. the increasing step of trx id/scn is 2.

Get SCN
=====
While committing transaction, trx->no (aka SCN) is stored in undo header
of transaction.

Before undo log for insert is immediately purged after committing transaction.
In this commit, it'll keep insert undo not cleaned, so that user thread can get chance to read scn from undo header. The purge thread is resposible for removing all undo log.

In each clust record, it stores roll_ptr which points to undo record in undo space.
In order to get undo header address from undo log, we changed the format of undo record
by adding extra 9 bytes in header of undo record:
-- 2 bytes: point to next record
-- 1 bytes: 0xFF, indicate it's new version of undo record (added)
-- 2 bytes: low 2 bytes of trx id, for verification (added)
-- 4 bytes: page number of undo header (added)
-- 2 bytes: offset of undo header inside undo hdr page (added)
...
...

With this change we can find the undo hdr page, and get SCN number from it.

To avoid too frequent reading of undo log, it implementted a scn map which
stores trx_id->SCN, SCN_Map is a lock free structure. and it has two map:
- normal map: add id->scn to it while commtting transaction
- random map: add id->scn to it after extracting a scn from undo log while reading

WriteBack SCN
====
Before committing transaction, the physical record only stores trx id. We don't
fill in scn into record while committing transaction. Insteadly, we added record cursor
to buf_block_t object. The background thread will check in loop. And if trx committes, the
thread will take the cursor and write scn into it

It also stores cursors into buf_block_t while reading record and finding that the stored
id is not SCN.

Snapshot
====
Snapshot now has three fields to implement MVCC:
* For fast checking, up_limit_id and low_limit_id is still preserved, but it's not
always precise and just for fast checking.
* Max SCN value as its version number, all scn bigger than/equal to it is not visible.

Undo Purge
====
We don't prevent purging undo log, even when related SCN is not written back to record.
For insert undo, it just skips the record

Hello zhai,

Thank you for the feature request and contribution.

regards,
Umesh

fix some bugs (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: snapshot.diff (application/octet-stream, text), 134.82 KiB.

Hello Zhai Weixiang!

Interesting idea.

1. Can you share performance test results, preferably for an OLTP RW workload?
2. Does you patch pass mtr test suite?
3. The commit message and code say you store 2-bytes, but actually you store and verify just one due to 0xFF mask:
```
  /* Store low-two bytes of trx id for verification */
  mach_write_to_2(ptr, (0xFF & trx->id));
...
  uint16_t low_id = mach_read_from_2(ptr);
  if (low_id != (id & 0xFF)) {
```
Why do we need this field anyway? Is it to protect against some ABA race where the same Undo Page gets reused by a newer transaction while the DB_ROLL_PTR still point to that location? If so, then why do we check only 2 bytes, and not whole 8-byte long id? Is it ok to be wrong with probability 1/256 here? Would checking 8 bytes be sufficient?
4. Why do we need some extra new threads to modify TRX_ID fields of rows. Why not just let the purge threads do that? Or, perhaps let purge threads update the DB_ROLL_PTRs (either in the row itself or in appropriate place in the Undo Chain) to "null" (say, 1<<55 so it looks like IS_INSERT from nowhere), so that nobody is tempted to dereference them? Wouldn't that make synchronization between "we remove the Undo Log Header and might reuse it for something else" and "we made sure the B-tree no longer points to it" easier?
5. Why do we use two separate atomics for generating IDs and SCNs? Can't that cause a problem if SCN is lagging behind ID and you'll store this small SCN value in the trx_sys header? After restart we need to figure out what was the highest assigned ID value before the crash to properly ignore all smaller values found in record headers so that we don't treat them as Implicit Locks on records (among other things). Why not just have a single atomic counter, and return next_trx_id_or_no.fetch_add(1)*2+1 for SCN or next_trx_id_or_no.fetch_add(1)*2 for IDs?
6. IIUC the value at TRX_UNDO_TRX_NO offset of Undo Log Header is initially 0, and only set to valid trx->no during commit. IIUC `trx_undo_hdr_get_scn()` would return 0 and that would be forwarded by `trx_undo_get_scn()` to `SCN_Mgr::get_scn` which would return `TRX_ID_MAX`, which AFAIU is interpreted by `ReadView::changes_visible()` as "transaction is still active so you can't see its results". But, AFAIU we bump `next_trx_scn` in `trx_scn trx_sys_allocate_trx_no()` which is called *before* we update the Undo Log Header (because how else could we know what value to put in `TRX_UNDO_TRX_NO`. Moreover, we call `trx_scn trx_sys_allocate_trx_no()` even before we latch the Undo Header Page. Yet, the `ReadView::prepare()` does `m_version = trx_sys->get_max_trx_scn()` without any latch, so a ReadView of Trx#2 can be created immediately after a committing transaction Trx#1 has already called `trx_sys_allocate_trx_no()` but before Trx#1 has updated any of its Undo Log Headers. This ReadView of Trx#2 could then conclude that it should not see some modification made by Trx#1 if it happened to read the Undo Header Page too soon. If it later tried again, it would decide it should see it. Can this lead to inconsistent or non-repeatable read? Perhaps this could be fixed by letting the mtr of Trx#1 latch the Undo Header pages first, and only then call `trx_sys_allocate_trx_no()`, then write its value to both Undo Log Headers, and finally commit. This way nobody else could be able to latch, and hence read, the Undo Log Headers.
7. `SpinLock::write_lock()` has a life-lock bug AFAICT. Thankfully, you aren't using this function anywhere in this patch, but I thought I'll let you know in case you are using this utility class somewhere else. The problem would occur if two threads tried to call write_lock() at the same time - they would both wait for each other.
8. `Scn_Map::Elem::read(trx_id_t)` has a data-race when accessing `m_id` without latch. Consider changing it to `std::atomic<trx_id_t>`
9. Why does `LF_Array::get()` use `m_array[idx].compare_exchange_weak(value, 0)` instead of just `m_array[idx].store(0)`?

hello, jakub
Thnak you for your care review :)
bellow are my comments inline

Interesting idea.

1. Can you share performance test results, preferably for an OLTP RW workload?

I wrote the code long time ago and only test on our internal branch. The test the simple:
run two sysbench: one run point-select workload, and another run update-non-index workload
The point-select is about 70w tps vs 96w tps
The update-non-index is about 1.5w tps vs 2.1w tps

It removed hot trx-sys mutex from read view creation so both read/write can benefit from it

2. Does you patch pass mtr test suite?

NO, This is just a prof-of-concept patch, Not completely production ready. Most cases are passed, but there are some failure cases I didn't get time to analyze them :(

3. The commit message and code say you store 2-bytes, but actually you store and verify just one due to 0xFF mask:
```
  /* Store low-two bytes of trx id for verification */
  mach_write_to_2(ptr, (0xFF & trx->id));
...
  uint16_t low_id = mach_read_from_2(ptr);
  if (low_id != (id & 0xFF)) {
```
Why do we need this field anyway? Is it to protect against some ABA race where the same Undo Page gets reused by a newer transaction while the DB_ROLL_PTR still point to that location? If so, then why do we check only 2 bytes, and not whole 8-byte long id? Is it ok to be wrong with probability 1/256 here? Would checking 8 bytes be sufficient?

Yeah It's a bug, you are right , it should use 0xFFFF, actually it's just to avoid reading undo header page but don't 100% guarantee this. even it's wrong and pass the check, it'll verify again while reading the undo hdr

4. Why do we need some extra new threads to modify TRX_ID fields of rows. Why not just let the purge threads do that? Or, perhaps let purge threads update the DB_ROLL_PTRs (either in the row itself or in appropriate place in the Undo Chain) to "null" (say, 1<<55 so it looks like IS_INSERT from nowhere), so that nobody is tempted to dereference them? Wouldn't that make synchronization between "we remove the Undo Log Header and might reuse it for something else" and "we made sure the B-tree no longer points to it" easier?

Yeah using purge thread is a good idea. I was thinking of using seperate thread because:
1. Make code clear and don't make purging too complex (oh real reason is I'm not familiar with purging code :)
2. Perhaps in some cases writing scn back in  row field is not necessary. In other words, we don't need to write every scn into row field if they are not touched by 
reading. Because getting scn costs at most two reading of pages (undo log page and undo hdr page). Using extra threads can collect affected blocks and only deal with these blocks. 

(An optimization not realized yet: track smallest trx id not purged yet and all rows with trx_id smaller than it is always visible and don't need to check scn)

5. Why do we use two separate atomics for generating IDs and SCNs? Can't that cause a problem if SCN is lagging behind ID and you'll store this small SCN value in the trx_sys header? After restart we need to figure out what was the highest assigned ID value before the crash to properly ignore all smaller values found in record headers so that we don't treat them as Implicit Locks on records (among other things). Why not just have a single atomic counter, and return next_trx_id_or_no.fetch_add(1)*2+1 for SCN or next_trx_id_or_no.fetch_add(1)*2 for IDs?

The best way is storing them in two fileds of trx_sys page, But I don't want to break the format of page. So every time while writing to trx_sys header, the bigger one is stored. after restarting both scn/id always starts from the stored value.

related code in trx_sys_write_max_trx_id:
```
-    const trx_id_t max_trx_id = trx_sys->next_trx_id_or_no.load();
+    const trx_id_t max_trx_id = std::max(trx_sys->get_max_trx_id(),
+                                         trx_sys->get_max_trx_scn());
 
     mlog_write_ull(sys_header + TRX_SYS_TRX_ID_STORE, max_trx_id, &mtr);

```

If using only one counter, there must be many value wasted

6. IIUC the value at TRX_UNDO_TRX_NO offset of Undo Log Header is initially 0, and only set to valid trx->no during commit. IIUC `trx_undo_hdr_get_scn()` would return 0 and that would be forwarded by `trx_undo_get_scn()` to `SCN_Mgr::get_scn` which would return `TRX_ID_MAX`, which AFAIU is interpreted by `ReadView::changes_visible()` as "transaction is still active so you can't see its results". But, AFAIU we bump `next_trx_scn` in `trx_scn trx_sys_allocate_trx_no()` which is called *before* we update the Undo Log Header (because how else could we know what value to put in `TRX_UNDO_TRX_NO`. Moreover, we call `trx_scn trx_sys_allocate_trx_no()` even before we latch the Undo Header Page. Yet, the `ReadView::prepare()` does `m_version = trx_sys->get_max_trx_scn()` without any latch, so a ReadView of Trx#2 can be created immediately after a committing transaction Trx#1 has already called `trx_sys_allocate_trx_no()` but before Trx#1 has updated any of its Undo Log Headers. This ReadView of Trx#2 could then conclude that it should not see some modification made by Trx#1 if it happened to read the Undo Header Page too soon. If it later tried again, it would decide it should see it. Can this lead to inconsistent or non-repeatable read? Perhaps this could be fixed by letting the mtr of Trx#1 latch the Undo Header pages first, and only then call `trx_sys_allocate_trx_no()`, then write its value to both Undo Log Headers, and finally commit. This way nobody else could be able to latch, and hence read, the Undo Log Headers.

Yes!!!!YOU ARE RIGHT！！！！
This is a serious problem I didn't count!!! Thank you for pointing out. Your solution looks workable 

7. `SpinLock::write_lock()` has a life-lock bug AFAICT. Thankfully, you aren't using this function anywhere in this patch, but I thought I'll let you know in case you are using this utility class somewhere else. The problem would occur if two threads tried to call write_lock() at the same time - they would both wait for each other.

Yes it's a bug :(，It should fetch_add/ut_delay/fetch_sub in while loop. Luckily this function is not used yet.

8. `Scn_Map::Elem::read(trx_id_t)` has a data-race when accessing `m_id` without latch. Consider changing it to `std::atomic<trx_id_t>`

It should be fine because the element is protected by spinlock

9. Why does `LF_Array::get()` use `m_array[idx].compare_exchange_weak(value, 0)` instead of just `m_array[idx].store(0)`?

This is to avoid cocurrent writing to same slot.

Thanks for your detailed answers.

Ad 3. The scenario I am afraid of is that the Undo Page got freed (added to free list), then re-allocated for some others' transaction Undo Segment, and its content got overwritten. Now, could it happen, that by pure accident the content of this Undo Page got overwritten in such a way, that we will not notice that what we are reading is garbage? In particular, even if we are 'verifying' that some 8 bytes at some position in Undo Header Page look like trx->id, can it happen by pure chance? Say, because one of Undo Records mentions these particular 8 bytes as, say, previous value of DB_TRX_ID field of a B-tree record, and it so happens that this Undo Record is located exactly at the right offset for this mistake to go unnoticed?

Ad 5. Sorry, I've totally missed that you are taking a maximum of the two counters before storing - it's fine then, indeed.

Ad 8. I was referring to the access of `m_id` before accessing the spin lock:
```
+ trx_id_t read(trx_id_t id) {
+ trx_id_t ret = 0;
+ if (m_id != id) { <------------------ HERE
+ /* quick checking */
+ return ret;
+ }
+
+ if (!m_lock.try_read_lock()) {
```

Ad 9. From the comments in this class it looks like we expect only one consumer, so if the `value` is already not zero (i.e. one of producers has already finished writing to `value`) then who else could be modifying it? I think nobody, that's why I'd expect a simple `store`.
(If the idea was to support multiple consumers in future, then indeed this CAS might be helpful for correctnes, but I think this would not be a lock-free implementation, because if one of consumers "goes to sleep" right after `m_array[idx].compare_exchange_weak(value, 0)`, but before `m_consume_index++` then it will stall all the other consumers, as all of them will try to extract the same element over and over again until the original consumer wakes up and increments the `m_consume_index`.)

Hi, jakub

Ad3:
Yeah, we can't 100% guarantee correctness of scn, but we can make correctness approaching to 100%:
- verify two bytes of trx_id (done)
- comparing stored trx_id and given trx_id (done)
- checking table id (value stored in undo header and given table id) (done)
- sanity checking of scn (will add)
- sanity checking of other fileds of undo header (will add)

Ad8
It should be fine even reading a wrong m_id, because it will get scn by reading from undo log later. Dirty reading is cheap and spinlock is more expensive than it.

Ad9
Ohhh! Yes you are right. I checked the code again and the only consumer is cleanout thread in background. So calling store function should be fine 

I'm modifying the patch and plan to fix issues you mentioned above, also fixing some bugs found by running mtr. Will attach a newer version later.

latest version (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: snapshot_80_33.diff (application/octet-stream, text), 136.59 KiB.

Main Changes of latest version:

1. rebase to 8.0.33
2. remove instant scn cursor; Due to testing, it doesn't help RO workload and have side affect of DML workload. 
Now only reading will trigger cleanout of SCN. Code is also more clean comparing to previous version
3. Fix #6 by adding two sets for ReadView
While trying to get scn (get_scn_fast), it'll check if the transaction is active  .If yest, then checking its trx->no. If it's assigned and smaller than version of ReadView,its id/scn is added to set. So next time in changes_visible(),it's still invisible 
4. Fix some bugs found by running mtr
5. Fix bugs mentioned in comments above

bugfix (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: snapshot_80_33.diff (application/octet-stream, text), 139.61 KiB.

rebase to 9.0.1 (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_9_1.diff (text/x-patch), 135.05 KiB.

Thank you for your contribution.

Sincerely,
Umesh

bugfix (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: bug10599.diff (application/octet-stream, text), 135.77 KiB.

latest version not fully tested (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn.diff (application/octet-stream, text), 194.35 KiB.

Main changes of the attached file:
- remove writing back of SCN, as it seems unnecessary, and it can greately simplify the code
- remove usage of trx_sys_t::rw_trx_list, trx_sys_t::rw_trx_ids,
trx_sys_t::serialisation_mutex, trx_sys_t::serialisation_list
Transactions is taken from trx_sys_t::shards, and its size is extented to 512:
shard 0,2,4....510 is used to store active transactions
shard 1,3,5,...511 is used to store committing transactions
clone0repl.cc: gtid list is sharded

It looks like you've put a lot of work into it.
I feel bad I can't do this work justice by reviewing it thoroughly right now.

The problem is that this patch is now huge and will take a lot of work to review and prove correctness.
This is almost impossible to do without documentation explaining what is the high level idea, what are the intended properties and invariants, happens-before relations etc.
I think this patch has some good ideas in it, but I am not sure how to proceed with it given that there's a parallel effort to reimplement MVCC in yet a different way. Looks like only some fragments could be reused.

I only had time now to skim over this patch, just to see how mature it looks at the first glance.
I see some problems with it:

1. the SpinLock implementation looks buggy to me. Or at least I couldn't convince myself that following places are correct
1.1. the `write_lock` does fetch_sub early on, then waits for value to reach 0. This means that if two threads call `write_lock` in parallel, neither of them will ever succeed.
1.1.1 also it doesn't seem to be used anywhere
1.2. the try_read_lock stores result of `fetch_sub` in `uint32_t`, which means it treats negative numbers as positive.
2. I couldn't find any usage of LF_Array
3. It is a bit confusing that the same concept (?) is represented by various names: no, scn, scn_id, version. I think this patch could benefit from using more consistent naming
4. for some reason `ReadView::changes_visible` now takes `buf_block_t*block` argument, but I couldn't find any usage of block
5. the `changes_visible` has some logic to handle a case where the id stored in the row is already an scn, but I could not figure out how that could happen
6. Scn_Map seems to have constant size of array, but instead of using constexpr the code uses define, and then dynamically assigns it to a non-constexpr field, m_size, which in turn means that `%m_size` operation requires a rather slow division operation instead of being optimized to bitwise and

Taken together this gives me an impression of "Proof of Concept". Don't get me wrong - this is already great, but it looks like more work will be needed.

IIUC the idea is to add 2+4+4 bytes to every undo log record so that we can quickly locate the header. I am afraid this might increase the amount of data to be written too much, and result not only in Undo Log Tablespace size increase, but more importantly: raises the amount of bytes the io has to write to undo log and to redo log. What is the impact on Transactions Per Second metric?

I wonder if there is a way to get rid of these 2+4+4 bytes by the following observation (perhaps wrong, I haven't thought it through): the roll_ptr contains enough info to find the page. A single Undo Log Segment consists of 1 or more pages, of which the 1st page can be shared by multiple short Undo Logs, but the later pages (if any) have to be the part of the last Undo Log from the 1st page. So, if we could somehow deduce if the page we look at is "the 1st" or "later", then we could find its header in following way.
If it is "the 1st" then we are already at the right page, we "just" need to find the right header on it. One approach is to jump through TRX_UNDO_NEXT_LOG links between headers. I am not sure how slow this would be in practice: a single undo log header is rather large (due to XID Info, GTID etc.) so I doubt there can be more than 100 Undo Logs in single page, and I don't think a loop of 100 comparisons is very slow (also, we can add all the headers we iterated over to Scn_map cache, so we never iterate over them again).
If it is "later", then all we need is to somehow find the 1st page. The slow approach is to use FLST_PREV links between Undo Log Pages. But maybe we could add a new header field to Undo Log Page which lets us jump directly to 1st page?
The benefit here is that this is just one (4byte long) field per page, as opposed to per-Undo Log Record.

My biggest worry with this patch is the slow-path. AFAIU in case there is one long running transaction (say, several minutes) while there is a lot of shorter transactions (say 10k / second), we might be in a situation where answering the visibility question can't be answered by cheap comparisons, and will require consulting the Scn_maps, and if they are missing the info, then we might need to read the undo log page pointed by the roll pointer, and then the one with the undo log header (hopefully for short transactions it is the same page). IIUC you minimize the chance of this happening by having two hashmaps: one which should contain info about most recently committed transactions, and another which caches answers for those even older. Given the size of the cache (1M entries?) indeed it looks promising for most cases, but I'd like to see some numbers and how it handles worst case (a *very* long running transaction and *a lot* of concurrent shorter ones).

I see there are changes to GTID persistor, but I haven't looked into them. This is in itself a complicated topic and I'd have to recall all the necessary properties for it to work correctly.

I see some changes to trx_undo_insert_header_reuse, which worries me, as I think it is important for performance that InnoDB reuses undo logs in particular way. I'd need more time to think it through.

I wish I had more time to review all of that.

I've tried to apply this patch to mysql-trunk (@339b57bb1f6c), and run mtr tests on mysqld built with -DWITH_ASAN=1 -DWITH_DEBUG=1.

Initially it failed during initialization, because `MVCC::view_close(..,own_mutex)` calls `ut_ad(validate(slot))` without holing trx_sys->mutex yet `MVCC::validate` still asserts this mutex should be held.
Also, I think the call to `validate` should be done before `exit(slot)`.
Thus, I've removed this particular assertion, and moved validate() before exit().
It still fails on other asserts, such as:
```
#0  0x00007fe9c20c9ac5 in pthread_kill () from /lib64/libpthread.so.0
#1  0x0000000005caba34 in my_write_core (sig=sig@entry=6) at mysql/mysys/stacktrace.cc:340
#2  0x00000000038cf187 in handle_fatal_signal (sig=6, info=0x7fe9b55ef270, ucontext=0x7fe9b55ef140) at mysql/sql/signal_handler.cc:407
#3  <signal handler called>
#4  0x00007fe9bfd2a5ef in raise () from /lib64/libc.so.6
#5  0x00007fe9bfcfde65 in abort () from /lib64/libc.so.6
#6  0x00000000038cf414 in my_server_abort () at mysql/sql/signal_handler.cc:463
#7  0x0000000005c9e66c in my_abort () at mysql/mysys/my_init.cc:262
#8  0x000000000630434f in ut_dbg_assertion_failed (expr=expr@entry=0x917ca60 "trx_id != 0", file=file@entry=0x917c9e0 "mysql/storage/innobase/include/trx0sys.h", line=line@entry=364) at mysql/storage/innobase/ut/ut0dbg.cc:100
#9  0x0000000005f43947 in trx_get_shard_no (trx_id=trx_id@entry=0) at mysql/storage/innobase/include/trx0sys.h:364
#10 0x0000000005f439c1 in trx_sys_t::get_shard_by_trx_id (this=<optimized out>, trx_id=trx_id@entry=0) at mysql/storage/innobase/include/trx0sys.h:634
#11 0x00000000060c63b7 in trx_rw_is_active (trx_id=trx_id@entry=0, do_ref_count=do_ref_count@entry=true) at mysql/storage/innobase/include/trx0sys.ic:218
#12 0x00000000060c6756 in SCN_Mgr::get_scn_fast (this=0x6060000b7860, id=0, committing_version=committing_version@entry=0x0) at mysql/storage/innobase/read/read0read.cc:859
#13 0x00000000065f0c51 in dd_open_table_one<dd::Table> (client=client@entry=0x613000013d00, table=table@entry=0x61e00005c4a0, norm_name=norm_name@entry=0x7fe9b55f1570 "mysql/dd_properties", dd_table=dd_table@entry=0x619000085b58, thd=thd@entry=0x62700000a900, fk_list=std::deque with 0 elements) at mysql/storage/innobase/dict/dict0dd.cc:5210
#14 0x00000000065f19cf in dd_open_table<dd::Table> (client=client@entry=0x613000013d00, table=<optimized out>, norm_name=norm_name@entry=0x7fe9b55f1570 "mysql/dd_properties", dd_table=dd_table@entry=0x619000085b58, thd=thd@entry=0x62700000a900) at mysql/storage/innobase/dict/dict0dd.cc:5435
#15 0x0000000005daa7a5 in ha_innobase::open (this=0x618000016cb0, name=<optimized out>, open_flags=<optimized out>, table_def=<optimized out>) at mysql/storage/innobase/handler/ha_innodb.cc:7565
#16 0x0000000003b9d439 in handler::ha_open (this=0x618000016cb0, table_arg=table_arg@entry=0x61e00005c4a0, name=<optimized out>, mode=mode@entry=2, test_if_locked=<optimized out>, table_def=<optimized out>) at mysql/sql/handler.cc:2875
#17 0x00000000037c2ca2 in open_table_from_share (thd=thd@entry=0x62700000a900, share=0x619000085cb0, alias=<optimized out>, db_stat=db_stat@entry=39, prgflag=prgflag@entry=8, ha_open_flags=<optimized out>, outparam=<optimized out>, is_create_table=<optimized out>, table_def_param=<optimized out>) at mysql/sql/table.cc:3218
#18 0x0000000003314999 in open_table (thd=thd@entry=0x62700000a900, table_list=table_list@entry=0x61b000011188, ot_ctx=ot_ctx@entry=0x7fe9b55f2780) at mysql/sql/sql_base.cc:3468
#19 0x0000000003315c28 in open_and_process_table (thd=thd@entry=0x62700000a900, lex=<optimized out>, tables=tables@entry=0x61b000011188, counter=counter@entry=0x7fe9b55f28f0, prelocking_strategy=prelocking_strategy@entry=0x7fe9b55f2850, has_prelocking_list=<optimized out>, ot_ctx=<optimized out>) at mysql/sql/sql_base.cc:5147
#20 0x000000000331689d in open_tables (thd=thd@entry=0x62700000a900, start=start@entry=0x7fe9b55f2940, counter=counter@entry=0x7fe9b55f28f0, flags=flags@entry=18434, prelocking_strategy=prelocking_strategy@entry=0x7fe9b55f2850) at mysql/sql/sql_base.cc:5972
#21 0x000000000333157e in open_tables (thd=<optimized out>, tables=tables@entry=0x7fe9b55f2940, counter=counter@entry=0x7fe9b55f28f0, flags=flags@entry=18434) at mysql/sql/sql_base.h:455
#22 0x0000000005c47e78 in dd::Open_dictionary_tables_ctx::open_tables (this=this@entry=0x7fe9b55f2ae0) at mysql/sql/dd/impl/transaction_impl.cc:107
#23 0x0000000005b0ed50 in dd::tables::DD_properties::init_cached_properties (this=this@entry=0x61a000056a80, thd=thd@entry=0x62700000a900) at mysql/sql/dd/impl/tables/dd_properties.cc:143
#24 0x0000000005b0f159 in dd::tables::DD_properties::unchecked_get (this=this@entry=0x61a000056a80, thd=thd@entry=0x62700000a900, key="DD_VERSION", value=value@entry=0x7fe9b55f2d60, exists=exists@entry=0x7fe9b55f2e20) at mysql/sql/dd/impl/tables/dd_properties.cc:229
#25 0x0000000005b0f591 in dd::tables::DD_properties::get (this=this@entry=0x61a000056a80, thd=thd@entry=0x62700000a900, key="DD_VERSION", value=value@entry=0x7fe9b55f2e90, exists=exists@entry=0x7fe9b55f2e20) at mysql/sql/dd/impl/tables/dd_properties.cc:263
#26 0x0000000005858f00 in dd::initialize_dd_properties (thd=thd@entry=0x62700000a900) at mysql/sql/dd/impl/bootstrap/bootstrapper.cc:1162
#27 0x000000000586104b in dd::bootstrap::restart_dictionary (thd=0x62700000a900) at mysql/sql/dd/impl/bootstrap/bootstrapper.cc:941
#28 0x0000000003ad307c in bootstrap::handle_bootstrap (arg=arg@entry=0x7ffd4511e0b0) at mysql/sql/bootstrap.cc:340
#29 0x000000000709ea20 in pfs_spawn_thread (arg=0x614000001e60) at mysql/storage/perfschema/pfs.cc:3067
#30 0x00007fe9c20c21da in start_thread () from /lib64/libpthread.so.0
#31 0x00007fe9bfd158d3 in clone () from /lib64/libc.so.6
```
which I believe happens because we need to access DD tables before you initialize purge_sys->min_up_id.

I think this means you have not tried to test this patch on a debug build.

As a quick and dirty fix to get things going I've changed:
```
-  if (id < purge_sys->min_up_id.load(std::memory_order_relaxed)) {
+  if (id < purge_sys->min_up_id.load(std::memory_order_relaxed) || id == 0) {
```
which made most of the tests pass.

Among failing tests is:
1. innodb_bug16417635.test (I have not investigated what's the root cause, but the test has very strict opinion about what should be the content of the Buffer Pool after undo log purge, so perhaps there's some issue with the new way purge operates)

2. 
```
2025-01-21T12:53:26.290277Z 1 [ERROR] [MY-012988] [InnoDB] Thread 139666895415040 already owns a latch PURGE_SYS_PQ at level 46 (SYNC_PURGE_QUEUE ), which is at a lower/same level than the requested latch: 51 (SYNC_TRX_SYS_SHARD). Mutex PURGE_SYS_PQ created trx0purge.cc:233 addr: 0x618000014358 acquired: trx0trx.cc:1522
2025-01-21T12:53:26.290339Z 1 [ERROR] [MY-012986] [InnoDB] Latches already owned by this thread:
2025-01-21T12:53:26.290386Z 1 [ERROR] [MY-012987] [InnoDB] UNDO_SPACE_RSEG -> 72 (SYNC_UNDO_SPACE_RSEG)
2025-01-21T12:53:26.290422Z 1 [ERROR] [MY-012987] [InnoDB] PURGE_SYS_PQ -> 46 (SYNC_PURGE_QUEUE)
2025-01-21T12:53:26.290451Z 1 [ERROR] [MY-013183] [InnoDB] Assertion failure: sync0debug.cc:599 thread 139666895415040
...
/storage/innobase/sync/sync0debug.cc:197
#12 0x000000000625a2f8 in LatchDebug::check_order (this=this@entry=0x60f000007860, latch=latch@entry=0x7f06bbaff720, level=level@entry=SYNC_TRX_SYS_SHARD) at mysql/storage/innobase/sync/sync0debug.cc:771
#13 0x0000000006272288 in LatchDebug::lock_validate (this=this@entry=0x60f000007860, latch=latch@entry=0x7f06bbaff720, level=<optimized out>) at mysql/storage/innobase/sync/sync0debug.cc:233
#14 0x000000000625a90f in sync_check_lock_validate (latch=latch@entry=0x7f06bbaff720) at mysql/storage/innobase/sync/sync0debug.cc:1025
#15 0x0000000005ddafe2 in MutexDebug<TTASEventMutex<GenericPolicy> >::enter (this=this@entry=0x7f06b690b010, mutex=mutex@entry=0x7f06b690b000, name=name@entry=0x929bfe0 "mysql/storage/innobase/trx/trx0trx.cc", line=line@entry=1476) at mysql/storage/innobase/include/sync0policy.ic:64
#16 0x0000000005ddb093 in GenericPolicy<TTASEventMutex<GenericPolicy> >::enter (this=this@entry=0x7f06b690b010, mutex=..., filename=filename@entry=0x929bfe0 "mysql/storage/innobase/trx/trx0trx.cc", line=line@entry=1476) at mysql/storage/innobase/include/sync0policy.h:289
#17 0x0000000005dfdfc4 in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=this@entry=0x7f06b690b000, n_spins=<optimized out>, n_delay=<optimized out>, name=name@entry=0x929bfe0 "mysql/storage/innobase/trx/trx0trx.cc", line=line@entry=1476) at mysql/storage/innobase/include/ib0mutex.h:623
#18 0x0000000005dfe169 in mutex_enter_inline<PolicyMutex<TTASEventMutex<GenericPolicy> > > (m=<optimized out>, loc=...) at mysql/storage/innobase/include/ut0mutex.h:113
#19 0x0000000005f5caac in IB_mutex_guard::IB_mutex_guard (this=<optimized out>, in_mutex=<optimized out>, location=...) at mysql/storage/innobase/include/ut0mutex.h:135
#20 0x00000000062e2b11 in ut::Guarded<Trx_by_id_with_min, (latch_id_t)79>::latch_and_execute<trx_add_to_serialisation_list(trx_t*)::<lambda(Trx_by_id_with_min&)> >(struct {...} &&, const ut::Location &) (this=0x7f06b690b000, f=..., loc=...) at mysql/storage/innobase/include/ut0guarded.h:50
#21 0x00000000062e2d99 in trx_add_to_serialisation_list (trx=trx@entry=0x7f06ba7f6118) at mysql/storage/innobase/trx/trx0trx.cc:1472
#22 0x00000000062e3c65 in trx_serialisation_number_get (trx=trx@entry=0x7f06ba7f6118, redo_rseg_undo_ptr=<optimized out>, temp_rseg_undo_ptr=<optimized out>) at mysql/storage/innobase/trx/trx0trx.cc:1524
#23 0x00000000062e41f0 in trx_write_serialisation_history (trx=trx@entry=0x7f06ba7f6118, mtr=mtr@entry=0x7f06bbb001d0) at mysql/storage/innobase/trx/trx0trx.cc:1597
#24 0x00000000062e5494 in trx_commit_low (trx=trx@entry=0x7f06ba7f6118, mtr=0x7f06bbb001d0) at mysql/storage/innobase/trx/trx0trx.cc:2168
#25 0x00000000062e57f5 in trx_commit (trx=trx@entry=0x7f06ba7f6118) at mysql/storage/innobase/trx/trx0trx.cc:2245
#26 0x00000000062e5e09 in trx_commit_for_mysql (trx=trx@entry=0x7f06ba7f6118) at mysql/storage/innobase/trx/trx0trx.cc:2452
#27 0x0000000005d6b351 in innobase_commit_low (trx=trx@entry=0x7f06ba7f6118) at mysql/storage/innobase/handler/ha_innodb.cc:5941
#28 0x0000000005d9e3c1 in innobase_commit (hton=<optimized out>, thd=<optimized out>, commit_trx=<optimized out>) at mysql/storage/innobase/handler/ha_innodb.cc:6108
```
this is serious, as it can lead to a deadlock. This seems to come from your change of `trx_add_to_serialisation_list` which now tries to latch just a single trx sys shard, but does so while holding purge_sys->mutex.
In general this kind of latch order violations can be checked by running ./mtr with --mysqld=--innodb_sync_debug=1 flag, so that the test runner passes --innodb_sync_debug=1 to the mysqld processes, so that they can validate the latching order.
This does not happen by default, but some of the mtr tests force that in their *-master.opt file, or by directly restarting the server with this option.
One way to resolve it would be to figure out if the latch ordering should be changed, by changing the value of SYNC_TRX_SYS_SHARD constant. Another is to restructure the code so it respects original latching order.

3. There is also a bug where you seem to try to "recursively" acquire the same mutex twice during innodb.innodb_bug-13628249 test:
```
#7  0x0000000005c9e66c in my_abort () at mysql/mysys/my_init.cc:262
#8  0x000000000630435d in ut_dbg_assertion_failed (expr=expr@entry=0x910b440 "!is_owned()", file=file@entry=0x910b260 "mysql/storage/innobase/include/sync0policy.ic", line=line@entry=63) at mysql/storage/innobase/ut/ut0dbg.cc:100
#9  0x0000000005ddb042 in MutexDebug<TTASEventMutex<GenericPolicy> >::enter (this=this@entry=0x7f56dbc1f1a8, mutex=mutex@entry=0x7f56dbc1f198, name=name@entry=0x929bfe0 "mysql/storage/innobase/trx/trx0trx.cc", line=line@entry=1898) at mysql/storage/innobase/include/sync0policy.ic:64
#10 0x0000000005ddb093 in GenericPolicy<TTASEventMutex<GenericPolicy> >::enter (this=this@entry=0x7f56dbc1f1a8, mutex=..., filename=filename@entry=0x929bfe0 "mysql/storage/innobase/trx/trx0trx.cc", line=line@entry=1898) at mysql/storage/innobase/include/sync0policy.h:289
#11 0x0000000005dfdfc4 in PolicyMutex<TTASEventMutex<GenericPolicy> >::enter (this=this@entry=0x7f56dbc1f198, n_spins=<optimized out>, n_delay=<optimized out>, name=name@entry=0x929bfe0 "mysql/storage/innobase/trx/trx0trx.cc", line=line@entry=1898) at mysql/storage/innobase/include/ib0mutex.h:623
#12 0x0000000005dfe169 in mutex_enter_inline<PolicyMutex<TTASEventMutex<GenericPolicy> > > (m=<optimized out>, loc=...) at mysql/storage/innobase/include/ut0mutex.h:113
#13 0x0000000005f5caac in IB_mutex_guard::IB_mutex_guard (this=<optimized out>, in_mutex=<optimized out>, location=...) at mysql/storage/innobase/include/ut0mutex.h:135
#14 0x00000000062e1753 in ut::Guarded<Trx_by_id_with_min, (latch_id_t)79>::latch_and_execute<trx_release_impl_and_expl_locks(trx_t*, bool)::<lambda(Trx_by_id_with_min&)> >(struct {...} &&, const ut::Location &) (this=0x7f56dbc1f198, f=..., loc=...) at mysql/storage/innobase/include/ut0guarded.h:50
#15 0x00000000062e1af9 in trx_release_impl_and_expl_locks (trx=trx@entry=0x7f56df2f5ca0, serialised=serialised@entry=false) at mysql/storage/innobase/trx/trx0trx.cc:1890
#16 0x00000000062e69ce in trx_free_prepared_or_active_recovered (trx=0x7f56df2f5ca0) at mysql/storage/innobase/trx/trx0trx.cc:629
#17 0x00000000062cd7d2 in Trx_by_id_with_min::free_prepared (this=this@entry=0x7f56dbc1f250) at mysql/storage/innobase/trx/trx0sys.cc:649
#18 0x00000000062cd874 in operator() (__closure=<optimized out>, trx_by_id_with_min=...) at mysql/storage/innobase/trx/trx0sys.cc:686
#19 0x00000000062cff46 in ut::Guarded<Trx_by_id_with_min, (latch_id_t)79>::latch_and_execute<trx_sys_close()::<lambda(Trx_by_id_with_min&)> >(struct {...} &&, const ut::Location &) (this=0x7f56dbc1f198, f=..., loc=...) at mysql/storage/innobase/include/ut0guarded.h:50
#20 0x00000000062d1723 in trx_sys_close () at mysql/storage/innobase/trx/trx0sys.cc:684
#21 0x0000000006223f88 in srv_shutdown () at mysql/storage/innobase/srv/srv0start.cc:3065
```
I believe it has to do with your proposal changing `trx_sys_close()` so that it loops over shards, to call `trx_by_id_with_min.free_prepared()` under shard mutex, but then the trx_free_prepared_or_active_recovered calls trx_release_impl_and_expl_locks  which also tries to enter the same shard in your new implementation.
This should basically cause a deadlock.

4. The innodb_undo.truncate_explicit fails as the size after "truncate" of undo log is not what it expects.

5. Several tests have failures which indicate some problem with purge, or at least the way the test tries to interact with purge, for example innodb.lob_ibd_size is surprised:
```
+Error: The ibd file 'test/my_sample_table.ibd' is growing.
```

Among failing tests, are:

innodb.fast_shutdown innodb.implicit_to_explicit_conversion innodb.index_merge_threshold innodb.innodb_buffer_pool_resize_with_instances innodb.innodb_bug-13628249 innodb.innodb_bug30113362 innodb.innodb_bug59641 innodb.innodb_status_utf8_truncate innodb.lob_ibd_size innodb.lob_ibd_size_replace innodb.max_trx_id innodb.xa_recovery innodb_undo.truncate innodb_undo.truncate_explicit innodb_undo.truncate_xa main.gtid_next_xa_binlog_off main.implicit_commit main.information_schema_statistics main.xa_default main.xa_mdl_backup

Hi, Jakub 
Thank you for your kind comment
Yes, it's still a proof-of-concept patch, I only run sysbench to verify performance improvement with release build. I didn't run the test case. Posting the patch aims to show the basic idea of how to get rid of trx_sys->mutex and It seems work.

I'll try to run test case with debug build

fix some cores (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0122.diff (application/octet-stream, text), 194.03 KiB.

Done:
1. cores menthoned above is fixed
2. remove unused code
3. add commit_rw_trxs in Trx_shard, and class Trx_commit_serialisation_list

TBD:
1. optimize slow path of getting scn
2. there's some issue related to undo purge need to be fixed

let purge thread/gtid flush thread using latest low limit no, so truncating related cases are passed

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0123.diff (application/octet-stream, text), 193.94 KiB.

Hi, jakub, your idea about undo log looks workable, I plan to make the following change:
1. if undo log resides in the first page of list, traverse the list(according to TRX_UNDO_NEXT_LOG), to find the right offset of log hdr 
2. If undo log not in first page, then all log in this page should belong to same transaction, I think we can write log header at end of the undo page (for example, reserve 8(trx id) + 4(page no) + 2(offset) at end of the page, and reset end when the page is reused)

How to check if it's the first page of undo chain:
- read page + TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_START, if it equals to TRX_UNDO_PAGE_HDR + TRX_UNDO_PAGE_HDR_SIZE, it's not first one, otherwise it is

Do you think will this works ?

Yes, all Undo Log Records which are on an Undo Page which is not the first page of Undo Log Segment belong to the same transaction.

Yes, you could use the "footer" of such page to store additional info. Just make sure to adjust `trx_undo_report_row_operation` and `trx_undo_erase_page_end` and handling of `MLOG_UNDO_ERASE_END` so that they respect this change. Also for backward compatibility, you'd need a fallback logic which just follows the links between Undo Log pages to find the first one. Thankfully this is easy to check because in current implementation the suffix of the page is filled with zeros, and (trx_id,page_no,offset) can't be all zeros.

BTW. trx_id is just 6 bytes.

Yes, I think you can distinguish first page from non-first page by looking at the TRX_UNDO_PAGE_START.

One more thing you might want to experiment with is to replace your spinlocks with Seq_lock from ut0seq_lock.h - you just need to add a new `locking_write` method to it because the logic of the existing `write` method assumes there's some external mutual exclusion mechanism to let just one writer into this method. You can just copy&paste `write` and just change the beginning like this:
- const auto old = m_seq.load(std::memory_order_relaxed);
+ auto old = m_seq.load(std::memory_order_relaxed);

/* The odd value means someone else is executing the write operation
concurrently, and this is not allowed. */
- ut_ad((old & 1) == 0);
+ while(true){
+ if (old&1) {
+ ut_delay(1);
+ old = m_seq.load(std::memory_order_relaxed);
+ } else if(m_seq.compare_exchange_weak(old, old + 1){
+ break;
+ }
+ }

The advantage of Seq_lock w.r.t. to spinlock is that readers will not have to modify any memory location at all.

Also, one of my big worries about your patch is that reading a page which could potentially be freed and reused for something else is tricky business.
I think `trx_undo_get_scn` should do as much verification as possible. At minimum check the FIL_PAGE_TYPE.
I think the approach I proposed, where you iterate over Undo Log Headers, as opposed to your approach where you jump straight to the "header" has one benefit: if the page was reused/rewritten, and now contains something else, it will be easy to spot even if bytes got rearranged in a pattern which accidentally seems to match trx->id.
I think combined with your idea to check if the page is first or not by looking at TRX_UNDO_PAGE_START it could lead to quite robust mechanism.
If we see the page is undo page (has right FIL_PAGE_TYPE) and TRX_UNDO_PAGE_START indicates it is the first page, and iterating over headers we found one which matches trx->id, then I think it is a correct conclusion that this is indeed the header for that trx (as opposed to some random garbage/other info accidentally arranging in a pattern which matches).

Another big worry is that it could be somehow incorrect in that removing acquisition of trx_sys->mutex from crucial lifecycle events of a trx, might lead to some subtle races or violations of assumptions. As of now I don't see any specific problem, but this doesn't mean it doesn't have one. Verifying this patch will take me a lot of time, and I wonder if there's something you could do to help prove correctness or at least provide more evidence for it.

Meanwhile I've run some perf tests on a beefy machine.
sb_exec/sb11-OLTP_RW_10M_8tab-${test}-ps-trx.sh:
pareto 64
 3 28214 +- 81 ( 28070 < 28193 < 28295 ) "origin/mysql-trunk baseline 339b57bb"
 3 28176 +- 38 ( 28138 < 28201 < 28290 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 3 26368 +- 50 ( 26318 < 26372 < 26430 ) "origin/mysql-trunk-bug34972994-contrib"
pareto 1024
 3 32711 +- 21 ( 32356 < 32599 < 32732 ) "origin/mysql-trunk baseline 339b57bb"
 3 32424 +- 24 ( 32400 < 32440 < 32498 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 3 30855 +- 81 ( 30550 < 30780 < 30936 ) "origin/mysql-trunk-bug34972994-contrib"
uniform 64
 3 27511 +- 115 ( 27136 < 27424 < 27626 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 3 27291 +- 109 ( 27182 < 27415 < 27773 ) "origin/mysql-trunk baseline 339b57bb"
 3 27224 +- 388 ( 26836 < 27240 < 27661 ) "origin/mysql-trunk-bug34972994-contrib"
uniform 128
 3 36139 +- 296 ( 35843 < 36149 < 36465 ) "origin/mysql-trunk-bug34972994-contrib"
 3 35832 +- 209 ( 35599 < 35824 < 36041 ) "origin/mysql-trunk baseline 339b57bb"
 3 35692 +- 186 ( 35506 < 35749 < 36051 ) "origin/mysql-trunk-bug34972994-openended-nolist"
uniform 1024
 3 36650 +- 15 ( 36495 < 36603 < 36665 ) "origin/mysql-trunk baseline 339b57bb"
 3 36503 +- 91 ( 36357 < 36484 < 36594 ) "origin/mysql-trunk-bug34972994-contrib"
 3 36307 +- 1  ( 36306 < 36314 < 36330 ) "origin/mysql-trunk-bug34972994-openended-nolist"

sb11-oltp_update_index_10M_8tab-${test}-ps-trx.sh:
pareto 64
 10 180678 +- 1055 ( 179087 < 181370 < 185864 ) "origin/mysql-trunk baseline 339b57bb"
 10 179933 +- 1732 ( 166414 < 177222 < 183726 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 10 165877 +- 1382 ( 161727 < 165887 < 169226 ) "origin/mysql-trunk-bug34972994-contrib"
pareto 128
 10 193251 +- 1178 ( 190447 < 194093 < 199610 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 10 181398 +- 5805 ( 174586 < 182348 < 190966 ) "origin/mysql-trunk-bug34972994-contrib"
 10 163623 +- 8485 ( 150386 < 164797 < 174027 ) "origin/mysql-trunk baseline 339b57bb"
pareto 1024
 10 180366 +- 2955 ( 175815 < 180205 < 184500 ) "origin/mysql-trunk-bug34972994-contrib"
 10 164066 +- 1961 ( 158174 < 164290 < 168272 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 10 124976 +- 3408 ( 112776 < 124450 < 131974 ) "origin/mysql-trunk baseline 339b57bb"
uniform 64
 10 178097 +- 468 ( 175460 < 178423 < 181122 ) "origin/mysql-trunk baseline 339b57bb"
 10 177621 +- 966 ( 174808 < 177617 < 181730 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 10 174287 +- 1098 ( 172905 < 174532 < 177320 ) "origin/mysql-trunk-bug34972994-contrib"
uniform 128
 10 227030 +- 5812 ( 212292 < 227282 < 240196 ) "origin/mysql-trunk-bug34972994-contrib"
 10 208370 +- 2740 ( 195690 < 208209 < 214985 ) "origin/mysql-trunk-bug34972994-openended-nolist"
 10 162126 +- 6118 ( 139215 < 157786 < 168395 ) "origin/mysql-trunk baseline 339b57bb"
uniform 1024
 10 325591 +- 9238 ( 309929 < 326531 < 343347 ) "origin/mysql-trunk-bug34972994-contrib"
 10 114576 +- 2663 ( 105816 < 114261 < 122907 ) "origin/mysql-trunk baseline 339b57bb"
 10  91404 +- 1565 (  88129 <  91382 <  94491 ) "origin/mysql-trunk-bug34972994-openended-nolist"

The mysql-trunk-bug34972994-openended-nolist is one of the versions of new MVCC implementation, which has one thing in common with yours that it eliminates trx_sys->mutex acquistion on trx start.
The result for "UPDATE KEY uniform 1024" scenario looks most impressive but also most puzzling.
Also for "UPDATE KEY pareto 64" somehow we see a slow down, which is also puzzling.
I hope to investigate root causes soon.

I did the experiment I've suggested earlier - see what happens if I run, say OLTP RW uniform with 64 clients while there is an active long running rw-transaction.
The reason I've suspected this could lead to a problem is that such a long running transaction would have relatively low trx->id, which has to be accounted to in each new read view created, so it will not be able to use the quick exit of:
```
   if (id < m_up_limit_id) {
     return true;
   }
```
because `m_up_limit` will be small. So it will have to call `get_scn`, which first tries `get_scn_fast`, which in turn will first try to use `if (id < purge_sys->min_up_id.load(std::memory_order_relaxed) || id == 0) {` as a quick exit, but this will also not work, because min_up_id will also be small. So it will have to use `m_scn_map->read(id)`. As most of the rows in the db were not modified by most recent 1M of transactions (I am using a setup with 8 tables with 10M rows each, and single transaction modifies just a few via   execute_index_updates(); execute_non_index_updates();  execute_delete_inserts() which AFAICT at most modifies 3-4 rows), this will probably also fail. Then it will try m_random_map, which again caches just 1M entries, so a chance of cache hit is probably at the order of 1:80 (I haven't measured). So it will have to dive.
The exact script I've used is this:
```
#!/bin/bash
rev=$1
cd ~/mysql-tests
./run_cached_sysbench_on_hypothesis.sh mysql-trunk $rev 64 sb_exec/sb11-OLTP_RW_10M_8tab-uniform-ps-trx.sh &
echo "Sleeping for 60 sec"
sleep 60
echo "Trying to connect"
while ! apps/mysql-trunk/mysql.sh <<< "SHOW DATABASES";
do
   echo "Could not connect yet, sleep for 1 sec"
   sleep 1;
done
echo "Connected. Now start a 300 sec trx";
apps/mysql-trunk/mysql.sh <<< "CREATE TABLE sysbench.t1 (id INT PRIMARY KEY);BEGIN; INSERT INTO sysbench.t1 VALUES (42); SELECT SLEEP(300); COMMIT;"
echo "Finished trx"
wait
```
What I see in the output is this:
```
Connected. Now start a 300 sec trx
mysql: [Warning] Using a password on the command line interface can be insecure.
[ 10s ] thds: 64 tps: 25507.66 qps: 510164.63 (r/w/o: 357116.76/102031.71/51016.17) lat (ms,95%): 2.66 err/s: 0.00 reconn/s: 0.00
[ 20s ] thds: 64 tps: 25847.74 qps: 516945.90 (r/w/o: 361864.36/103386.06/51695.48) lat (ms,95%): 2.91 err/s: 0.00 reconn/s: 0.00
[ 30s ] thds: 64 tps: 26320.21 qps: 526408.64 (r/w/o: 368486.50/105281.23/52640.91) lat (ms,95%): 2.52 err/s: 0.00 reconn/s: 0.00
[ 40s ] thds: 64 tps: 26347.44 qps: 526943.59 (r/w/o: 368859.92/105388.88/52694.79) lat (ms,95%): 3.19 err/s: 0.00 reconn/s: 0.00
[ 50s ] thds: 64 tps: 26634.93 qps: 532722.85 (r/w/o: 372903.48/106549.61/53269.75) lat (ms,95%): 2.48 err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 64 tps: 26869.96 qps: 537388.77 (r/w/o: 376178.82/107469.83/53740.12) lat (ms,95%): 2.52 err/s: 0.00 reconn/s: 0.00
[ 70s ] thds: 64 tps: 27157.88 qps: 543133.58 (r/w/o: 380187.71/108630.12/54315.76) lat (ms,95%): 2.43 err/s: 0.00 reconn/s: 0.00
[ 80s ] thds: 64 tps: 26855.13 qps: 537089.20 (r/w/o: 375956.25/107423.80/53709.15) lat (ms,95%): 2.52 err/s: 0.00 reconn/s: 0.00
[ 90s ] thds: 64 tps: 26735.39 qps: 534717.25 (r/w/o: 374306.32/106939.35/53471.57) lat (ms,95%): 2.76 err/s: 0.00 reconn/s: 0.00
[ 100s ] thds: 64 tps: 26582.93 qps: 531677.05 (r/w/o: 372177.08/106333.81/53166.15) lat (ms,95%): 2.81 err/s: 0.00 reconn/s: 0.00
[ 110s ] thds: 64 tps: 26450.17 qps: 529023.67 (r/w/o: 370318.43/105804.89/52900.35) lat (ms,95%): 2.81 err/s: 0.00 reconn/s: 0.00
[ 120s ] thds: 64 tps: 26226.84 qps: 524502.99 (r/w/o: 367147.52/104901.78/52453.69) lat (ms,95%): 3.19 err/s: 0.00 reconn/s: 0.00
[ 130s ] thds: 64 tps: 25983.53 qps: 519675.03 (r/w/o: 363772.94/103935.03/51967.06) lat (ms,95%): 3.55 err/s: 0.00 reconn/s: 0.00
[ 140s ] thds: 64 tps: 25623.54 qps: 512472.28 (r/w/o: 358731.85/102493.46/51246.98) lat (ms,95%): 3.75 err/s: 0.00 reconn/s: 0.00
[ 150s ] thds: 64 tps: 24741.04 qps: 494829.82 (r/w/o: 346381.48/98966.16/49482.18) lat (ms,95%): 4.03 err/s: 0.00 reconn/s: 0.00
[ 160s ] thds: 64 tps: 23893.99 qps: 477896.48 (r/w/o: 334528.61/95579.88/47787.99) lat (ms,95%): 4.41 err/s: 0.00 reconn/s: 0.00
[ 170s ] thds: 64 tps: 22728.83 qps: 454561.94 (r/w/o: 318196.45/90907.83/45457.66) lat (ms,95%): 4.91 err/s: 0.00 reconn/s: 0.00
[ 180s ] thds: 64 tps: 21527.08 qps: 430537.61 (r/w/o: 301372.36/86111.10/43054.15) lat (ms,95%): 5.28 err/s: 0.00 reconn/s: 0.00
[ 190s ] thds: 64 tps: 19028.23 qps: 380573.29 (r/w/o: 266402.89/76113.94/38056.47) lat (ms,95%): 5.88 err/s: 0.00 reconn/s: 0.00
[ 200s ] thds: 64 tps: 16423.98 qps: 328477.08 (r/w/o: 229932.01/65697.12/32847.96) lat (ms,95%): 6.43 err/s: 0.00 reconn/s: 0.00
[ 210s ] thds: 64 tps: 14091.65 qps: 281826.56 (r/w/o: 197281.87/56361.49/28183.20) lat (ms,95%): 7.04 err/s: 0.00 reconn/s: 0.00
[ 220s ] thds: 64 tps: 13382.14 qps: 267648.26 (r/w/o: 187354.93/53528.95/26764.38) lat (ms,95%): 7.30 err/s: 0.00 reconn/s: 0.00
[ 230s ] thds: 64 tps: 12575.21 qps: 251501.78 (r/w/o: 176050.83/50300.64/25150.32) lat (ms,95%): 7.56 err/s: 0.00 reconn/s: 0.00
[ 240s ] thds: 64 tps: 11448.50 qps: 228971.04 (r/w/o: 160277.96/45795.99/22897.09) lat (ms,95%): 7.98 err/s: 0.00 reconn/s: 0.00
SLEEP(300)
0
Finished trx
[ 250s ] thds: 64 tps: 27824.48 qps: 556530.68 (r/w/o: 389564.01/111317.72/55648.96) lat (ms,95%): 2.43 err/s: 0.00 reconn/s: 0.00
[ 260s ] thds: 64 tps: 28157.84 qps: 563127.70 (r/w/o: 394195.26/112616.86/56315.58) lat (ms,95%): 2.39 err/s: 0.00 reconn/s: 0.00
[ 270s ] thds: 64 tps: 28050.86 qps: 560995.59 (r/w/o: 392697.13/112196.64/56101.82) lat (ms,95%): 2.39 err/s: 0.00 reconn/s: 0.00
[ 280s ] thds: 64 tps: 27767.74 qps: 555374.38 (r/w/o: 388759.84/111079.06/55535.48) lat (ms,95%): 2.43 err/s: 0.00 reconn/s: 0.00
[ 290s ] thds: 64 tps: 27926.03 qps: 558520.19 (r/w/o: 390964.78/111703.44/55851.97) lat (ms,95%): 2.39 err/s: 0.00 reconn/s: 0.00
[ 300s ] thds: 64 tps: 27913.58 qps: 558184.41 (r/w/o: 390718.13/111645.42/55820.86) lat (ms,95%): 2.39 err/s: 0.00 reconn/s: 0.00
```
Observe how initially there is no big impact, but after around 3 minutes, we start to see the performance declining. Also, we see it rebound as soon as the long running transaction finishes.
At ~25k TPS and around 180 sec, we talk about approximately 4.5M transactions, and perhaps around 13.5-18M rows modified? I am not sure why that would be any important threshold but the measurements indicate some phase transition.
An alternative explanation would be that the slowdown is steadily raising over time, but is hidden by various other latencies in the system and only after 3mins becomes the main bottleneck and deciding factor.

I've also run the same experiment for unmodified trunk and my "origin/mysql-trunk-bug34972994-openended-nolist", and they don't experience such a slowdown.

Yes, long running transaction seems a big problem, I'll figure out how to fix the issue. 

a possible but trick way is to collect long-running transaction's id set (say long_running set), and advance up_limit_id without considering ids in the set.

so,background thread:
1. scan each shard, collect ids from the map if there's long-running rw transaction(say if it runs more than 10s), calculating min_id without considering the ids in set

2.
x lock:
update long-running set
update min_trx_id
x unlock

open read view:
s lock
copy the long-running set to local set
take min id as up_limit_id
s unlock
check transaction ids in local set, and remove id from set if it's not active anymore

changes_visible:
first check if id is in long-running set(if set is not empty), and then check up_limit_id

I'm not sure if it works. maybe we can use some kind of lock-free structures to store/remove long-running ids instead of introduce in a new rw lock

another way is possiblely calculating slow range and fast range in background:
[slow_up_limit_id, slow_low_limit_id) ....there's possible active slow trxs in the range
[fast_up_limit_id, fast_low_limit_id)

so when checking visibility
1. id < slow_up_limit_id, [slow_low_limit_id, fast_up_limit_id] can do fast checking
2. id in slow and fast range, need to do further checking

1. new strategy of searching undo header 2. remove spin_lock and use Seq_lock TBD: solve issue with long-running transaction (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0124.diff (application/octet-stream, text), 190.34 KiB.

fixed bug in ReadView::prepare that user ro session runs slow path to get low_limit_no (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0124.diff (application/octet-stream, text), 190.57 KiB.

fix long-running trx by picking them out and advance up_limit_id (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_fix_long_trx.diff (application/octet-stream, text), 197.76 KiB.

fix a bug of changes_visible (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0128.diff (application/octet-stream, text), 197.79 KiB.

Posted by developer:

Hello, thanks for new patches!

Looks like there are some things which could be simplified, which would help with reviewing it.
The smaller the change, and easier to analyse, the bigger chance we will try to use it.

1. IIUC lock_clust_rec_cons_read_sees doesn't really need pcur argument which you've introduced.
2. Your patch generates trx->id and trx->no from two different sequences (odd/even) using two different atomics, and storing maximum of them in system tablespace's header. I agree there's no fundamental reason these two have to come from the same monotonic sequence. However, it looks like some of the code in the patch is a left-over from initial solution in which the id stored in the roll pointer could be either "trx->id" or "trx->no", and thus you had to be able to recognize which of the two is it (is_scn(id)), and several functions even in the latest patch still seem to have if/else statements to handle both cases. But, AFAIU in the latest approach we never change the "type" of information in any place and always know what we are dealing with. I think the code could be simplified if it used a naming convention to distinguish the types and thus prune unreachable code. Also, I am not even sure if in the latest approach you still need the two sequences to be disjoint, etc. Maybe look at it with a fresh eye
3. I might be wrong, but looks like m_committing_scns is redundant: every trx->no added to it, had a corresponding trx->id added to m_invisible_ids first, and when checking for visibility you check m_invisible_ids first, so I don't see a case in which m_committing_scns provides additional info
4. This is somewhat inter-related with 3., but I am not sure why you had to modify various data structures like dict_index_t to have new fields for no/scn, in addition to the existing trx id field. AFAICT, your solution is able to answer visibility questions based on id alone, so (for example) why would we need to know the no/scn of trx which created the index?
5. I don't think the newly introduced `execute_no_latch` is needed - you can use, I think `peek()`
6. I have not checked, but have you seen Page_fetch::POSSIBLY_FREED and POSSIBLY_FREED_NO_READ_AHEAD modes? Perhaps you could reuse this, instead of introducing brand new ways of accessing the buffer pool? I don't have much experience with this mode, so I am not sure how it handles the case where whole Undo Tablespace was re-created (as it happens during "undo log truncation"), nor if it returns nullptr or a valid page pointer if page is freed.

Some things seem suboptimal:
1. I think some of the new variables declared trx_ids_set_t could actually be std::unordered_set's instead, so lookup would be O(1). For example, I think trx->in_long_set would not be needed, if you could just erase in O(1) time.
2. Up to you, but instead of writing 14 bytes, you can write 12, by using MLOG_WRITE_STRING for example, or by packing trx_id_t with hdr offset in one 8-byte field
3. I have not tested this hypothesis, but I am worried that while your m_long_running_ids trick helps the recent transactions avoid traversing the undo log chains, it in some sense reintroduces the original performance problem from mysql-trunk which is that read-view creation requires copying data (m_long_running_ids) from a central place. Yes, it is now less data, and you use s-latching instead of mutex, which is nice improvement, but I need to see how it scales, in particular when the background thread wants to acquire x-latch.

Some things look fishy:
1. IIUC you write the maximum of trx->id/trx->no to the system tablespace header, but you do not seem to hold any mutex while doing so. In trunk we either hold trx_sys->mutex or trx_sys->serialization mutex when obtaining next id/no, so if we are currently in the process of writing the max id to disc, threads from one of the two categories get blocked from incrementing it further. This leads to an important property in trunk, that the ids in use can't be larger than 2*MARGIN than the value persisted to the header. (Why? Because if the value on disc is X, then it means we could at most use values up to X+MARGIN before one of the threads notices the value being divisible by MARGIN, while under one of the two mutexes, the threads which use the other mutex, might still increment it further, but not further than X+MARGIN*2 before they also notice a value divisible by MARGIN. At this point none of the threads will be able to obtain new value until one of the two writer threads acquires latch on the system tablespace page, and writes the *current* value of the atomic to disc, which is guaranteed to be at least the value divisible by MARGIN which it saw). In your approach, I don't see the mutex acquistions, so I worry that threads my rush ahead quite a lot. In case InnoDB crashes, and then recovers, it might accidentally assign trx->id or trx->no which was already in use before the crash. That in turn can lead not only to assertion failures, but also to data loss, as we, for example, rollback wrong undo log records.
2. I am not sure your logic (or trunk's logic, even for Page_fetch::POSSIBLY_FREED) ever checks if the page read from disc is actually "freed" in the sense of being marked in the extend descriptor bitmap as freed. I am not sure it matters, neither, as what should really matter is if it has the right content, which you check. Still, I am not sure, what will happen later, if we let this page stay in the BP. Someone might want to reuse it, in which case, I am not sure if some asserts will fail or not. Also, not clear, if we properly add it to flush list/lru list, or should we even.
3. SCN_Mgr::take_up_ids seems to leave m_up_limit_id uninitialized in some edge cases

I am yet to analyse the changes to
- gtids persistence
- undo logging
- dictionary mgmt

I think, of the issues I've mentioned the one about persisting max trx id to system tablespace's header is the most important to clarify.

I think some ideas from this patch are interesting on their own, and could be incorporated as smaller changes to mysql-trunk.
For example, the idea to protect the set of active transactions ids with rw_lock_t instead of a mutex, opens up a possibility for creating read-views with s-latch instead of x-latch.
Also, the idea to use more fine grained mutexes for various parts.
Or the idea to use caching in a fixed-size hashmap (so there's no "dangling pointer" issues) is nice.
And in general, the idea that we could use data on disc as source of ground truth, and use in-memory data structures as a cache is promising.
Also interesting is the idea to have a background thread for optimizing the data structures used by foreground threads.
I am also looking forward to reviewing your changes to gtids persistence, as the interplay between it and purge process is a pain which causes some over-complication of mysql-trunk.

Hi,Jakub
thank you for your review and nice comment :)

My answer to your comment:

1. yes, changes to lock_clust_rec_cons_read_sees should be reverted, it's introduced by old patch

2. Possiblely with new approach it's not necessary to distinguish scn and no. many code can be simplified. I need to look into it.

3. yes, you are right, it's introduced by old approach

4. I need to check dict_index_t::is_usable. I'm not familiar with DDL code, modifying is aim to pass some case

5.execute_no_latch is needed on shutdown, ref: trx_sys_close

6. Yes POSSIBLY_FREED can be used. With old approach it may try to read truncated page by background thread, but as the threads are all removed now, I think We can use this flag to read undo safely

Some things seem suboptimal:
1. trx_ids_set_t of read view/SCN_Mgr need to be ordered, so it can do a quick checking in changes_visible. trx_ids_set_t in trx_shard can use std::unordered_set

2. I think MLOG_WRITE_STRING is not that clear from the side of reading code. more two bytes is not a big deal

3. Yes, if long running transactions are added/removed very very frequently, the background thread will need to update its local cache frequently, This is the worst case, may be we can control the frequency of updating local cache, or use some lock-free structure ?

Some things look fishy:

1. you're right, there's a potential risk of losing new ids. I need to look into how to fix it without adding new mutexes

2. I think it should be fine, a freed page will firstly init it before using it

3. thank you, you are right :)

new version, comment in bellow (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0201.diff (application/octet-stream, text), 173.79 KiB.

changes of latest patch:
1. some set is changed to unorderded_set
2. use Page_fetch::POSSIBLY_FREED to read undo page
3. use one counter for id and scn 
4. remove m_committing_scns
5. remove index scn
6. revert changes of lock_clust_rec_cons_read_sees
7. change the way of storing max_trx_id_or_no
   * whenever a transaction is commit and adding to history list, store max trx id     
     in rseg header
   * when a undo space is truncated, store max trx id in sys header
   * on startup, resurrect max trx id from rseg header, and uncommitted transaction id + 1, 
   * choose max value between resurrected id and id stored in sys header
    

4.

bugfix (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0203.diff (application/octet-stream, text), 173.77 KiB.

bugfix: trx_purge_t::min_up_id is not correctly calculated (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0204.diff (application/octet-stream, text), 174.21 KiB.

bugfix:view_open:: should set view slot before prepare (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0204_2.diff (application/octet-stream, text), 174.21 KiB.

patch based on previous one and add history list for insert undo (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: insert_history_list.diff (application/octet-stream, text), 10.01 KiB.

Benefit of adding new history list for insert undo:
1. it uses unused space before to create histroy list
2. purge thread will not need to parse insert undo
3. Server can rollback to older version

add new read only variable innodb_mvcc_use_scn, true by default (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0208.diff (application/octet-stream, text), 178.87 KiB.

1. some bugfix 2. parallel reading use traditional mvcc (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0210.diff (application/octet-stream, text), 181.15 KiB.

bugfix (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0211.diff (application/octet-stream, text), 181.49 KiB.

remove scn_mutex, and bugfix (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0217.diff (application/octet-stream, text), 178.00 KiB.

remove unnecessary debug checking in ReadView::copy_trx_ids (*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: scn_0219.diff (application/octet-stream, text), 179.35 KiB.