MySQL Bugs: #94553: Crash in trx_undo_rec

Bug #94553	Crash in trx_undo_rec_copy
Submitted:	4 Mar 2019 18:15	Modified:	21 Sep 2020 12:43
Reporter:	Bernardo Perez	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	5.7.16	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
This is related to:
https://bugs.mysql.com/bug.php?id=82734
And in particular to: 
https://bugs.mysql.com/bug.php?id=82736

This results on the system trying to allocate close to an exabyte of memory. 

I could clearly see on step 11 (by printing "index->table->name") that the table being purged is using virtual generated columns. 

#0  0x00002b4f27f805f7 in raise () from /lib64/libc.so.6
#1  0x00002b4f27f81ce8 in abort () from /lib64/libc.so.6
#2  0x000000000077f4c2 in ut_dbg_assertion_failed (expr=expr@entry=0x15dd3f3 "!m_fatal", file=file@entry=0x15dcc30 "/local/mysql-5.7.16.R2/storage/innobase/ut/ut0ut.cc", line=line@entry=931) at /local/mysql-5.7.16.R2/storage/innobase/ut/ut0dbg.cc:67
#3  0x0000000001075113 in ib::fatal_or_error::~fatal_or_error (this=0x2b6814a00ab0, __in_chrg=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/ut/ut0ut.cc:931
#4  0x0000000000f5390c in allocate (throw_on_error=false, set_to_zero=false, file=0x15c79f0 "/local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc", hint=0x0, n_elements=18446744073709544272, this=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/include/ut0new.h:363
#5  mem_heap_create_block_func (heap=heap@entry=0x2b6774c2eb18, n=<optimized out>, type=0) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:302
#6  0x0000000000f53c7c in mem_heap_add_block (heap=0x2b6774c2eb18, n=n@entry=18446744073709544154) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:408
#7  0x0000000000f53e2c in mem_heap_alloc (n=18446744073709544154, heap=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/include/mem0mem.ic:204
#8  mem_heap_dup (heap=<optimized out>, data=0x2b4f2cbe9d26, len=18446744073709544154) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:59
#9  0x00000000010471e3 in trx_undo_rec_copy (heap=0x2b6774c2eb18, undo_rec=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/include/trx0rec.ic:99
#10 trx_undo_get_undo_rec_low (roll_ptr=<optimized out>, heap=heap@entry=0x2b6774c2eb18, is_redo_rseg=is_redo_rseg@entry=true) at /local/mysql-5.7.16.R2/storage/innobase/trx/trx0rec.cc:2124
#11 0x0000000001047663 in trx_undo_prev_version_build (index_rec=index_rec@entry=0x2b4f2dbd2246 "\233VqJ", index_mtr=index_mtr@entry=0x2b6814a01970, rec=0x2b4f2dbd2246 "\233VqJ", index=index@entry=0x2b6773d18188, offsets=offsets@entry=0x2b6774c2f088, heap=<optimized out>, old_vers=old_vers@entry=0x2b6814a013c0, v_heap=v_heap@entry=0x0,
    vrow=vrow@entry=0x2b6814a018d0, v_status=v_status@entry=3) at /local/mysql-5.7.16.R2/storage/innobase/trx/trx0rec.cc:2258
#12 0x0000000001007b7d in row_vers_vc_matches_cluster (in_purge=in_purge@entry=true, rec=rec@entry=0x2b4f2dbd2246 "\233VqJ", row=row@entry=0x2b6774c40888, ext=<optimized out>, clust_index=clust_index@entry=0x2b6773d18188, clust_offsets=clust_offsets@entry=0x2b6774c2f088, index=index@entry=0x2b6773d1ae88, ientry=ientry@entry=0x2b6774c2e688,
    roll_ptr=roll_ptr@entry=23925402013206650, trx_id=trx_id@entry=41139801512, vrow=vrow@entry=0x2b6814a018d0, mtr=mtr@entry=0x2b6814a01970, v_heap=0x0) at /local/mysql-5.7.16.R2/storage/innobase/row/row0vers.cc:702
#13 0x0000000001009599 in row_vers_old_has_index_entry (also_curr=also_curr@entry=1, rec=0x2b4f2dbd2246 "\233VqJ", mtr=mtr@entry=0x2b6814a01970, index=index@entry=0x2b6773d1ae88, ientry=ientry@entry=0x2b6774c2e688, roll_ptr=23925402013206650, trx_id=41139801512) at /local/mysql-5.7.16.R2/storage/innobase/row/row0vers.cc:947
#14 0x0000000000fdbb50 in row_purge_poss_sec (node=0x2b6714b88488, index=index@entry=0x2b6773d1ae88, entry=entry@entry=0x2b6774c2e688) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:264
#15 0x0000000001094f49 in btr_cur_search_to_nth_level (index=index@entry=0x2b6773d1ae88, level=level@entry=0, tuple=tuple@entry=0x2b6774c2e688, mode=mode@entry=PAGE_CUR_LE, latch_mode=2, latch_mode@entry=8194, cursor=cursor@entry=0x2b6814a02e40, has_search_latch=has_search_latch@entry=0,
    file=file@entry=0x15d1010 "/local/mysql-5.7.16.R2/storage/innobase/row/row0row.cc", line=line@entry=1073, mtr=0x2b6814a030a0, mtr@entry=0x2b6773d1ae88) at /local/mysql-5.7.16.R2/storage/innobase/btr/btr0cur.cc:1152
#16 0x0000000000fe0e0f in btr_pcur_open_low (level=0, mode=PAGE_CUR_LE, file=0x15d1010 "/local/mysql-5.7.16.R2/storage/innobase/row/row0row.cc", mtr=0x2b6773d1ae88, line=1073, cursor=0x2b6814a02e40, latch_mode=8194, tuple=0x2b6774c2e688, index=0x2b6773d1ae88) at /local/mysql-5.7.16.R2/storage/innobase/include/btr0pcur.ic:465
#17 row_search_index_entry (index=index@entry=0x2b6773d1ae88, entry=entry@entry=0x2b6774c2e688, mode=mode@entry=8194, pcur=pcur@entry=0x2b6814a02e40, mtr=mtr@entry=0x2b6814a030a0) at /local/mysql-5.7.16.R2/storage/innobase/row/row0row.cc:1073
#18 0x0000000000fdcee5 in row_purge_remove_sec_if_poss_leaf (node=node@entry=0x2b6714b88488, index=index@entry=0x2b6773d1ae88, entry=entry@entry=0x2b6774c2e688) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:471
#19 0x0000000000fde098 in row_purge_remove_sec_if_poss (entry=0x2b6774c2e688, index=0x2b6773d1ae88, node=0x2b6714b88488) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:587
#20 row_purge_upd_exist_or_extern_func (node=node@entry=0x2b6714b88488, undo_rec=undo_rec@entry=0x2b6813c61c28 "\tq\f\003\220", <incomplete sequence \330>) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:709
#21 0x0000000000fde5ea in row_purge_record_func (updated_extern=<optimized out>, undo_rec=0x2b6813c61c28 "\tq\f\003\220", <incomplete sequence \330>, node=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:987
#22 row_purge (thr=0x2b6814a03c40, undo_rec=<optimized out>, node=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:1031
#23 row_purge_step (thr=thr@entry=0x2b4f28da0458) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:1110
#24 0x0000000000f8eb29 in que_thr_step (thr=0x6) at /local/mysql-5.7.16.R2/storage/innobase/que/que0que.cc:1056
#25 que_run_threads_low (thr=0x6) at /local/mysql-5.7.16.R2/storage/innobase/que/que0que.cc:1118
#26 que_run_threads (thr=thr@entry=0x2b4f28da0458) at /local/mysql-5.7.16.R2/storage/innobase/que/que0que.cc:1158
#27 0x0000000001013655 in srv_task_execute () at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:2458
#28 srv_worker_thread (arg=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:2508
#29 0x00002b4f26accdc5 in start_thread () from /lib64/libpthread.so.0
#30 0x00002b4f28041c9d in clone () from /lib64/libc.so.6/

How to repeat:
Exercise InnoDB crash recovery while generated columns based on JSON data are in use.

samples of the types of columns we are using:

`test1` timestamp GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test1'))) VIRTUAL NULL
`test2` timestamp GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test2'))) VIRTUAL NULL
`test3` int(11) GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test3'))) VIRTUAL
`test4` int(11) GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test4'))) VIRTUAL
`test5` tinyint(1) GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test5))) VIRTUAL
`test6` int(11) GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test6'))) VIRTUAL
`test7` float GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test7'))) VIRTUAL
`test8` float GENERATED ALWAYS AS (json_unquote(json_extract(`data_json`,'$.test8'))) VIRTUAL

Suggested fix:
It seems to be mentioned on:
https://bugs.mysql.com/bug.php?id=82734

I've been scanning through the code and the release versions and could not find anything pointing to a fix on this issue.

Could you point if this has been fixed at any given point on a specific (minor/mayor) version?

Hi,

Problem is most probably in your settings or in the engine that you used.

So, for the start check all your settings, including all those that lead to memory allocations, like default storage type, like all internal_tmp* settings, tmp_table_size, buffer pool size etc ......

Also, check whether you have created HEAP or MEMORY table with many rows.

Hello Sinisa, Thanks for your reply.

Are you suggesting that our settings are trying to allocate 1 hexabyte of memory when doing an internal rollback?

Could you please point me to what settings do you want me to check that validate the problem?

I've attached the my.cnf of the instance. The rest of the parameters not present here are default mysql parameters. 

Some clarifications:

1) Okay so, this is a snipet of the error log: 

[ERROR] InnoDB: Cannot allocate 18446744073709540048 bytes of memory after 60 retries over 60 seconds. OS error: Cannot allocate memory (12). Check if you should increase the swap file or ulimits of your operating system. Note that on most 32-bit computers the process memory space is limited to 2 GB or 4 GB.

2) We can consistently reproduce when restoring from a specific physical backup that has an inflight transaction. This database has NO activity NOR connections running, just the purge thread rolling back a transaction.

3) We have also experienced this issue on other scenarios that usually would involve rolling back transactions in flight. 

4) Our system has 240GB of memory. We can't see any parameter than can consume 18446744073709540048 bytes of memory

5) Just in case this is not clear enough and reiterating the message: There are no connections to the engine, the only thing going on is the purge thread. How can the purge thread request such an amount of memory?

Hi,

Your my.cnf seems OK. However, you use strings for numeric values, so fix this.

Next, please check if you are running 32-bit binary on 64-bit system. Or the other way around.

Last, but not least, please try 5.7.25, since so many similar bugs have been fixed interim.

Hello Sinisa, thanks again for your reply. Regarding your comments:

You mentioned:

1) "However, you use strings for numeric values, so fix this."

Would you mind to elaborate a bit more on what you mean on this point?

2) "Next, please check if you are running 32-bit binary on 64-bit system. Or the other way around."

We are using a 64 bit binary for a 64 bit system. 

3) "Last, but not least, please try 5.7.25, since so many similar bugs have been fixed interim."

Something similar to this has been suggested on https://bugs.mysql.com/bug.php?id=82736 when the first report was done in 5.7.11 and the suggestions where to upgrade to .14/.15 and here we are in .16 with the same issue.

I've personally scanned all the release notes up to .25 and can't find any reference to this.

If I am suggested to upgrade to a version, with the impact that this may bring to my systems, would you mind to please pinpoint the release notes were this specific issue is fixed AND the commit with the changes?

Thanks in advance.

Hi,

Regarding numeric variables, you should not use quotes, but numbers like this:

some_numeric_variable=1024

Hence, no quotes !!

Regarding upgrading to latest version. Actually, no you do not have to do that, at all !!!!

This is not a free support site, but a forum for the repeatable bug reports.

So, regardless of the version that you are using we need a fully repeatable test case in order to be able to verify the bug.

Fully repeatable test case is a set of commands and statements that leads to the problem that you described.

Thanks in advance for the test case. When we get it, we shall try to repeat it.

Hello Sinisa. Thanks again for the reply.

We will take into consideration your advice regarding the variables.

Back the bug report:

1) I can see that on the other bug report I provided https://bugs.mysql.com/bug.php?id=82734 were I directly pointed that the code is still the same in 8.0 you verified the issue as:

"This is now a fully verified feature request."

That report explicitly correlates with the issue we are experiencing.

2) Regarding the fully repeatable case we already provided one. And it was also provided one in bug report https://bugs.mysql.com/bug.php?id=82736

"Exercise InnoDB crash recovery while generated columns based on JSON data are in use."

If this is not enough, I am happy to GDB and debug whatever you request to validate the problem if the full backtrace is not enough.

Hi,

Bug #82736 does not have ANY repeatable test case. Nor does this bug report has any repeatable test case.

They both have stacktraces. This does not help us at all.

What we need, in your case, is CREATE TABLE statement, INSERT statements that will insert sufficient number of rows to repeat the problem. Next, you should let us know how to crash the server, in which moment, exactly and what kind of crash recovery have you done. From backup, or was it automatic crash recovery by InnoDB storage engine. We must be provided with all info, so that we can repeat the problem. 

We can verify the bug, only if we can repeat the problem.

We also accept bugs based on the code analysis, so your report # 82734, it is fully verified, copied into our internal bug repository and will await for the scheduling.

If you are 100 % sure that the code enhancement in #82734 causes this bug, then we will close this bug as a duplicate and will check whether #82734 can be turned into a bug with severity higher then S4.

Hello Sinisa, thanks for your updates.

I've been trying to debug with GDB on the process when this happens.

I can clearly see that when the call for the memory allocation happens this comes from the value of "len"

(gdb) f 4
#4  mem_heap_dup (heap=<optimized out>, data=0x2ab8fbbe9d26, len=18446744073709544154) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:59
59	/local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc: No such file or directory.
(gdb) p undo_rec
No symbol "undo_rec" in current context.
(gdb) p len
$30 = 18446744073709544154

This value of len comes from:

#5  0x00000000010471e3 in trx_undo_rec_copy (heap=0x2ae98a42f018, undo_rec=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/include/trx0rec.ic:99

This call, in the code, is the part of the code mentioned on #82734

 84 /***********************************************************************//**
 85 Copies the undo record to the heap.
 86 @return own: copy of undo log record */
 87 UNIV_INLINE
 88 trx_undo_rec_t*
 89 trx_undo_rec_copy(
 90 /*==============*/
 91         const trx_undo_rec_t*   undo_rec,       /*!< in: undo log record */
 92         mem_heap_t*             heap)           /*!< in: heap where copied */
 93 {
 94         ulint           len;
 95
 96         len = mach_read_from_2(undo_rec)      ###############################--> Here len is calculated
 97                 - ut_align_offset(undo_rec, UNIV_PAGE_SIZE);
 98         ut_ad(len < UNIV_PAGE_SIZE);
 99         return((trx_undo_rec_t*) mem_heap_dup(heap, undo_rec, len));     --> Here is used for the memory duplication call
100 }

Is this enough information to validate this as a bug or more information is needed?

Full bt for reference:
(gdb) bt
#0  allocate (throw_on_error=false, set_to_zero=false, file=0x15c79f0 "/local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc", hint=0x0, n_elements=18446744073709544272, this=<optimized out>)
    at /local/mysql-5.7.16.R2/storage/innobase/include/ut0new.h:363
#1  mem_heap_create_block_func (heap=heap@entry=0x2ae98a42f018, n=<optimized out>, type=0) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:302
#2  0x0000000000f53c7c in mem_heap_add_block (heap=0x2ae98a42f018, n=n@entry=18446744073709544154) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:408
#3  0x0000000000f53e2c in mem_heap_alloc (n=18446744073709544154, heap=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/include/mem0mem.ic:204
#4  mem_heap_dup (heap=<optimized out>, data=0x2ab8fbbe9d26, len=18446744073709544154) at /local/mysql-5.7.16.R2/storage/innobase/mem/mem0mem.cc:59
#5  0x00000000010471e3 in trx_undo_rec_copy (heap=0x2ae98a42f018, undo_rec=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/include/trx0rec.ic:99
#6  trx_undo_get_undo_rec_low (roll_ptr=<optimized out>, heap=heap@entry=0x2ae98a42f018, is_redo_rseg=is_redo_rseg@entry=true) at /local/mysql-5.7.16.R2/storage/innobase/trx/trx0rec.cc:2124
#7  0x0000000001047663 in trx_undo_prev_version_build (index_rec=index_rec@entry=0x2ab8fd5fe246 "\233VqJ", index_mtr=index_mtr@entry=0x2aeac7a01970, rec=0x2ab8fd5fe246 "\233VqJ",
    index=index@entry=0x2ae9893b6188, offsets=offsets@entry=0x2ae98a42f588, heap=<optimized out>, old_vers=old_vers@entry=0x2aeac7a013c0, v_heap=v_heap@entry=0x0, vrow=vrow@entry=0x2aeac7a018d0,
    v_status=v_status@entry=3) at /local/mysql-5.7.16.R2/storage/innobase/trx/trx0rec.cc:2258
#8  0x0000000001007b7d in row_vers_vc_matches_cluster (in_purge=in_purge@entry=true, rec=rec@entry=0x2ab8fd5fe246 "\233VqJ", row=row@entry=0x2ae98a440888, ext=<optimized out>,
    clust_index=clust_index@entry=0x2ae9893b6188, clust_offsets=clust_offsets@entry=0x2ae98a42f588, index=index@entry=0x2ae9893b8e88, ientry=ientry@entry=0x2ae98a42e688,
    roll_ptr=roll_ptr@entry=23925402013206650, trx_id=trx_id@entry=41139801512, vrow=vrow@entry=0x2aeac7a018d0, mtr=mtr@entry=0x2aeac7a01970, v_heap=0x0)
    at /local/mysql-5.7.16.R2/storage/innobase/row/row0vers.cc:702
#9  0x0000000001009599 in row_vers_old_has_index_entry (also_curr=also_curr@entry=1, rec=0x2ab8fd5fe246 "\233VqJ", mtr=mtr@entry=0x2aeac7a01970, index=index@entry=0x2ae9893b8e88,
    ientry=ientry@entry=0x2ae98a42e688, roll_ptr=23925402013206650, trx_id=41139801512) at /local/mysql-5.7.16.R2/storage/innobase/row/row0vers.cc:947
#10 0x0000000000fdbb50 in row_purge_poss_sec (node=0x2aea98647488, index=index@entry=0x2ae9893b8e88, entry=entry@entry=0x2ae98a42e688) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:264
#11 0x0000000001094f49 in btr_cur_search_to_nth_level (index=index@entry=0x2ae9893b8e88, level=level@entry=0, tuple=tuple@entry=0x2ae98a42e688, mode=mode@entry=PAGE_CUR_LE, latch_mode=2,
    latch_mode@entry=8194, cursor=cursor@entry=0x2aeac7a02e40, has_search_latch=has_search_latch@entry=0, file=file@entry=0x15d1010 "/local/mysql-5.7.16.R2/storage/innobase/row/row0row.cc",
    line=line@entry=1073, mtr=0x2aeac7a030a0, mtr@entry=0x2ae9893b8e88) at /local/mysql-5.7.16.R2/storage/innobase/btr/btr0cur.cc:1152
#12 0x0000000000fe0e0f in btr_pcur_open_low (level=0, mode=PAGE_CUR_LE, file=0x15d1010 "/local/mysql-5.7.16.R2/storage/innobase/row/row0row.cc", mtr=0x2ae9893b8e88, line=1073, cursor=0x2aeac7a02e40,
    latch_mode=8194, tuple=0x2ae98a42e688, index=0x2ae9893b8e88) at /local/mysql-5.7.16.R2/storage/innobase/include/btr0pcur.ic:465
#13 row_search_index_entry (index=index@entry=0x2ae9893b8e88, entry=entry@entry=0x2ae98a42e688, mode=mode@entry=8194, pcur=pcur@entry=0x2aeac7a02e40, mtr=mtr@entry=0x2aeac7a030a0)
    at /local/mysql-5.7.16.R2/storage/innobase/row/row0row.cc:1073
#14 0x0000000000fdcee5 in row_purge_remove_sec_if_poss_leaf (node=node@entry=0x2aea98647488, index=index@entry=0x2ae9893b8e88, entry=entry@entry=0x2ae98a42e688)
    at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:471
#15 0x0000000000fde098 in row_purge_remove_sec_if_poss (entry=0x2ae98a42e688, index=0x2ae9893b8e88, node=0x2aea98647488) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:587
#16 row_purge_upd_exist_or_extern_func (node=node@entry=0x2aea98647488, undo_rec=undo_rec@entry=0x2aeac6c61c28 "\tq\f\003\220", <incomplete sequence \330>)
    at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:709
#17 0x0000000000fde5ea in row_purge_record_func (updated_extern=<optimized out>, undo_rec=0x2aeac6c61c28 "\tq\f\003\220", <incomplete sequence \330>, node=<optimized out>)
    at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:987
#18 row_purge (thr=0x2aeac7a03c40, undo_rec=<optimized out>, node=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:1031
#19 row_purge_step (thr=thr@entry=0x2ab8f71a0458) at /local/mysql-5.7.16.R2/storage/innobase/row/row0purge.cc:1110
#20 0x0000000000f8eb29 in que_thr_step (thr=0x6) at /local/mysql-5.7.16.R2/storage/innobase/que/que0que.cc:1056
#21 que_run_threads_low (thr=0x6) at /local/mysql-5.7.16.R2/storage/innobase/que/que0que.cc:1118
#22 que_run_threads (thr=thr@entry=0x2ab8f71a0458) at /local/mysql-5.7.16.R2/storage/innobase/que/que0que.cc:1158
#23 0x0000000001013655 in srv_task_execute () at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:2458
#24 srv_worker_thread (arg=<optimized out>) at /local/mysql-5.7.16.R2/storage/innobase/srv/srv0srv.cc:2508
#25 0x00002ab8f4ff7dc5 in start_thread () from /lib64/libpthread.so.0
#26 0x00002ab8f656cc9d in clone () from /lib64/libc.so.6

Hi Mr. Perez,

Thank you for your efforts, but, again we need to be able to get the same problem, ourselves, with a huge value for that length field.

Hence, we need a fully repeatable test case.

What we need, in your case, is CREATE TABLE statement, INSERT statements
that will insert sufficient number of rows to repeat the problem. Next,
you should let us know how to crash the server, in which moment, exactly
and what kind of crash recovery have you done. From backup, or was it
automatic crash recovery by InnoDB storage engine. We must be provided
with all info, so that we can repeat the problem. 

We can verify the bug, only if we can repeat the problem.

However, iff you are 100 % sure that the code enhancement in #82734 causes this bug, then we will close this bug as a duplicate, because we were able  #82734 to turn it into a bug with higher severity. That bug report contains a code analysis that is thorough and does not just inform us of the value of some variable or field.

Hello Sinisa,

I understand that you want a fully repeatable test case. I would love to provide it to you. In parallel to the crash analysis, I am trying to do so but it is proving challenging to reproduce in an isolated case (I have server that we cant create a replica through a disk backup because its live traffic is causing this all the time).

However and in the other hand, I have shown you the explicit line were this happens on the code, in execution time when the system crashes. And I also provided a full batcktrace with all the code calls to get here. That should be also valid as an evidence as code analysis as it also the same line of code mentioned on #82734.

Hi,

I have tried to repeat your crash. I debugged the code and set a breakpoint exactly where that `len` gets a very high value. Could not repeat it.

Again, if you think that your report is a duplicate of the bug #82734, I will be happy to denote it as a duplicate. 

There is nothing else we can do in this case. Simply, developers do require a fully repeatable test case in order to be able to see the error in the code that causes a behaviour described in (any) bug.

Can't repeat .....

I already wrote that stacktraces are NOT test cases as the cause of the bug could be thousands of lines before the problem happens.

Hello Sinisa,

I know for you If I don't provide a test case you don't value the code analysis and the debug but this bug is quite painful for us and I really think that should be taken into account if hasn't been fixed yet so I would really appreciate if you take the time to read all this update with detail. 

I've been trying to reproduce this and then I could see that the behavior that I was seeing on the DB's that suffer the problem during recovery was not similar to my repros. 

I initially thought that the problem was related to a simple transaction rollback and the undo failing during the rollback but after a deeper look into the code it is far more complex than simple generating transactions and then rolling back. 

The problem happens when the engine is trying to purge a row that has a secondary index over a generated column, in particular, when trying to access the previous version of the secondary index of that generated column. When trying to print the previous version with gdb, I get an out of bounds address.

#8  0x0000000001007b7d in row_vers_vc_matches_cluster (in_purge=in_purge@entry=true, rec=rec@entry=0x2b60364a2246 "\233VqJ", row=row@entry=0x2b9200872c88, ext=<optimized out>, clust_index=clust_index@entry=0x2b90c2fbdf08, clust_offsets=clust_offsets@entry=0x2b9200830088, index=index@entry=0x2b90c2fc0e88, ientry=ientry@entry=0x2b920082f688,roll_ptr=roll_ptr@entry=23925402013206897, trx_id=trx_id@entry=41139801512, vrow=vrow@entry=0x2b92014007f0, mtr=mtr@entry=0x2b9201400890, v_heap=0x0) at /local/mysql-5.7.16.R2/storage/innobase/row/row0vers.cc:702

(gdb) p *&prev_version
$29 = (rec_t *) 0x0

(gdb) x/1s 0x0
0x0:	<Address 0x0 out of bounds>

Code related to this from row0vers.cc:
700                 trx_undo_prev_version_build(
701                         rec, mtr, version, clust_index, clust_offsets,
702                         heap, &prev_version, NULL, vrow, status);

Detailed. Code calls and events in order of execution:

## In this calls we can see the purge row calls happening. 
1) row_purge_step -> row_purge(node, purge_rec->undo_rec, thr); in #row0purge.cc:1110
2) row_purge ->  bool purged = row_purge_record(node, undo_rec, thr, updated_extern); in #row0purge.cc:1031
3) row_purge_record_func -> row_purge_upd_exist_or_extern(thr, node, undo_rec); in #row0purge.cc:987
4) row_purge_upd_exist_or_extern_func -> row_purge_remove_sec_if_poss(node, node->index, entry); in #row0purge.cc:709

## In this next 2 calls is were the call for the removal of the secondary index happens
5) row_purge_remove_sec_if_poss -> if (row_purge_remove_sec_if_poss_leaf(node, index, entry)) {return;} in #row0purge.cc:587
6) row_purge_remove_sec_if_poss_leaf -> search_result = row_search_index_entry(ndex, entry, mode, &pcur, &mtr); in #row0purge.cc:471

## The next 3 calls seem related to traverse the index finding the tuple.
7) row_search_index_entry -> btr_pcur_open(index, entry, PAGE_CUR_LE, mode, pcur, mtr); in #row0row.cc:1073
8) btr_pcur_open_low ->  btr_cur_search_to_nth_level(index, level, tuple, mode, latch_mode,btr_cursor, 0, file, line, mtr); in #btr0pcur.ic:465
9) btr_cur_search_to_nth_level -> if (!row_purge_poss_sec(cursor->purge_node,ndex, tuple)) in #btr0cur.cc:1152

## In the next calls we can see how it will try to identify the index entry and rebuild it from a previous version. 

10) row_purge_poss_sec in #row0purge.cc:264
    260         can_delete = !row_purge_reposition_pcur(BTR_SEARCH_LEAF, node, &mtr)
    261                 || !row_vers_old_has_index_entry(TRUE,
    262                                                  btr_pcur_get_rec(&node->pcur),
    263                                                  &mtr, index, entry,
    264                                                  node->roll_ptr, node->trx_id);

11) row_vers_old_has_index_entry ->  in #row0vers.cc:947
    947                                 if (row_vers_vc_matches_cluster(
    948                                         also_curr, rec, row, ext, clust_index,
    949                                         clust_offsets, index, ientry, roll_ptr,
    950                                         trx_id, NULL, &vrow, mtr)) {
    951                                         mem_heap_free(heap);

## In the next call is when I can see with gdb that &prev_version is containing an address that is out of bounds. 
12) row_vers_vc_matches_cluster -> in #row0vers.cc:702

    700                 trx_undo_prev_version_build(
    701                         rec, mtr, version, clust_index, clust_offsets,
    702                         heap, &prev_version, NULL, vrow, status);

(gdb) p *&prev_version
$29 = (rec_t *) 0x0

(gdb) x/1s 0x0
0x0:	<Address 0x0 out of bounds>

13) trx_undo_prev_version_build -> in #trx0rec.cc:2258
When looking into trx_undo_prev_version_build we can see that &prev_version = rec_t * and this is defined by trx_undo_prev_version_build as:

2198         rec_t**         old_vers,/*!< out, own: previous version, or NULL if
2199                                 rec is the first inserted version, or if
2200                                 history data has been deleted (an error),
2201                                 or if the purge COULD have removed the version
2202                                 though it has not yet done so */

This should never be an address out of bound. In the worst case scenario should be a NULL if it was deleted.

For what I can understand, seems that the code is trying to access an undo record that it does not exist anymore and that is why it is returning "Address out of bounds". 

Something cleaned up the undo record before the secondary index was able to update it. I've found a similar behavior but in a different circumstance (temporary tables) on this Jira from MariaDB. https://jira.mariadb.org/browse/MDEV-15374 

In the JIRA Thirunarayanan Balathandayuthapani mentions at the very end:

Problem:
InnoDB cleans all temporary undo logs during commit. During rollback of secondary index entry, InnoDB tries to build the previous version of clustered index.
It leads to access of freed undo page during previous transaction commit and it leads to undo log corruption.

This behavior seems to match what I see while debugging but I am uncertain on how to reproduce this behavior for generated columns. Maybe after this analysis you can understand why is being hard for me to reproduce and I maybe can help you find a way to reproduce it yourself as you may have a deeper understanding on how the undo could be freed in this scenario.

Hi,

I understand that this issue is producing lots of problems for you. However, ......

We still need a reproducible  test case or a full-fledged code analysis.

Still, since you insist, I will consult the engineers in Development who maintain the code and may be they spot something that might help them locate a problem. But, if this is not sufficient info even for them, then we can only wait for test case or code analysis.

Hi Mr. Perez,

Unfortunately, I have bad news. Several of our developers analysed all that you wrote and, in short, came up with the following conclusions:

--------------------------------------------------------------------

It seems that the undo log is corrupted.  How that happens would be only a guess.

The real crash happens in this function;

/** Copies the undo record to the heap.
 @return own: copy of undo log record */
UNIV_INLINE
trx_undo_rec_t *trx_undo_rec_copy(
    const trx_undo_rec_t *undo_rec, /*!< in: undo log record */
    mem_heap_t *heap)               /*!< in: heap where copied */
{
  ulint len;

  len = mach_read_from_2(undo_rec) - ut_align_offset(undo_rec, UNIV_PAGE_SIZE);
  ut_ad(len < UNIV_PAGE_SIZE);
  return ((trx_undo_rec_t *)mem_heap_dup(heap, undo_rec, len));
}

So it looks like mach_read_from_2(undo_rec) returned the value of 1 from the undo log page.

So why was this len value set to 1 in the undo log?

We simply can not do anything without a repeatable test case.

--------------------------------------------------------------------

Hence, we truly require a repeatable test case.

Thanks in advance.

If anyone hits this problem. Seems that the commit that fix this issue could be this one:

https://github.com/mysql/mysql-server/commit/1dec14d346ac55fe72989dccb071f84b3b0d3bd6#diff...

Is there a reason why this particular commit from MySQL 8 has not been merged to MySQL 5.7.

We have reason to believe that this bug is caused by undo log segments being reused by other transactions even though these are needed by other undo logs processed by the purge thread.

https://github.com/mysql/mysql-server/commit/19746679399649c1898ff165e51edaf5229b89c0 - This is the comment i am referring to

We did a investigation into this incident by parsing the undo log and redo logs of the crashed server. We believe this commit in 5.7.18 fixes it: https://github.com/mysql/mysql-server/commit/173e171c6fb55f064eea278c76fbb28e2b1c757b

Here is a summary:
1. There is a delete followed by an immediate insert to the same key on a table which has a index on generated columns.
2. The commit trx no according to the redo log are:
Delete : 2921666617
Insert : 2921666622
3. The insert was translated to a update. As a result both insert and delete used update undo logs.
4. Update undo pages are allocated to both delete and insert.
5. Delete was allocated a rseg = 70 and undo page = 9127
6. Insert was allocated a rseg = 74 and undo page = 556.
7. For the undo page which had insert (556)
The history length for the page reaches 22 entries.
The undo page occupies more than 3/4 size of the physical page.
The undo state on the page page is marked as TRX_UNDO_PURGE indicating that
the rseg should be freed completely whenever purge thread decides to run
trx_purge_truncate_history
8. For the undo page which had update (9127)
The page does not get filled as much compared to page (556)
Each undo entry in this page is marked as TRX_UNDO_CACHED.
The history length is intact and contains the transaction for Insert
(2921666622), indicating that history length compaction is not yet performed
by the purge thread.
9. InnoDB does not rewrite the roll ptr of a page after an undo log has finished
processing. This means that if an undo page was freed and later another undo
path accesses it skipping the purge view semantics, it will read a undo page
which does not belong it and hence reads garbage data.
10. For DELETE, the following events were registered in the Redo Log:
MLOG_TRX_UNDO_DEL_MARK_REC
MLOG_REC_SEC_DELETE_MARK
MLOG_COMP_REC_CLUST_DELETE_MARK
11. For INSERT, the following events were registered in the Red Log:
MLOG_TRX_UNDO_UPD_DEL_REC
MLOG_COMP_REC_UPDATE_IN_PLACE
MLOG_COMP_REC_UPDATE_IN_PLACE
12. Prior to version 5.7.18, trx_purge_truncate_history only cleaned up undo
pages which was marked as TRX_UNDO_PAGE in the undo state. The function does
not change the history length for other pages whose undo state !=
TRX_UNDO_PAGE
13. trx_purge_truncate_history was called and page 556 was truncated out
by the purge system and was and assigned to another reseg.
14. Because the undo page for 9127 was not full, even though the undo log was
processed the history length was never updated. (default behaviour prior to
5.7.18)
15. Server was shut down and the node was brought up, because page 9127 still
had its history list intact, the bootstrap of the innodb system brought this
page into the purge queue. (they were already processed before the shutdown)
16. The undo log in page 9127 were processed in the purge queue after restart,
the undo log processing fetched clustered index and then fetched the older
roll ptr corresponding to page (556), which was already truncated and as a
result when this roll ptr (page 556) was fetched it read a garbage value
17. This brings down the MySQL as it manifests as a large allocation.
18. We haven't seen any crashes in 5.7.30 and will continue to monitor it

Thank you for your feedback.

Also, always try to use latest release available ......