Bug #82736  Undo log corruption when using generated columns
Submitted: 26 Aug 2016 3:01    Modified: 8 Apr 14:00
Reporter: Jeremy Cole (Basic Quality Contributor)
Status: Can't repeat
Category: MySQL Server: InnoDB storage engine    Severity: S2 (Serious)
Version: 5.7.11    OS: Any
Assigned to:    CPU Architecture: Any
Tags: generated columns, innodb, json, undo

[26 Aug 2016 3:01] Jeremy Cole
Description:
First: Sorry for the type of report I am filing. I usually prefer to have a reproducible test case, but have thus far been unable to reproduce this myself.

I have recently seen some instances of undo log corruption, resulting in an unrecoverable database, which appears to be caused by usage of generated columns. The actual failure occurs, in our case and using an optimized production binary, due to a failure to allocate approximately 2^64 bytes of memory while reading/applying undo log records in the background purge thread. This happens shortly after crash recovery and results in a crash loop there. This is the resulting stack trace:

#0   ib::fatal_or_error::~fatal_or_error() ()
#1   mem_heap_create_block_func(mem_block_info_t*, unsigned long, unsigned long) ()
#2   mem_heap_add_block(mem_block_info_t*, unsigned long) ()
#3   mem_heap_dup(mem_block_info_t*, void const*, unsigned long) ()
#4   trx_undo_get_undo_rec_low(unsigned long, mem_block_info_t*, bool) ()
#5   trx_undo_prev_version_build(unsigned char const*, mtr_t*, unsigned char const*, dict_index_t*, unsigned long*, mem_block_info_t*, unsigned char**, mem_block_info_t*, dtuple_t const**, unsigned long) ()
#6   row_vers_old_has_index_entry(unsigned long, unsigned char const*, mtr_t*, dict_index_t*, dtuple_t const*, unsigned long, unsigned long) ()
#7   row_purge_poss_sec(purge_node_t*, dict_index_t*, dtuple_t const*) ()
#8   btr_cur_search_to_nth_level(dict_index_t*, unsigned long, dtuple_t const*, page_cur_mode_t, unsigned long, btr_cur_t*, unsigned long, char const*, unsigned long, mtr_t*) ()
#9   row_search_index_entry(dict_index_t*, dtuple_t const*, unsigned long, btr_pcur_t*, mtr_t*) ()
#10  row_purge_remove_sec_if_poss(purge_node_t*, dict_index_t*, dtuple_t const*) ()
#11  row_purge_step(que_thr_t*) ()
#12  que_run_threads(que_thr_t*) ()
#13  srv_worker_thread ()
#14  start_thread ()
#15  clone ()

When running with a debug binary (and after the instance had crash-looped, so it may not have been in the same state as it was initially), we get an assertion here:

InnoDB: Assertion failure in thread 140674743645952 in file rem0rec.ic line 1205
InnoDB: Failing assertion: n < rec_offs_n_fields(offsets)

Prior to researching this crash, I had no idea about the impact that generated columns would have on undo logging, but wow. I strongly suspect that something is still not right there, and that this breaks crash recovery in some way when generated columns are in use.

For some more context, in case it helps: In this case the user used JSON_EXTRACT() from JSON-typed columns, to bring individual values (BIGINT, VARCHAR, etc.) out as indexed generated columns.
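To make the described setup concrete, a table along these lines would match the report (a hypothetical schema for illustration only, not the user's actual table: the table name, column names, and JSON paths are all assumed):

```sql
-- Hypothetical schema: scalar values pulled out of a JSON column into
-- generated columns, which are then covered by secondary indexes.
CREATE TABLE events (
  id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY,
  doc JSON NOT NULL,
  user_id BIGINT
    AS (JSON_EXTRACT(doc, '$.user_id')) VIRTUAL,
  event_type VARCHAR(32)
    AS (JSON_UNQUOTE(JSON_EXTRACT(doc, '$.type'))) VIRTUAL,
  KEY idx_user_id (user_id),
  KEY idx_event_type (event_type)
);
```

Indexed virtual columns are the relevant case here, since InnoDB must log their values in the undo log so that the purge thread can later locate and remove stale secondary index entries.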

In the process of looking into this bug, we also found that some of the affected code is unnecessarily fragile, hence the memory allocation error rather than something more sensible. This whole code path could probably use some cleanup for clarity and provable correctness. Please also see: http://bugs.mysql.com/bug.php?id=82734

How to repeat:
Exercise InnoDB crash recovery while generated columns (potentially based on JSON data?) are in use.
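A workload along these lines might exercise the suspect code path (purely a sketch, assuming a table like the one described above with a JSON column and indexed generated columns; table and column names are hypothetical). DML on such rows writes the generated-column values into undo records, and the background purge thread reads them back after recovery:

```sql
-- Sketch only: generate undo records that carry virtual-column values.
INSERT INTO events (doc) VALUES ('{"user_id": 1, "type": "click"}');
UPDATE events SET doc = JSON_SET(doc, '$.type', 'view') WHERE id = 1;
DELETE FROM events WHERE id = 1;
-- Then kill mysqld uncleanly (e.g. kill -9) before purge catches up,
-- restart, and let the purge threads apply the accumulated undo
-- records during/after crash recovery.
```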

Suggested fix:
Any advice or pointers are welcome.
[26 Aug 2016 14:44] Sinisa Milivojevic
Hi Jeremy,

Thank you for your bug report. However, many bugs related to virtual columns have been fixed since 5.7.11; 5.7.13 and 5.7.14 in particular fixed a number of bugs that resulted in assertions.

Hence, could you test your data with 5.7.14, or even better 5.7.15 when it comes out? That would save a great deal of trouble, especially for you, since you would otherwise have to devise a repeatable test case.

You could also read our Changelog since 5.7.11 and see if any bug looks like yours.
[26 Aug 2016 15:52] Jeremy Cole
I would love to try 5.7.14, but the changes between .11 and .14 have so far caused us many troubles in merging, so I am not yet able to do so. Even so, the problem is not exactly reproducible; it depends on a more or less unknown workload from an application outside our control. I have examined all of the changes between .11 and .14 which could plausibly be related, but did not see anything that would likely fix this issue. Hence this vague bug. ;)
[26 Aug 2016 17:12] Sinisa Milivojevic
Jeremy,

.15 should not be far off either, so give it a try...

That is the best we can do when we can't have a test case, no matter how large or complicated.

You could also examine the public bugs mentioned in the Changelog, and that might give you some ideas...
[27 Sep 2016 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[5 Mar 13:00] Sinisa Milivojevic
Jeremy,

Is there any feedback from you?

Have you been able to repeat the corruption with the latest 5.7?