MySQL Bugs: #109677: Crashing after upgrade to MySQL 8.0.31

Bug #109677	Crashing after upgrade to MySQL 8.0.31
Submitted:	18 Jan 2023 10:26	Modified:	8 Feb 2023 20:07
Reporter:	Chris Hynes	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S1 (Critical)
Version:	8.0.31	OS:	Linux
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
We recently upgraded to MySQL 8.0.31 on RDS. The upgrade went fine, but a couple days later the instance crashed because of data corruption and wouldn't restart. We ended up having to go boot in innodb_force_recovery, take a logical dump, and restore back to an older version.

This was not the same data corruption problem from 8.0.29 as we have not run any DDL since the upgrade, so it must be something new and unique to 8.0.31.

Thus far we have not been able to reproduce the issue in a non-production environment and only have the logs noted in "how to repeat".

Is this a known issue in 8.0.31 or is there additional research we can do to help trace the bug?

How to repeat:
2022-12-26T16:23:48.789416Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2022-12-26T16:23:49.642847Z 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: btr0cur.cc:3660:rec thread 23121744611072
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com  .
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html  
InnoDB: about forcing recovery.
2022-12-26T16:23:49Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x150779052000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...

===

2022-12-26T22:38:21.113187Z 303 [ERROR] [MY-012855] [InnoDB] Clustered record for sec rec not found index `IX_UserPayment_QueryFields` of table `ownerrez_prod`.`UserPayment`
InnoDB: sec index record PHYSICAL RECORD: n_fields 3; compact format; info bits 0
0: len 4; hex 14b4c728; asc (;;
1: SQL NULL;
2: len 4; hex 0002ac3a; asc :;;

InnoDB: clust index record PHYSICAL RECORD: n_fields 25; compact format; info bits 128
0: len 4; hex 0002ac39; asc 9;;
1: len 6; hex 0000fd37aa5e; asc 7 ^;;
2: len 7; hex 0200007dff2b60; asc } +`;;
3: len 4; hex 14b53248; asc 2H;;
4: len 4; hex 00099425; asc %;;
5: len 1; hex 01; asc ;;
6: SQL NULL;
7: SQL NULL;
8: SQL NULL;
9: len 7; hex 99aeb5350b1b44; asc 5 D;;
10: len 5; hex 8000008300; asc ;;
11: SQL NULL;
12: len 4; hex 002a0661; asc * a;;
13: len 5; hex 8000000000; asc ;;
14: len 5; hex 8000008300; asc ;;
15: SQL NULL;
16: len 4; hex 00000000; asc ;;
17: SQL NULL;
18: SQL NULL;
19: len 5; hex 99aeb5350c; asc 5 ;;
20: len 4; hex 14b53248; asc 2H;;
21: len 5; hex 99aeb5350d; asc 5 ;;
22: len 4; hex 14b53248; asc 2H;;
23: SQL NULL;
24: SQL NULL;

TRANSACTION 4248447749, ACTIVE 1 sec starting index read
mysql tables in use 1, locked 0
12 lock struct(s), heap size 1128, 14 row lock(s), undo log entries 10
MySQL thread id 303, OS thread handle 22368288478976, query id 1466843 10.0.9.202 ownerrez_prodsecure executing
select coalesce(sum(NetAmount), 0) from UserPayment where DeletedUtc is null and UserId = 347391784

===

2022-12-26T23:51:00.137207Z 14 [System] [MY-010562] [Repl] Slave I/O thread for channel '': connected to master 'rdsrepladmin@172.17.3.60:3306',replication started in log 'mysql-bin-changelog.000044' at position 4322927
2022-12-26T23:51:00.272166Z 16 [ERROR] [MY-013183] [InnoDB] Assertion failure: row0ins.cc:267:!cursor->index->is_committed() thread 22751385995008
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com .
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html 
InnoDB: about forcing recovery.
2022-12-26T23:51:00Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x14ae7a223000
Attempting backtrace. You can use the following i
00:08:21 AM Arun: 112: Trying to get some variables.
113: Some pointers may be invalid and cause the dump to abort.
114: Query (14bf1f71b636): INSERT INTO ChangeScope ( UserId, ActionUserId, ActionImpersonatedFromUserId, ApiApplicationId, ApiAccessTokenId, IpAddress, UserAgentId, EntrySource, Url, EntityCount, FieldCount, FieldContentLength, Duration, S3Key, CreatedUtc) VALUES (  347418793,  NULL,  NULL,  NULL,  NULL,  '34.239.0.3',  80,  1,  '/3',  2,  0,  NULL,  94,  NULL,  timestamp('2022-12-26 22:37:57.070563'))

Can we see table structure of ChangeScope ?
From which version did you upgrade.  i.e.  what was the version that last did a full rebuild of this table?

Any progress on tracking down the issue?

There are additional reports of this same bug here: https://repost.aws/questions/QUZZaycd4OSY2k8iLNVFuXvA

Hi,

RDS uses their own "storage system" under the InnoDB and this looks like something RDS exclusive. We cannot reproduce this using supported MySQL binary so this is something AWS needs to fix.

The only way we could be involved is if you can give us a reproducible test case that will reproduce this issue using supported MySQL binary.

For contribution, we have received this problem 2 times this week...

```
[ERROR] [MY-013183] [InnoDB] Assertion failure: btr0cur.cc:3660:rec thread 70398784480976
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
2023-01-26T12:28:54Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x4006a5ceb000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 4006fe8a9e60 thread_stack 0x40000
/rdsdbbin/mysql/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x30) [0x1f8f990]
/rdsdbbin/mysql/bin/mysqld(print_fatal_signal(int)+0x31c) [0xfb83fc]
/rdsdbbin/mysql/bin/mysqld(my_server_abort()+0x98) [0xfb85d8]
/rdsdbbin/mysql/bin/mysqld(my_abort()+0x14) [0x1f884f4]
/rdsdbbin/mysql/bin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x27c) [0x22ba11c]
/rdsdbbin/mysql/bin/mysqld(btr_cur_optimistic_update(unsigned long, btr_cur_t*, unsigned long**, mem_block_info_t**, upd_t const*, unsigned long, que_thr_t*, unsigned long, mtr_t*)+0x74c) [0x22fa4cc]
/rdsdbbin/mysql/bin/mysqld() [0x2486170]
/rdsdbbin/mysql/bin/mysqld() [0x24865fc]
/rdsdbbin/mysql/bin/mysqld(row_undo_mod(undo_node_t*, que_thr_t*)+0xcec) [0x2488c2c]
/rdsdbbin/mysql/bin/mysqld(row_undo_step(que_thr_t*)+0x58) [0x22449d8]
/rdsdbbin/mysql/bin/mysqld(que_run_threads(que_thr_t*)+0x688) [0x21f0188]
/rdsdbbin/mysql/bin/mysqld() [0x2299b2c]
/rdsdbbin/mysql/bin/mysqld() [0x229a8c0]
/rdsdbbin/mysql/bin/mysqld(trx_rollback_for_mysql(trx_t*)+0xd8) [0x229ccd8]
/rdsdbbin/mysql/bin/mysqld() [0x20f1f00]
/rdsdbbin/mysql/bin/mysqld(ha_rollback_low(THD*, bool)+0xf4) [0x10b3374]
/rdsdbbin/mysql/bin/mysqld(trx_coordinator::rollback_in_engines(THD*, bool)+0x30) [0xf68f10]
/rdsdbbin/mysql/bin/mysqld(MYSQL_BIN_LOG::rollback(THD*, bool)+0xec) [0x1be7f0c]
/rdsdbbin/mysql/bin/mysqld(ha_rollback_trans(THD*, bool)+0x90) [0x10b35b0]
/rdsdbbin/mysql/bin/mysqld(trans_rollback(THD*)+0x70) [0xf6c110]
/rdsdbbin/mysql/bin/mysqld(mysql_execute_command(THD*, bool)+0x2670) [0xe61b30]
/rdsdbbin/mysql/bin/mysqld(dispatch_sql_command(THD*, Parser_state*)+0x340) [0xe63380]
/rdsdbbin/mysql/bin/mysqld(dispatch_command(THD*, COM_DATA const*, enum_server_command)+0x15c0) [0xe64e60]
/rdsdbbin/mysql/bin/mysqld(do_command(THD*)+0x1e0) [0xe66160]
/rdsdbbin/mysql/bin/mysqld() [0xfaa2a8]
/rdsdbbin/mysql/bin/mysqld() [0x252fe90]
/lib64/libpthread.so.0(+0x722c) [0x40003b95f22c]
/lib64/libc.so.6(+0xd4a1c) [0x40003bb53a1c]
Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (4006dc34d030): rollback
Connection ID (thread ID): 51286
Status: NOT_KILLED
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
```

AWS is suggesting that they have no bug in the storage layer, but that the issue was caused by an ALGORITHM=INSTANT column added in 8.0.27 before the upgrade that then caused an incorrect record size to be calculated in 8.0.31 later on -- related to this fix: https://github.com/mysql/mysql-server/commit/1070fe24851c41ac127bee2239b5c94296fcb464

Is that a possibility here, that the bug was caused by an alter in v 8.0.27 but not revealed until the upgrade to 8.0.31? And that will be fixed in 8.0.32?

I found the bug AWS is suggesting in the release notes: "After a column added with ALGORITHM=INSTANT, an online rebuild DDL no longer crashes."

This seems unlikely to be the issue here since we didn't run any DDL after the upgrade. Is there possibly a related issue?

Hi,

> AWS is suggesting that they have no bug in the storage layer

So do we. Difference is, if you use our binary you do not have that problem and if you use RDS there is a problem so I tend to believe issue is RDS specific and since Amazon never released any of their changes to MySQL and only take take take and never give back anything we can't do much about that.

> This seems unlikely to be the issue

Seems like AWS is trying to blame this on anyone but them (this time us). If we cannot reproduce the problem we are not likely going to fix it. And since we know of noone who is experiencing same problem as you when using our binaries there's really not much we can do.

If AWS engineers are sure, they can easily make a reproducible test case with our binary and report a bug (such bugs are solved rather quickly)