MySQL Bugs: #108036: Server assertion crash after upgrade on 8.0.30

Bug #108036	Server assertion crash after upgrade on 8.0.30
Submitted:	29 Jul 2022 19:50	Modified:	8 Nov 2023 14:43
Reporter:	Pavel Dvořák	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
2022-07-29T19:24:46.458710Z 1 [ERROR] [MY-013183] [InnoDB] Assertion failure: log0files_governor.cc:1818:header.m_log_flags == log.m_log_flags thread 139856577439296
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
19:24:46 UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x5635cd228ed0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f32e59e1bf0 thread_stack 0x100000
/usr/sbin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x3d) [0x5635c860461d]
/usr/sbin/mysqld(print_fatal_signal(int)+0x2f3) [0x5635c7518ba3]
/usr/sbin/mysqld(my_server_abort()+0x7e) [0x5635c7518cee]
/usr/sbin/mysqld(my_abort()+0xa) [0x5635c85fe7fa]
/usr/sbin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x336) [0x5635c88e8c96]
/usr/sbin/mysqld(+0x22df67c) [0x5635c879867c]
/usr/sbin/mysqld(+0x22e0405) [0x5635c8799405]
/usr/sbin/mysqld(log_files_persist_flags(log_t&, unsigned int)+0x50) [0x5635c8799f40]
/usr/sbin/mysqld(log_persist_disable(log_t&)+0x97) [0x5635c87b0987]
/usr/sbin/mysqld(mtr_t::Logging::disable(THD*)+0x2bf) [0x5635c87da7df]
/usr/sbin/mysqld(log_sys_init(bool, unsigned long, unsigned long&)+0x2c69) [0x5635c87aeb69]
/usr/sbin/mysqld(srv_start(bool)+0x10fe) [0x5635c889eaee]
/usr/sbin/mysqld(+0x22327b5) [0x5635c86eb7b5]
/usr/sbin/mysqld(dd::bootstrap::DDSE_dict_init(THD*, dict_init_mode_t, unsigned int)+0x9a) [0x5635c83a69ea]
/usr/sbin/mysqld(dd::upgrade_57::do_pre_checks_and_initialize_dd(THD*)+0x1a5) [0x5635c85d7485]
/usr/sbin/mysqld(+0x1120fa6) [0x5635c75d9fa6]
/usr/sbin/mysqld(+0x26ba0c5) [0x5635c8b730c5]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7d80) [0x7f32f2a8cd80]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x3f) [0x7f32f2121baf]

Rollback to .29 fixed it.
Not sure which logs are meant, removing bin log files didn't help.

How to repeat:
upgrade mysql on Debian Bullseye

The error message is about Redo Log.
Redo log files should not be manually deleted, because they contain part of modifications to your tables which was not yet applied to tablespacefiles - i.e. you risk data loss removing them.

Could you please describe the scenario a bit more?
1. Did you perform a clean (a.k.a slow) shutdown before upgrading?
2. Was this error observed in first run of the 8.0.30 binary, or after one of subsequent restarts?
3. Could you share the results of `ls -lh #innodb_redo`?
4. Could you also share the first 512 bytes of the #innodb_redo/#ib_redo[0-9]+ file with highest number (ignoring those with _tmp suffix)? I mean:
```
Find last file redo log file:
$ ls '#innodb_redo/' | grep '[0-9]$' | sort -to -n -k2 | tail -n1
Get first 512 bytes of it:
$ xxd -l 512 '#innodb_redo/#ib_redo11'
```
5. What filesystem are you using for the '#innodb_redo' directory?
6. Could you please share the whole error log?
7. Was there a core dump file generated during the failure?

While waiting for more info from reporter, I did some code analysis.
The callstack suggests we were executing this part:
```
void log_persist_disable(log_t &log) {
  log_update_flags(log, [](Log_flags &log_flags) {
    log_file_header_set_flag(log_flags, LOG_HEADER_FLAG_NO_LOGGING);
    log_file_header_set_flag(log_flags, LOG_HEADER_FLAG_CRASH_UNSAFE);
  });
}
```
which calls `log_update_flags(..)` which does::
```
  Log_flags log_flags = log.m_log_flags;
  update_function(log_flags);

  const dberr_t err = log_files_persist_flags(log, log_flags);
```
which calls `log_files_persist_flags(..)` which does:
```
  const dberr_t err = log_files_rewrite_newest_header(
      log, [&](Log_file_id, Log_file_header &header) {
        ut_a(header.m_log_flags == log.m_log_flags);
        header.m_log_flags = log_flags;
      });
```
which calls `log_files_rewrite_newest_header(..)` which does:
```
  auto it = log.m_files.end();
  --it;

  auto header = log_files_prepare_header(log, *it);

  update_callback(it->m_id, header);
```
where `log_files_prepare_header(..)` does:
```
  auto file_header = log_files_prepare_header(log, file.m_start_lsn);
  if (file.m_full) {
    log_file_header_set_flag(file_header.m_log_flags,
                             LOG_HEADER_FLAG_FILE_FULL);
  }
  return file_header;
```
where `log_files_prepare_header(..)` does:
```
  file_header.m_log_flags = log.m_log_flags;
```
The assertion failure is in the `update_handler` passed to `log_files_rewrite_newest_header(..)`
```
ut_a(header.m_log_flags == log.m_log_flags);
```
As you can see, the `header` in this assert is the one returned by `log_files_prepare_header(...*it)`, which in turn is the `file_header` prepared by `log_files_prepare_header(..file.m_start_lsn)` which in turn should have `file_header.m_log_flags = log.m_log_flags`, except for one tinny bit:
```
  if (file.m_full) {
    log_file_header_set_flag(file_header.m_log_flags,
                             LOG_HEADER_FLAG_FILE_FULL);
  }
```
This suggests that the failure of assert either means that:
a) I've misunderstood the code
b) The file.m_full was true, and thus the header flags were different by one bit (LOG_HEADER_FLAG_FILE_FULL) than expected
c) "The last file" has changed meaning between the calls (this shouldn't happen I think, as we hold log_files_mutex_enter while in `log_update_flags`)
d) something else

I bet on b) now.

Also, please note that being in this place in code during boot suggests, that the server was restarted while having redo disabled, so it is not safe to crash!
According to https://dev.mysql.com/doc/refman/8.0/en/innodb-redo-log.html 
> This feature is intended only for loading data into a new MySQL instance. Do not disable redo logging on a production system. It is permitted to shutdown and restart the server while redo logging is disabled, but an unexpected server stoppage while redo logging is disabled can cause data loss and instance corruption. 

So it might be difficult to help you in this case to restore the system.
Did you use  ALTER INSTANCE DISABLE INNODB REDO_LOG statement before shutdown/crash?

One possibility is that you've tried to upgrade from 8.0.29 instance which was shut down after ALTER INSTANCE DISABLE INNODB REDO_LOG, but before  ALTER INSTANCE ENABLE INNODB REDO_LOG.
In this case it looks like `log_files_find_and_analyze` will execute the else branch of:
```
    if (format >= Log_format::VERSION_8_0_30) {
      ut_a(i == 0 || file_headers[i - 1].m_header.m_start_lsn <
                         file_headers[i].m_header.m_start_lsn);

      files.add(file_sizes[i].m_id, file_sizes[i].m_size_in_bytes,
                file_headers[i].m_header.m_start_lsn,
                log_file_header_check_flag(file_headers[i].m_header.m_log_flags,
                                           LOG_HEADER_FLAG_FILE_FULL),
                encryption_metadata);
    } else {
      files.add(file_sizes[i].m_id, file_sizes[i].m_size_in_bytes, 0, true,
                encryption_metadata);
    }
```
passing `true` as the `bool full` argument of constructor, which means the existing (8.0.29) files will have `m_full==true`.
This could lead to the symptom of the assert failure.

While we should perhaps fix our code to not trigger the assertion failure in this particular scenario, please note, that we do not support upgrading from 8.0.29 unless you did a clean shutdown.
I see that currently there's nothing preventing our users from performing a "slow shutdown" with redo log disabled:
```
mysql> set global innodb_fast_shutdown=0;
Query OK, 0 rows affected (0.00 sec)

mysql>  ALTER INSTANCE DISABLE INNODB REDO_LOG;
Query OK, 0 rows affected (0.01 sec)

mysql> shutdown;
Query OK, 0 rows affected (0.00 sec)
```
So perhaps we should additionally make it clear in our https://dev.mysql.com/doc/refman/8.0/en/upgrade-prerequisites.html that you also need to have redo log enabled?

Hi Mr.  Dvořák,

We are waiting on your feedback.

Assuming that the reported error has happened immediately on first run of 8.0.30 on the data-dir created by 8.0.29, then the redo log files would still have old names and old location typical for 8.0.29, therefore instead of looking into `#innodb_redo/` as I asked earlier, please inspect the data-dir itself (where the logs were stored in 8.0.29) and provide headers (first 512 bytes) of all redo log files (their names should look like ib_logfile0, ib_logfile1, ..).

1. Did you perform a clean (a.k.a slow) shutdown before upgrading?
I did apt upgrade, so my guess is that the upgrade script did a proper shutdown. I did not stop the server manually before the upgrade.

2. Was this error observed in first run of the 8.0.30 binary, or after
one of subsequent restarts?
My guess is the first run and all the others. The mysql service reported too many failures and did shutdown. 

3. Could you share the results of `ls -lh #innodb_redo`?
I can't see such directory.
4... Not relevant
In the archive.
6. I don't have the first one, but I got some after trying to solve it changed the config.
7. Was there a core dump file generated during the failure?
No

I did a backup of all the log files before upgrading here:
https://dvorak.digital:8101/f/6a1d85a2c7b645069179/?dl=1

Thank you for providing the files!
```
$ xxd -l 512 redo/ib_logfile0
00000000: 0000 0005 0000 0000 0000 0050 f328 4000  ...........P.(@.
00000010: 4d79 5351 4c20 382e 302e 3239 0000 0000  MySQL 8.0.29....
00000020: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000030: 0000 0001 0000 0000 0000 0000 0000 0000  ................
00000040: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000050: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000060: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000070: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000080: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000090: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000000f0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000100: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000110: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000120: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000130: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000140: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000150: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000160: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000170: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000180: 0000 0000 0000 0000 0000 0000 0000 0000  ................
00000190: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001a0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001b0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001c0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001d0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001e0: 0000 0000 0000 0000 0000 0000 0000 0000  ................
000001f0: 0000 0000 0000 0000 0000 0000 ba17 ccce  ................
```
In log0constants.h we see:
```
constexpr os_offset_t LOG_HEADER_CREATOR = 16;
constexpr size_t LOG_HEADER_CREATOR_MAX_LENGTH = 31;

constexpr os_offset_t LOG_HEADER_CREATOR_END =
    LOG_HEADER_CREATOR + LOG_HEADER_CREATOR_MAX_LENGTH + 1;
constexpr os_offset_t LOG_HEADER_FLAGS = LOG_HEADER_CREATOR_END;
```
So, flags are at offset 48 = 0x30, and have 32 bits, and we see:
```
00000030: 0000 0001
```
So, there's just one flag set: 1.
The flags are:
```
/** Flag at BIT-1 to indicate if redo logging is disabled or not. */
constexpr uint32_t LOG_HEADER_FLAG_NO_LOGGING = 1;

/** Flag at BIT-2 to indicate if server is not recoverable on crash. This
is set only when redo logging is disabled and unset on slow shutdown after
all pages are flushed to disk. */
constexpr uint32_t LOG_HEADER_FLAG_CRASH_UNSAFE = 2;

/** Flag at BIT-3 to indicate if server is not recoverable on crash because
data directory still has not been fully initialized. */
constexpr uint32_t LOG_HEADER_FLAG_NOT_INITIALIZED = 3;

/** Flag at BIT-4 to mark the redo log file as completely full and closed
for any future writes. */
constexpr uint32_t LOG_HEADER_FLAG_FILE_FULL = 4;
```

Thus, we can see that the redo logging was disabled, but the instance was considered crash safe.
This can happen when one executes ALTER INSTANCE DISABLE INNODB REDO_LOG and gives the server to write all dirty pages from buffer pool to disc so that there are no "unapplied" changes, and the redo log is logically empty.

I've unpacked your redo log files to a data-dir of mine 8.0.32-tr instance, and went step-by-step in debugger, and verified that my initial analysis was correct: the files get marked as full, the redo logs have LOG_HEADER_FLAG_NO_LOGGING  but no LOG_HEADER_FLAG_CRASH_UNSAFE, so we don't give up, instead we try to disable the logging, and because of the mismatch of LOG_HEADER_FLAG_FILE_FULL we crash on assert.

Out of curiosity, I've deleted the assert to see how far could we get without it.
Not very far. We'd crash on assert:
```
  ut_a(header.m_start_lsn >= LOG_START_LSN);
```
in
```
>	log_file_header_validate(const Log_file_header & header) Line 447	C++
 	log_file_header_serialize(const Log_file_header & header, unsigned char * buf) Line 460	C++
 	log_file_header_write(Log_file_handle & file_handle, const Log_file_header & header) Line 505	C++
 	log_files_rewrite_newest_header(log_t & log, std::function<void __cdecl(unsigned __int64,Log_file_header &)> update_callback) Line 1811	C++
 	log_files_persist_flags(log_t & log, unsigned int log_flags) Line 1815	C++
```
which doesn't hold because the field is set according to file.m_start_lsn which is set to zero in the call we've already seen:
```
      files.add(file_sizes[i].m_id, file_sizes[i].m_size_in_bytes, 0, true,
                encryption_metadata);
```
Basically, this files.add call tries to add some "fake" file entry with just enough info to make it possible to go forward.
Note that this particular assert originally had `|| header.m_start_lsn == 0` but was changed as effect of discussion in #comment274977 of review board, as it was not expected that we will ever (re)write headers in legacy format.

Out of curiosity, I tried to fix this, by filling the correct value of m_start_lsn.
According to log0constants.h:
```
/** LSN of the start of data in this log file (with format version 1 and 2). */
constexpr os_offset_t LOG_HEADER_START_LSN = 8;
```
is the offset at which the redo log file should store the start_lsn value, and it is 8 bytes long (lsns are 64-bit).
Looks like it is:
```
0000 0050 f328 4000 for redo/ib_logfile0  = 347676884992
0000 0050 f028 4800 for redo/ib_logfile1  = 347626555392
```
I've verified that file_headers[i].m_header.m_start_lsn indeed has above values, and modified the call:
```
diff --git a/storage/innobase/log/log0files_finder.cc b/storage/innobase/log/log0files_finder.cc
index 51595bb574f..1bed9fc1875 100644
--- a/storage/innobase/log/log0files_finder.cc
+++ b/storage/innobase/log/log0files_finder.cc
@@ -382,7 +382,8 @@ Log_files_find_result log_files_find_and_analyze(
                                            LOG_HEADER_FLAG_FILE_FULL),
                 encryption_metadata);
     } else {
-      files.add(file_sizes[i].m_id, file_sizes[i].m_size_in_bytes, 0, true,
+      files.add(file_sizes[i].m_id, file_sizes[i].m_size_in_bytes,
+                file_headers[i].m_header.m_start_lsn, true,
                 encryption_metadata);
     }
   }
```
After recompilation and restart I went much further this time, only to realize that I am rewriting a "wrong" file.
The LOG_HEADER_FLAG_NO_LOGGING flag was found in ib_logfile0 originally, but the function for disabling redo log tries to write to "latest" file.
Actually the ib_logfile0  *is* the latest in terms of LSNs, but the filename is smaller than ib_logfile1, so the code writes the updated flags to ib_logfile1.
So, we end up in situation where ib_logfile0 has LOG_HEADER_FLAG_NO_LOGGING , but ib_logfile1 has both LOG_HEADER_FLAG_NO_LOGGING and LOG_HEADER_FLAG_CRASH_UNSAFE.
This doesn't matter much anyway, as we are going to remove all these files and create new in new format.
The server seems to continue without any error printed.
I could connect to it and insert a record to my t1 table.
But when I've tried to issue `SHUTDOWN;` command it crashed on assert:
```
Assertion failure: log0log.cc:1466:log_file_header_check_flag(log_flags, LOG_HEADER_FLAG_NO_LOGGING)
 	log_persist_crash_safe::__l2::<lambda>(unsigned int & log_flags) Line 1466	C++
 	log_update_flags<void <lambda>(unsigned int &)>(log_t & log, log_persist_crash_safe::__l2::void <lambda>(unsigned int &) update_function) Line 1443	C++
 	log_persist_crash_safe(log_t & log) Line 1469	C++
>	buf_flush_page_coordinator_thread() Line 3568	C++
```
I think this happens, because the mtr system believed that the logging is disabled (as it was disabled during recovery when we've found LOG_HEADER_FLAG_NO_LOGGING flag), but the flags in the newly created redo log did not indicate this. Basically there's a desynch between mtr and logging systems: one believes logging is disabled, the other thinks it is not.

Hi to all,
Seems that we have similar one issue with version 8.0.30 and 8.0.32.

Many thanks in advance,
Aleksei Kurepin

Hi Mr. Dvorak,

Based on the conclusions of our colleagues, this is now a fully verified bug.