MySQL Bugs: #118367: The server may fail to start again if it exits abnormally during the upgrade

Bug #118367	The server may fail to start again if it exits abnormally during the upgrade
Submitted:	5 Jun 9:18	Modified:	6 Jun 10:03
Reporter:	Kaikai Ye
Status:	Verified
Category:	MySQL Server: Data Dictionary	Severity:	S1 (Critical)
Version:	8.0	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The server may exit abnormally during the upgrade process, for example because of a power outage on the host machine.

I found that in this scenario the server can no longer be restarted.

The problem is 100% reproducible with the steps below.

How to repeat:
1. Prepare a MySQL instance running a lower version, such as 8.0.22

2. Prepare a higher-version mysqld binary for the upgrade, such as the latest 8.0.42

3. Use a debugging tool such as gdb to start mysqld and upgrade to the higher version
    gdb --args mysqld --defaults-file=my.cnf --upgrade=FORCE --gdb

4. Interrupt the upgrade process after update_properties() and before update_versions()
    (gdb) b update_versions
    (gdb) r
    Thread 2 "boot" hit Breakpoint 1, dd::update_versions (thd=0xb2e26d0, is_dd_upgrade_57=false)

5. Re-run the upgrade command; the server aborts in both release and debug builds
    (gdb) r
    Thread 2 "boot" received signal SIGABRT, Aborted.
    [Switching to LWP 1922932]
    0x00007ffff749178b in raise () from /usr/lib64/libc.so.6

6. Here are the failing assertion and the crash stack
    assert(tmp_schema != nullptr && tmp_tspace != nullptr);
    (gdb) bt
    #0  0x00007ffff749178b in raise () from /usr/lib64/libc.so.6
    #1  0x00007ffff7492ab1 in abort () from /usr/lib64/libc.so.6
    #2  0x00007ffff748a04a in ?? () from /usr/lib64/libc.so.6
    #3  0x00007ffff748a0c2 in __assert_fail () from /usr/lib64/libc.so.6
    #4  0x0000000004835083 in dd::sync_meta_data (thd=0xb2e26d0) at /opt/workdir/ykk/mysql-community/sql/dd/impl/bootstrap/bootstrapper.cc:1486
    #5  0x000000000483196d in dd::bootstrap::restart (thd=0xb2e26d0) at /opt/workdir/ykk/mysql-community/sql/dd/impl/bootstrap/bootstrapper.cc:941
    #6  0x0000000004a3eebf in dd::upgrade_57::restart_dictionary (thd=0xb2e26d0) at /opt/workdir/ykk/mysql-community/sql/dd/upgrade_57/upgrade.cc:797
    #7  0x0000000004a3f667 in dd::upgrade_57::do_pre_checks_and_initialize_dd (thd=0xb2e26d0) at /opt/workdir/ykk/mysql-community/sql/dd/upgrade_57/upgrade.cc:964
    #8  0x00000000037c3000 in bootstrap::handle_bootstrap (arg=0x7fffffffca80) at /opt/workdir/ykk/mysql-community/sql/bootstrap.cc:328
    #9  0x0000000005626b7b in pfs_spawn_thread (arg=0xb2e1c00) at /opt/workdir/ykk/mysql-community/storage/perfschema/pfs.cc:3050
    #10 0x00007ffff7fa6f3b in ?? () from /usr/lib64/libpthread.so.0
    #11 0x00007ffff7550980 in clone () from /usr/lib64/libc.so.6

7. At this point the server cannot be started again, whether or not it is started as part of an upgrade

Suggested fix:
The data dictionary upgrade runs under transaction control in upgrade::upgrade_tables(), which includes the steps update_properties(), update_object_ids(), and update_versions().

In the above scenario, the properties have been updated and persisted, but the versions are still old. Because the upgrade was interrupted at this point, the server should first roll back the changes to the data dictionary when it is started again.
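
To make the inconsistent state concrete, here is a minimal toy model of the three steps. It is only a sketch: DictionaryState, StopAfter, and the version numbers are hypothetical illustrations, and none of this is MySQL source code. It only shows that stopping before the version step leaves new properties next to an old version.

```cpp
#include <cassert>

// Toy model only: hypothetical types, not MySQL source code.
// The three flags/fields mirror the three steps named in the report.
struct DictionaryState {
  bool properties_updated = false;
  bool object_ids_updated = false;
  int dd_version = 1;  // placeholder for the old version
};

// Where the simulated abnormal exit happens.
enum class StopAfter { kNone, kProperties, kObjectIds };

// Applies the steps in order and stops early to simulate an abnormal exit.
// Returns true only if all three steps completed.
bool upgrade_tables(DictionaryState &state, int target_version,
                    StopAfter stop) {
  state.properties_updated = true;  // stands in for update_properties()
  if (stop == StopAfter::kProperties) return false;
  state.object_ids_updated = true;  // stands in for update_object_ids()
  if (stop == StopAfter::kObjectIds) return false;
  state.dd_version = target_version;  // stands in for update_versions()
  return true;
}

// True when new properties are stored next to an old version.
bool is_inconsistent(const DictionaryState &state, int target_version) {
  return state.properties_updated && state.dd_version != target_version;
}
```

In this model, calling upgrade_tables(state, 2, StopAfter::kObjectIds) corresponds to the breakpoint on update_versions in step 4, and is_inconsistent(state, 2) then reports the mixed state.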

However, the rollback function DDSE_dict_recover() is executed after sync_meta_data(), which is where the abort occurs. sync_meta_data() reads the persisted objects from the DD tables and replaces the contents of the core registry in the storage adapter. It relies on reading dd_schema and dd_tablespace, which were updated in the previous upgrade attempt but are not available yet.

It is unacceptable that a server can no longer start. A possible fix is to do the data dictionary rollback first and then read the persisted DD tables. In my testing this is doable: the server was upgraded and started fine in this inconsistent scenario.
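
The ordering argument can be sketched with a small toy model. Everything here is an assumption made for illustration: PersistedDictionary, StartupOrder, and the simplified recover(), sync_meta_data(), and restart() functions are hypothetical stand-ins, not the real MySQL functions or their signatures. The sketch only shows why running the rollback before the read lets startup succeed.

```cpp
#include <cassert>

// Hypothetical toy model, not MySQL source code.
struct PersistedDictionary {
  // Left behind by an interrupted upgrade.
  bool has_uncommitted_upgrade = false;
  // Stands in for being able to read dd_schema and dd_tablespace.
  bool core_objects_readable = true;
};

// Stands in for the rollback step (DDSE_dict_recover() in the report).
void recover(PersistedDictionary &dd) {
  if (dd.has_uncommitted_upgrade) {
    dd.has_uncommitted_upgrade = false;
    dd.core_objects_readable = true;
  }
}

// Stands in for sync_meta_data(): fails when core objects cannot be read.
bool sync_meta_data(const PersistedDictionary &dd) {
  return dd.core_objects_readable;
}

enum class StartupOrder { kSyncThenRecover, kRecoverThenSync };

// Returns true if the simulated startup succeeds.
// Takes the dictionary by value so each call starts from the same state.
bool restart(PersistedDictionary dd, StartupOrder order) {
  if (order == StartupOrder::kRecoverThenSync) {
    recover(dd);  // suggested order: roll back first
    return sync_meta_data(dd);
  }
  // Reported order: the read happens before the rollback and fails.
  if (!sync_meta_data(dd)) return false;
  recover(dd);
  return true;
}
```

With an interrupted upgrade on disk, restart(dd, StartupOrder::kSyncThenRecover) fails in this model while restart(dd, StartupOrder::kRecoverThenSync) succeeds; a clean dictionary starts with either order.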

Hello Kaikai Ye,

Thank you for the report and feedback.

regards,
Umesh