MySQL Bugs: #118168: MySQL data dictionary fails to upgrade at startup with lock timeout on ALTER TABLE

Bug #118168	MySQL data dictionary fails to upgrade at startup with lock timeout on ALTER TABLE
Submitted:	12 May 12:27	Modified:	13 May 11:52
Reporter:	Justin Swanhart	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	8.0.41	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
2025-05-12T12:15:39.980381Z 1 [System] [MY-013576] [InnoDB] InnoDB initialization has started.
2025-05-12T12:15:40.735903Z 1 [System] [MY-013577] [InnoDB] InnoDB initialization has ended.
2025-05-12T12:15:41.269157Z 4 [System] [MY-013381] [Server] Server upgrade from '80039' to '80041' started.
2025-05-12T12:16:31.923113Z 4 [ERROR] [MY-013178] [Server] Execution of server-side SQL statement 'ALTER TABLE slave_worker_info STATS_PERSISTENT=0; ' failed with error code = 1205, error message = 'Lock wait timeout exceeded; try restarting transaction'.
2025-05-12T12:16:31.924793Z 0 [ERROR] [MY-013380] [Server] Failed to upgrade server.
2025-05-12T12:16:31.924814Z 0 [ERROR] [MY-010119] [Server] Aborting

How to repeat:
Restore a backup from 8.0.39
Try to start the database on 8.0.41
Upgrade fails and database is left in unusable state (sev 1 problem)

Suggested fix:
unsure

I restored the next full backup and the server upgraded successfully so this problem won't hit every upgrade.  I guess I just got unlucky.  Still this is a serious problem for some upgrades but it isn't easily reproducible.

Hi,

How did you "restore backup" ?

Thanks.

We use Several Nines Cluster Control for backups which executes Xtrabackup to backup and restore the database.

The backup stream was restored to the database server via the normal process and the database started normally (with rollback crash recovery) and InnoDB initialized successfully before the DD upgrade started (and failed).

I tried to start the database twice.  On the second attempt there was no crash recovery, the upgrade simply started and failed with the same error message.

Hi,

I am not able to reproduce this using supported tools.

If I dump 8.0.39 and load into 8.0.41 it works as expected.

If I shutdown 8.0.39, copy datadir to 8.0.41 and start 8.0.41 it works as expected.

If I use MySQL Enterprise Backup to backup 8.0.39 and restore on 8.0.41 it works as expected.

I do not have xtrabackup nor we support xtrabackup so I cannot test it.

What I see from

> 2025-05-12T12:16:31.923113Z 4 [ERROR] [MY-013178] [Server] Execution of server-side SQL statement 'ALTER TABLE slave_worker_info STATS_PERSISTENT=0; ' failed with error code = 1205, error message = 'Lock wait timeout exceeded; try restarting transaction'.

Looks like xtrabackup stored the datadir and placed it back to 8.0.41. It is possible that when backup was made MySQL server was not in stable state, and s you know in order to upgrade datadir it should be "clean". MySQL does not do crash recovery of the datadir that is not made by same version server.

Can you try to
1. use 8.0.39 with proper status, check that slave_worker_info table is ok
2. shutdown 8.0.39 gracefully
3. copy this datadir to 8.0.41 and start it and let it upgrade
4. report if the upgrade was ok or not

I assume xtrabackup did a flush, read lock and then copied datadir, but datadir was not cleanly closed so you need to do a recovery, MySQL will not do recovery from wrong version. But there might be an error with upgrade too, but this I cannot reproduce.

Since mysql.slave_worker_info is our table and is same in my test and upgrades without a problem I suggest problem is with the datadir and xtrabackup.

Thanks

You don't understand how xtrabackup works, which I understand as it is an "unsupported tool" but at the end of the day it produces a data directory that is consistent with the end of the backup.

I chose to only do forward recovery (as I was recovering from an incremental backup) so partial crash recovery was necessary.

This is the exact same thing as having started a database on a crashed datadir.  Recovery would happen.  This is just common sense.  Regardless, no recovery was necessary on starting the database on a clean state the second time.

The problem has NOTHING to do with crash recovery.  The problem is in an ALTER that is being done during startup.  I suspect that if I started with --skip-slave-start that the problem might go away as something has locked the table and the ALTER can't acquire a lock.

As I noted, a new backup didn't have this same problem so it is not easy to reproduce.  It certainly would not be produced by a dump/restore as it requires some state for the slave table.

I don't think this should be "can't repeat" as code inspection of the upgrade process is probably required to figure out why the table was locked. 

Should be changed to 'analyzing' at least.

Hi,

> it produces a data directory that is consistent with the end of the backup

If it was "consistent" you would not have to do the recovery.

> having started a database on a crashed datadir

MySQL is not always able to upgrade crashed datadir. We even ask you to shutdown properly with innodb_fast_shutdown not set to 2. It is not always the case but it is a proper procedure.

Can you reproduce the problem by loading this datadir to 8.0.39?

> "can't repeat"

I did try your scenario for reproduction numerous times with different configurations and different ways of backup/restore and I am unable to reproduce the problem. So the proper state is "can't repeat", this does not mean I will stop working on the report as long as I'm getting feedback from you. "can't repeat" is not closed/abandoned.

The difference between my setup and your is that I do not have xtrabackup and I have some random databases loaded as well as our my.cnf is probably different.

 - the "database structure" should be irrelevant as you are stuck on alter of table that is identical on all setups
 - that leaves state of the datadir that xtrabackup deployed that is difference

What I will try is to use your identical my.cnf on 8.0.39 and 8.0.41 to see if maybe there is something that will help me reproduce.

Also, I assumed the server in question is a slave so I tested it as a slave and as stand alone, please let me know what is your master and what is your master configuration so that I can completely mirror your setup.

Thanks

this ALTER TABLE slave_worker_info STATS_PERSISTENT=0; seems to come from:
scripts/mysql_system_tables_fix.sql
(same for ALTER TABLE slave_master_info STATS_PERSISTENT=0;) which gets compiled into `const char* mysql_fix_privilege_tables[]` in mysql-bin/scripts/mysql_fix_privilege_tables_sql.h:
```
# Build mysql_fix_privilege_tables.h
ADD_CUSTOM_COMMAND(
  OUTPUT ${CMAKE_CURRENT_BINARY_DIR}/mysql_fix_privilege_tables_sql.h
  ${CAT_COMMAND}
  COMMAND comp_sql
  mysql_fix_privilege_tables
  mysql_fix_privilege_tables.sql
  mysql_fix_privilege_tables_sql.h
  WORKING_DIRECTORY ${CMAKE_CURRENT_BINARY_DIR}
  DEPENDS comp_sql
  ${CMAKE_CURRENT_SOURCE_DIR}/mysql_system_tables.sql
  ${CMAKE_CURRENT_SOURCE_DIR}/mysql_system_tables_fix.sql
)
```
which is used in:
```
bool fix_mysql_tables(THD *thd) {
...
  for (query_ptr = &mysql_fix_privilege_tables[0]; *query_ptr != nullptr;
       query_ptr++)
    if (ignore_error_and_execute(thd, *query_ptr)) return true;
...
}
```
Despite its name it has nothing to do with "privilege".
These two ALTER statements were there at least since 2012-07-04. Presumably every 8.0+ already has it set properly, yet the ALTER rots there, as it is "idempotent" so usually doesn't hurt.
Despite its name ignore_error_and_execute(..) doesn't ignore the error, it returns true in case of error, so does this function and `upgrade_system_schemas(..)` which called it fails.
I believe this function gets called if you do upgrade or patch-downgrade, because of:
```
bool no_server_upgrade_required() {
  return !(
      dd::bootstrap::DD_bootstrap_ctx::instance().is_server_upgrade() ||
      bootstrap::DD_bootstrap_ctx::instance().is_server_patch_downgrade() ||
      opt_upgrade_mode == UPGRADE_FORCE);
}
```
being used in init_server_components():
```
  if (!is_help_or_validate_option() && !opt_initialize &&
      !dd::upgrade::no_server_upgrade_required()) {
    if (opt_upgrade_mode == UPGRADE_MINIMAL)
      LogErr(WARNING_LEVEL, ER_SERVER_UPGRADE_SKIP);
    else {
      init_optimizer_cost_module(true);
      if (bootstrap::run_bootstrap_thread(nullptr, nullptr,
                                          &dd::upgrade::upgrade_system_schemas,
                                          SYSTEM_THREAD_SERVER_UPGRADE)) {
```
and:
```
bool fix_mysql_tables(THD *thd) {
  /* Keep system tables as is for LTS downgrade. */
  if (bootstrap::DD_bootstrap_ctx::instance().is_server_patch_downgrade())
    return false;
```

Of course ALTER TABLE requires various locks.
And one of the resurrected transactions contained locks on this table, presumably the resurrected IX locks needed to protect it from being altered or dropped, until the resurrected transaction commits or rolls-back.
You should be able to spot info in error log like:
MySQL error code MY-014018 (ER_IB_RESURRECT_IDENTIFY_TABLE_TO_LOCK): 
Identified table ID: %lu to acquire lock
MySQL error code MY-014019 (ER_IB_RESURRECT_ACQUIRE_TABLE_LOCK): 
Acquired lock on table ID: %lu, name: %s

I think the way to solve this bug is to either
a) get rid of outdated ALTER statements from this .sql file
b) execute them only if the shutdown was clean, or at least there are no transactions to resurrect
c) or, handle timeout gracefully

this is easy to repeat without xtrabackup. setup a gtid replica and apply a simple insert workload on the source server.  then kill -9 the replica mysqld and upgrade it.

-- 
Shane, MySQL Senior Principal Technical Support Engineer
Oracle Corporation
http://dev.mysql.com/

While not directly related, associating https://bugs.mysql.com/bug.php?id=114821 where foreground configuration/workload can affect the servers ability to successfully upgrade.