Bug #110869 Debug build asserts in innodb.ddl_crash_alter_partition test
Submitted: 28 Apr 2023 15:48 Modified: 5 Oct 2023 22:44
Reporter: Dmitry Lenev (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S6 (Debug Builds)
Version:8.0.33 OS:Any
Assigned to: CPU Architecture:Any
Tags: regression

[28 Apr 2023 15:48] Dmitry Lenev
Description:
Debug build of MySQL Server 8.0.33 consistently fails on assertion in innodb.ddl_crash_alter_partition test.
This has not happened in 8.0.32 (at least consistently).

Here is the shortened stacktrace for this assertion failure:

mysql-server/sql/dd/impl/cache/dictionary_client.cc:940: bool dd::cache::Dictionary_client::acquire(const K&, const T**, bool*, bool*) [with K = dd::Item_name_key; T = dd::Abstract_table]: Assertion `MDL_checker::is_read_locked(m_thd, *object)' failed.

...

bool dd::cache::Dictionary_client::acquire<dd::Item_name_key, dd::Abstract_table>(dd::Item_name_key const&, dd::Abstract_table const**, bool*, bool*)
bool dd::cache::Dictionary_client::acquire<dd::Table>(std::__cxx11::basic_string<char, std::char_traits<char>, Stateless_allocator<char, dd::String_type_alloc, My_free_functor> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, Stateless_allocator<char, dd::String_type_alloc, My_free_functor> > const&, dd::Table const**)
dd_table_open_on_name(THD*, MDL_ticket**, char const*, bool, unsigned long, int*)
row_rename_table_for_mysql(char const*, char const*, dd::Table const*, trx_t*, bool)
Log_DDL::replay_rename_table_log(char const*, char const*)
Log_DDL::replay(DDL_Record&)
Log_DDL::replay_by_thread_id(unsigned long)
Log_DDL::post_ddl(THD*)mysql_rm_table_no_locks(THD*, Table_ref*, bool, bool, bool, bool*, std::set<handlerton*, std::less<handlerton*>, std::allocator<handlerton*> >*, Foreign_key_parents_invalidator*, std::vector<MDL_ticket*, std::allocator<MDL_ticket*> >*)
mysql_rm_table(THD*, Table_ref*, bool, bool)
mysql_execute_command(THD*, bool)

How to repeat:
Just run:

./mtr --mem innodb.ddl_crash_alter_partition

For debug build of the server.
[28 Apr 2023 15:49] Dmitry Lenev
Simplified test case for MTR framework which causes the same crash:

--source include/have_debug.inc
--source include/not_valgrind.inc
--source include/not_crashrep.inc
--source include/big_test.inc

let $table_name= t1;

let $create_statement= CREATE TABLE $table_name
(a INT NOT NULL AUTO_INCREMENT PRIMARY KEY, b VARCHAR(100), c INT NOT NULL, KEY(c)) ENGINE=InnoDB
PARTITION BY RANGE(a) PARTITIONS 2
 ( PARTITION p0 VALUES LESS THAN (10),
   PARTITION p1 VALUES LESS THAN (20));

let $insert_statement= INSERT INTO t1 VALUES (1, "Filler1", 1),
(2, "filler2", 2), (3, "FILLER3", 3), (14, "Filler4", 14), (15, "15", 15),
(16, "filler6", 16);

CREATE TABLE t2 (a INT NOT NULL AUTO_INCREMENT PRIMARY KEY, b VARCHAR(100), c INT NOT NULL, KEY(c));

let $test_free= 0;
let $test_delete= 0;
let $test_rename= 1;
let $test_drop= 0;
let $crash_statement= ALTER TABLE $table_name EXCHANGE PARTITION p1 WITH TABLE t2;
--source ../mysql-test/suite/innodb/include/ddl_crash_statement.inc
[28 Apr 2023 15:52] Dmitry Lenev
And here is some preliminary analysis for the root of the problem:

Investigation has shown that this assertion failure  seems to be "regression" from the following change in MySQL 8.0.33:

commit d22d33203944002f4d44a71b6c68508a10d80180
Author: Mohammad Tafzeel Shams <mohammad.tafzeel.shams@oracle.com>
Date:   Mon Feb 27 09:31:18 2023 +0100
    Bug #33700835 : Crash during background rollback if INSTANT DDL have
    committed on same table
    PROBLEM :
    - Server at restart resurrects the table lock on active transactions.
    - Then INSTANT DDL on this table closes the dict_table_t and remove it
      from dict_sys_t cache without taking X-LOCK.
    - However, dict_table_t is still used by background ib_tx_recov thread.
    - And server might encounter deadlock due to difference in lock
      ordering between INPLACE DDL & recovery (Bug #34478563).
    FIX :
    - To protect DDL/DML concurrency during recovery, before server starts
      accepting user connection ib_tx_recov thread will take Shared MDL on
      tables which need rollback.
    - And the spawning of ib_tx_recov is moved post DDL recovery.
    Change-Id: Icf53c2aa78c0b7d0b97529733e8453e5756c5ef2

This change moves asynchronous transaction rollback after DDL log processing/recovery.

So as far as I understand:

In 8.0.33 rollback of transactions happens in two steps: 1) BEFORE DDL log handling, synchronously in which only transactions for DDL are rolled back, 2) AFTER DDL log handling, asynchronously in which all transactions are rolled back.
In 8.0.32 this was done in two steps as well, but with at twist: 1) BEFORE DDL log handling, synchonously in which only transactions for DDL are rolled back. 2) BEFORE DDL log handling asynchronously in which all transactions are rolled back.

For normal atomic DDL it is supposed to work OK in both 8.0.33 and 8.0.32 since its transactions are marked as DDL/dict operations and rolled back during step 1).

However, ALTER TABLE EXCHANGE PARTITIONS statement which causes the problem  in this case seems not to mark its transaction as DDL/dict operation properly, hence the problem.
In 8.0.32 it worked, since rollback step 2) managed to rollback this transaction before DDL log handling, most of the time, and in 8.0.33 this problem have got exposed clearly.

The reason why ALTER TABLE EXCHANGE PARTITIONS doesn't mark its transaction in undo log as DDL/DD operation is that its undo log entry allocated earlier than we add items to DDL log or access DD during the calls responsible for dropping SDIs, which happen before we update DD or call appropriate DDL-related call (exchange_partition in this case) in InnoDB:

bool Sql_cmd_alter_table_exchange_partition::exchange_partition(
    THD *thd, Table_ref *table_list, Alter_info *alter_info) {

...

  if (dd::sdi::drop_all_for_table(thd, swap_table_def) ||
      dd::sdi::drop_all_for_table(thd, part_table_def))
{     return true;   }

  int ha_error = part_handler->exchange_partition(swap_part_id, part_table_def,
                                                  swap_table_def);

It needs to be figured out whether problem can be correctly solved by making SDI manipulation calls to mark transaction as DDL/DD operation and if it makes sense to add more asserts ensuring that any operations changing DDL log happen as part of transaction which undo log is marked correctly.
[29 Apr 2023 10:49] MySQL Verification Team
Hello Dmitry,

Thank you for the report and feedback.

Sincerely,
Umesh
[29 Apr 2023 11:08] MySQL Verification Team
As Dmitry rightly pointed, this seems to be a regression

 ./mtr --mem innodb.ddl_crash_alter_partition --debug-server
Logging: ./mtr  --mem innodb.ddl_crash_alter_partition --debug-server
MySQL Version 8.0.32
Checking supported features
 - Binaries are debug compiled
Using 'all' suites
Collecting tests
Checking leftover processes
Removing old var directory
Creating var directory '/export/home/tmp/ushastry/mysql-8.0.32/mysql-test/var'
 - symlinking 'var' to '/dev/shm/var_auto_mMhk'
Installing system database
Using parallel: 1

==============================================================================
                  TEST NAME                       RESULT  TIME (ms) COMMENT
------------------------------------------------------------------------------
[ 50%] innodb.ddl_crash_alter_partition          [ pass ]  548181
[100%] shutdown_report                           [ pass ]
------------------------------------------------------------------------------
The servers were restarted 0 times
The servers were reinitialized 0 times
Spent 548.181 of 575 seconds executing testcases

Completed: All 2 tests were successful.
[5 Oct 2023 22:44] Philip Olson
Posted by developer:
 
Fixed as of the upcoming MySQL Server 8.0.35 and 8.2.0 releases, and here's the proposed changelog entry from the documentation team:

Fixed a potential transaction rollback issue stemming from the 
ALTER TABLE EXCHANGE PARTITIONS statement.

Thank you for the bug report.
[17 Dec 2023 8:24] MySQL Verification Team
Bug #113418 marked as duplicate of this one