Bug #107959 copy ddl rollback cause some inconsistence
Submitted: 25 Jul 2022 9:25 Modified: 27 Jul 2022 12:59
Reporter: Haiping Zhang Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: DDL Severity:S2 (Serious)
Version:8.0 OS:Any
Assigned to: CPU Architecture:Any
Tags: ddl log

[25 Jul 2022 9:25] Haiping Zhang
Description:
If a copy ddl on test/t1 rollbacks just before deleteing the backup of the old table, and the dict cache of test/t1 has just been removed from memory, Log_DDL::replay_rename_table_log will fail and report some error like:

```
2022-07-25T08:56:01.483738Z 10 [ERROR] [MY-012237] [InnoDB] Corrupted page [page id: space=7, page number=0] of datafile './test/t1.ibd' could not be found in the doublewrite buffer.
2022-07-25T08:56:01.483754Z 10 [Warning] [MY-012192] [InnoDB] Failed to read the first page of the file './test/t1.ibd'. You will need to verify and move the file out of the way retry recovery.
```

This ddl will return error, but next sql on this table will cause crash:

```
2022-07-25T08:56:01.491464Z 10 [ERROR] [MY-013183] [InnoDB] Assertion failure: dict0dict.cc:1222:table2 == nullptr thread 140192165865216
```

How to repeat:
diff --git a/sql/sql_table.cc b/sql/sql_table.cc
index b60844a71ac..55ddd4f9207 100644
--- a/sql/sql_table.cc
+++ b/sql/sql_table.cc
@@ -17912,6 +17912,11 @@ bool mysql_alter_table(THD *thd, const char *new_db, const char *new_name,
     goto err_with_mdl; /* purecov: deadcode */
   }

+  DBUG_EXECUTE_IF("alter_table_force_rollback_before_rm_table", {
+    my_error(ER_UNKNOWN_ERROR, MYF(0));
+    goto err_with_mdl;
+  });
+
   // ALTER TABLE succeeded, delete the backup of the old table.
   if (quick_rm_table(thd, old_db_type, alter_ctx.db, backup_name,
                      FN_IS_TMP | (atomic_replace ? NO_DD_COMMIT : 0))) {
diff --git a/storage/innobase/dict/dict0dict.cc b/storage/innobase/dict/dict0dict.cc
index d8c8f6c6c15..cc2634359f4 100644
--- a/storage/innobase/dict/dict0dict.cc
+++ b/storage/innobase/dict/dict0dict.cc
@@ -1329,7 +1329,8 @@ ulint dict_make_room_in_cache(

   i = len = UT_LIST_GET_LEN(dict_sys->table_LRU);

-  if (len < max_tables) {
+  if (len < max_tables &&
+      DBUG_EVALUATE_IF("ib_force_evict_target_table", false, true)) {
     return (0);
   }

@@ -1340,9 +1341,19 @@ ulint dict_make_room_in_cache(

   /* Find a suitable candidate to evict from the cache. Don't scan the
   entire LRU list. Only scan pct_check list entries. */
+  auto should_check_table = [&](const dict_table_t *table) -> bool {
+    if (DBUG_EVALUATE_IF("ib_force_evict_target_table", true, false)) {
+       if (!table) return false;
+       if (strcmp(table->name.m_name, "test/t1") == 0) {
+         ib::warn() << "force to evict table cache:test/t1";
+        return true;
+       }
+    }
+    return table && i > check_up_to && (len - n_evicted) > max_tables;
+  };

   for (table = UT_LIST_GET_LAST(dict_sys->table_LRU);
-       table != nullptr && i > check_up_to && (len - n_evicted) > max_tables;
+       should_check_table(table);
        --i) {
     dict_table_t *prev_table;

diff --git a/storage/innobase/log/log0ddl.cc b/storage/innobase/log/log0ddl.cc
index c5dd637c547..18cfe558db7 100644
--- a/storage/innobase/log/log0ddl.cc
+++ b/storage/innobase/log/log0ddl.cc
@@ -1822,6 +1822,8 @@ void Log_DDL::replay_rename_table_log(const char *old_name,
   trx->mysql_thd = current_thd;
   trx_start_if_not_started(trx, true, UT_LOCATION_HERE);

+  DEBUG_SYNC(trx->mysql_thd, "replay_rename_table_before_table_open");
+
   row_mysql_lock_data_dictionary(trx, UT_LOCATION_HERE);
   trx_set_dict_operation(trx, TRX_DICT_OP_TABLE);

MTR:

```
--source include/have_debug.inc
connect (con1,localhost,root,,);

create table test.t1(a int key);
set global debug = "+d,alter_table_force_rollback_before_rm_table";
set debug_sync = "replay_rename_table_before_table_open wait_for go";
--connection con1
--send alter table test.t1 add column b int, algorithm=copy
--sleep 3

--connection default
set global debug = "+d,ib_force_evict_target_table";
--sleep 5
set debug_sync = "now signal go";

--connection con1
--error ER_UNKNOWN_ERROR
--reap

set global debug = "-d,alter_table_force_rollback_before_rm_table";
set global debug = "-d,ib_force_evict_target_table";
select * from test.t1;
drop table test.t1;

```

Suggested fix:
May be we shouldn't rename tablespace when replaying Log_Type::RENAME_TABLE_LOG,
for that this action will be done by Log_Type::RENAME_SPACE_LOG
[25 Jul 2022 12:15] MySQL Verification Team
Hi Mr. Zhang,

Thank you for your bug report.

We can see that you are using latest available 8.0 release for your analysis.

We have inspected carefully your analysis and your patches and we find that they are essentially correct, although some changes might be necessary.

Verified as reported.

Thank you for your contribution.
[25 Jul 2022 12:41] Haiping Zhang
Revert the [Suggested fix]

1. What we should do is to skip the evicted table in row_rename_table_for_mysql, just like 

```
diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc
index 6e430c26a6f..4e377281108 100644
--- a/storage/innobase/row/row0mysql.cc
+++ b/storage/innobase/row/row0mysql.cc
@@ -4182,6 +4182,19 @@ dberr_t row_rename_table_for_mysql(const char *old_name, const char *new_name,
     goto funct_exit;
   }

+  /* A special case during replay ddl log where we try to open the
+  table (t1) but get the temp table cache element (#sql2-xxx). */
+  if (replay && table && !table->is_intrinsic() &&
+      strcmp(old_name, table->name.m_name) != 0) {
+    ut_a(!old_is_tmp && new_is_tmp &&
+        row_is_mysql_tmp_table_name(table->name.m_name));
+    ib::error() << "Replaying ddl log fetchs wrong cache element. "
+             << "target name:" << old_name << ", fetched name:"
+              << table->name.m_name;
+    err = DB_TABLE_NOT_FOUND;
+    goto funct_exit;
+  }
+
```

2. dd_table_open_on_dd_obj return directly once it find a dict_table_t with same
table id in table_id_hash. This is somewhat not rigorous.

3. It's also confusing why we should rename tablespace during applying Log_Type::RENAME_TABLE_LOG, since Log_Type::RENAME_SPACE_LOG does the same thing.

Thanks very much
[27 Jul 2022 12:59] Haiping Zhang
if dict table of #sql2-xxx is also evicted, other errors will happen.

I think we can't load info from dd storage when ddl rollback, just skip if dict cache is not in memory.
[27 Jul 2022 13:02] MySQL Verification Team
Hi,

Thank you for the additional info .....
[2 Aug 2024 6:52] zongyi chen
it happend to me too.it will lead to crash and destroy atomicity ddl. so should it be fixed soon?
[2 Aug 2024 9:35] MySQL Verification Team
Hi Mr. chen,

We truly do not know.

Each Development team has schedules of it's own, which are changed on the weekly basis.

Hence, nobody can truly answer your question.