MySQL Bugs: #118319: Assertion failure `Ndb_metadata::compare(thd, thd_ndb->ndb, dbname, ndbtab, new_table

Bug #118319	Assertion failure `Ndb_metadata::compare(thd, thd_ndb->ndb, dbname, ndbtab, new_table_def)' on metadata check
Submitted:	30 May 21:28	Modified:	2 Jun 9:50
Reporter:	Congyu Liu (OCA)	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	8.4.2, 8.0.42	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Hi! We found a bug in MySQL NDB API node that triggers the assertion failure `virtual bool ha_ndbcluster::commit_inplace_alter_table(TABLE*, Alter_inplace_info*, bool, const dd::Table*, dd::Table*): Assertion `Ndb_metadata::compare(thd, thd_ndb->ndb, dbname, ndbtab, new_table_def)' failed`.

Here's the analysis:
When handling the index schema update event, the API node will invalidate the table cache. However, in some unfortunate circumstances, a stale table cache (which does not contain the latest index information) can be created immediately after invalidation, allowing it to escape being invalidated. This will trigger the assertion when an ALTER command is processed locally.

Below is a log snippet. I also uploaded the full logs.
```
# <------------ T@2 starts processing a schema update event for new index 'nkugn'
T@2: Ndb_schema_event_handler::handle_schema_op_post_epoch: enter: dstestdb.dstesttable: query: 'CREATE INDEX nkugn ON dstesttable (id, c1, c2, c3) USING BTREE'  type: 13
T@2: GlobalDictCache::invalidateTable: enter: internal_name: 'dstestdb/def/dstesttable'
T@2: GlobalDictCache::invalidateTable: exit: no cached versions
...
T@2: Ndb_binlog_client::event_exists_for_table: info: event_name_for_table: REPL$dstestdb/dstesttable
# <------------ pause T@2 after it tries to invalidate the table cache
# <------------ T@12 starts processing a SELECT command
...
T@12: open_table_from_share: enter: name: 'dstestdb.dstesttable'  form: 0x7ffeb40a17a0
T@12: get_new_handler: enter: alloc: 0x7ffeb40a2038
...
T@12: release_table_share: enter: share: 0x7ffeb4008520  table: dstestdb.dstesttable  ref_count: 1  version: 4
T@12: release_table_share: info: moving share to unused list
...
T@12: ndbcluster_global_schema_lock: exit: global_schema_lock_count: 1
# <------------ now a stale table cache is created, which does not include the index 'nkugn'
# <------------ pause T@12 to prevent it from getting GSL
# <------------ resume T@2 to complete the schema update
...
# <------------ T@13 starts processing an ALTER command
# <------------ T@13 uses the stale table cache
...
T@13: open_and_process_table: tcache: opening table: 'dstestdb'.'dstesttable'  item: 0x7ffeb8017b48
T@13: process_found_table_share: info: Unlinking from not used list
T@13: process_found_table_share: exit: share: 0x7ffeb4008520 ref_count: 1
T@13: process_found_table_share: info: share has index PRIMARY, address 0x7ffeb4097c38
# <------------ T@13 crashes
2025-05-30T21:01:21.189670Z 13 [ERROR] [MY-010865] [NDB] Metadata check has failed
2025-05-30T21:01:21.189718Z 13 [ERROR] [MY-010865] [NDB] The NDB Dictionary table definition is not identical to the DD table definition
2025-05-30T21:01:21.189750Z 13 [ERROR] [MY-010865] [NDB] Diff in 'index_name' detected, 'nkugn' != ''
mysqld: /mysql-server/storage/ndb/plugin/ha_ndbcluster.cc:16608: virtual bool ha_ndbcluster::commit_inplace_alter_table(TABLE*, Alter_inplace_info*, bool, const dd::Table*, dd::Table*): Assertion `Ndb_metadata::compare(thd, thd_ndb->ndb, dbname, ndbtab, new_table_def)' failed.
```

How to repeat:
Here's the cluster setup: 1 management node, 2 (replica) data nodes and 2 API node.

Here's the input (a shell script):

```shell
# /tmp/mysql1.sock -> api node 1
# /tmp/mysql2.sock -> api node 2
mysql --socket="/tmp/mysql2.sock" -u "root" --password="" -e "CREATE DATABASE dstestdb; USE dstestdb; CREATE TABLE dstesttable (id tinyint NOT NULL AUTO_INCREMENT, c1 CHAR(1), c2 CHAR(1), c3 CHAR(1), PRIMARY KEY (id)) ENGINE=NDB;"
sleep 1
mysql --socket="/tmp/mysql1.sock" -u "root" --password="" -e "SET GLOBAL debug = 'd,my_slow_down,index_stat,info,enter,exit,warning:i';"
mysql --socket="/tmp/mysql2.sock" -u "root" --password="" -e "USE dstestdb; SET GLOBAL debug = 'd:i'; CREATE INDEX nkugn ON dstesttable (id, c1, c2, c3) USING BTREE;" &
sleep 0.4 && mysql --socket="/tmp/mysql1.sock" -u "root" --password="" -e "USE dstestdb; SELECT dstesttable.c2,dstesttable.c3,dstesttable.id,dstesttable.c1 FROM dstesttable WHERE dstesttable.id<212757430 ORDER BY c2 DESC,c3 DESC,id DESC,c1 DESC for update;" &
sleep 0.8
mysql --socket="/tmp/mysql1.sock" -u "root" --password="" -e "SET GLOBAL debug = 'd:i'; USE dstestdb; ALTER TABLE dstesttable ADD COLUMN (kugncyayc int(11) NULL);"
# api node 1 crashes
```

This bug requires a certain event order. Here's the patch to enforce the ordering:
```diff
diff --git a/storage/ndb/plugin/ndb_binlog_client.cc b/storage/ndb/plugin/ndb_binlog_client.cc
index 4757c5cb9a8..ca3b2e039e0 100644
--- a/storage/ndb/plugin/ndb_binlog_client.cc
+++ b/storage/ndb/plugin/ndb_binlog_client.cc
@@ -156,6 +156,12 @@ std::string Ndb_binlog_client::event_name_for_table(const char *db,
   name.append(db).append("/").append(table_name);
 
   DBUG_PRINT("info", ("event_name_for_table: %s", name.c_str()));
+  DBUG_EXECUTE_IF("my_slow_down", {
+    if (strncmp(name.c_str(), "REPL$dstestdb/dstesttable", 25) == 0) {
+      DBUG_PRINT("info", ("blocking binlog client"));
+      my_sleep(time_t(1000000));  // sleep 1s
+    }
+  });
 
   return name;
 }
diff --git a/storage/ndb/plugin/ndb_global_schema_lock.cc b/storage/ndb/plugin/ndb_global_schema_lock.cc
index fd53cdf39a3..a4229ae1034 100644
--- a/storage/ndb/plugin/ndb_global_schema_lock.cc
+++ b/storage/ndb/plugin/ndb_global_schema_lock.cc
@@ -376,6 +376,11 @@ static int ndbcluster_global_schema_lock(THD *thd,
   */
   Thd_proc_info_guard proc_info(thd);
   proc_info.set("Waiting for ndbcluster global schema lock");
+  DBUG_EXECUTE_IF("my_slow_down", {
+    DBUG_PRINT("info", ("Prevent from getting GSL"));
+    my_sleep(time_t(10000000));  // sleep 10s
+    DBUG_SET("-d,my_slow_down");
+  });
   thd_ndb->global_schema_lock_trans =
       gsl_lock_ext(thd, ndb, ndb_error, true /* retry */, false /* no_wait */);
 

```

Log on crash:
```
T@13: log_write_errstream: enter: buffer: 2025-05-30T21:01:21.189670Z 13 [ERROR] [MY-010865] [NDB] Metadata check has failed
2025-05-30T21:01:21.189670Z 13 [ERROR] [MY-010865] [NDB] Metadata check has failed
T@13: log_write_errstream: enter: buffer: 2025-05-30T21:01:21.189718Z 13 [ERROR] [MY-010865] [NDB] The NDB Dictionary table definition is not identical to the DD table definition
2025-05-30T21:01:21.189718Z 13 [ERROR] [MY-010865] [NDB] The NDB Dictionary table definition is not identical to the DD table definition
T@13: log_write_errstream: enter: buffer: 2025-05-30T21:01:21.189750Z 13 [ERROR] [MY-010865] [NDB] Diff in 'index_name' detected, 'nkugn' != ''
2025-05-30T21:01:21.189750Z 13 [ERROR] [MY-010865] [NDB] Diff in 'index_name' detected, 'nkugn' != ''
mysqld: /mysql-server/storage/ndb/plugin/ha_ndbcluster.cc:16608: virtual bool ha_ndbcluster::commit_inplace_alter_table(TABLE*, Alter_inplace_info*, bool, const dd::Table*, dd::Table*): Assertion `Ndb_metadata::compare(thd, thd_ndb->ndb, dbname, ndbtab, new_table_def)' failed.
2025-05-30T21:01:21Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
BuildID[sha1]=25e752b31a3e669bb46a1d1ddcf6552b6aae1ea5
Thread pointer: 0x7ffeb8001050
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7fffd01649d8 thread_stack 0x100000
 #0 0x555559072a17 print_fatal_signal at /mysql-server/sql/signal_handler.cc:319
 #1 0x555559072ca6 _Z19handle_fatal_signaliP9siginfo_tPv at /mysql-server/sql/signal_handler.cc:399
 #2 0x7ffff744251f <unknown> at sysdeps/unix/sysv/linux/x86_64/libc_sigaction.c:0
 #3 0x7ffff74969fc __pthread_kill_implementation at ./nptl/pthread_kill.c:44
 #4 0x7ffff74969fc __pthread_kill_internal at ./nptl/pthread_kill.c:78
 #5 0x7ffff74969fc __GI___pthread_kill at ./nptl/pthread_kill.c:89
 #6 0x7ffff7442475 __GI_raise at sysdeps/posix/raise.c:26
 #7 0x7ffff74287f2 __GI_abort at ./stdlib/abort.c:79
 #8 0x7ffff742871a __assert_fail_base at ./assert/assert.c:94
 #9 0x7ffff7439e95 __GI___assert_fail at ./assert/assert.c:103
 #10 0x55555b073b4c _ZN13ha_ndbcluster26commit_inplace_alter_tableEP5TABLEP18Alter_inplace_infobPKN2dd5TableEPS5_ at /mysql-server/storage/ndb/plugin/ha_ndbcluster.cc:16608
 #11 0x55555924ee72 _ZN7handler29ha_commit_inplace_alter_tableEP5TABLEP18Alter_inplace_infobPKN2dd5TableEPS5_ at /mysql-server/sql/handler.cc:4998
 #12 0x555558ef0084 mysql_inplace_alter_table at /mysql-server/sql/sql_table.cc:13848
 #13 0x555558efc416 _Z17mysql_alter_tableP3THDPKcS2_P14HA_CREATE_INFOP9Table_refP10Alter_info at /mysql-server/sql/sql_table.cc:17680
 #14 0x5555595f3e89 _ZN19Sql_cmd_alter_table7executeEP3THD at /mysql-server/sql/sql_alter.cc:350
 #15 0x555558e0040c _Z21mysql_execute_commandP3THDb at /mysql-server/sql/sql_parse.cc:4737
 #16 0x555558e028db _Z20dispatch_sql_commandP3THDP12Parser_state at /mysql-server/sql/sql_parse.cc:5392
 #17 0x555558df8119 _Z16dispatch_commandP3THDPK8COM_DATA19enum_server_command at /mysql-server/sql/sql_parse.cc:2136
 #18 0x555558df5e7a _Z10do_commandP3THD at /mysql-server/sql/sql_parse.cc:1465
 #19 0x55555905ba14 handle_connection at /mysql-server/sql/conn_handler/connection_handler_per_thread.cc:304
 #20 0x55555b422c9a pfs_spawn_thread at /mysql-server/storage/perfschema/pfs.cc:3061
 #21 0x7ffff7494ac2 start_thread at ./nptl/pthread_create.c:442
 #22 0x7ffff752684f <unknown> at sysdeps/unix/sysv/linux/x86_64/clone3.S:81
 #23 0xffffffffffffffff <unknown>

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (7ffeb8016a60): ALTER TABLE dstesttable ADD COLUMN (kugncyayc int(11) NULL)
Connection ID (thread ID): 13
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
Information that should help you find out what is causing the crash.
```

Suggested fix:
If the stale cache indeed causes this bug, a simple way to fix it would be to check the freshness of the table cache before use.

Hello Congyu Liu,

Thank you for the report and feedback.
Verified as described.

regards,
Umesh