MySQL Bugs: #120644: Lock wait timeout after SHOW with low GCP interval

Bug #120644	Lock wait timeout after SHOW with low GCP interval
Submitted:	9 Jun 14:48
Reporter:	cundi fang	Email Updates:
Status:	Open	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	9.3.0	OS:	Ubuntu (22.04)
Assigned to:		CPU Architecture:	Any
Tags:	checkpoint, lock-wait-timeout, metadata-show, ndb, NDB_MGM, TimeBetweenGlobalCheckpoints
Description:
I observed a lock wait timeout in MySQL Cluster Community Server 9.3.0-cluster after reducing TimeBetweenGlobalCheckpoints and running a metadata SHOW command between two workload phases.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- Per cluster:
  - 1 management node (ndb_mgmd)
  - 4 data nodes (ndbmtd)
  - 4 SQL/API nodes (mysqld)

I first found this in a side-by-side A/B comparison, but the problem is specific enough to report as one NDB issue.

Configuration:
- baseline side:
  [ndbd default]
  TimeBetweenGlobalCheckpoints=2000
- mutated side:
  [ndbd default]
  TimeBetweenGlobalCheckpoints=20

What I did:
1. Started healthy clusters and verified both NDB and SQL health.
2. Created database depstatepp_bughunt and 4 NDB tables:
   - trial_case_000000_depstate_canary
   - trial_case_000000_0
   - trial_case_000000_1
   - trial_case_000000_2
3. Loaded the same initial data into the 3 main tables (32 rows per table).
4. Ran a pre-action workload successfully on both sides.
5. Executed the same management metadata command on both sides:
   ndb_mgm -e SHOW
6. Immediately after that, ran a post-action workload using 4 concurrent SQL clients.

What I expected:
I expected the post-action workload either to succeed on both sides or at least to fail symmetrically if the cluster was not ready.

What actually happened:
- The metadata SHOW command succeeded on both sides and showed all 4 data nodes plus all 4 API nodes connected.
- The baseline side completed the post-action workload successfully.
- The mutated side failed in 3 of 4 concurrent SQL clients with:
  ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Failing clients on the mutated side:
- sql_client_0 on depstate-b-ndb1:
  ERROR 1205 at line 5
- sql_client_1 on depstate-b-ndb2:
  ERROR 1205 at line 4
- sql_client_3 on depstate-b-ndb4:
  ERROR 1205 at line 4

The only mutated-side client that completed successfully was:
- sql_client_2 on depstate-b-ndb3

The workload shape was:
- 4 concurrent SQL clients
- each client executes a canary UPSERT
- then CREATE TABLE IF NOT EXISTS / UPDATE / COUNT-SUM style statements on the 3 NDB tables
- the failure occurs in the post-action phase, immediately after the metadata SHOW command

Health status remained good:
- before the failing phase: both sides healthy
- after the failing phase: both sides still healthy

So the visible symptom is:
after reducing TimeBetweenGlobalCheckpoints aggressively and issuing ndb_mgm -e SHOW between workload phases, the next concurrent post-action workload can hit unexpected ERROR 1205 on the mutated side only, even though cluster health remains normal and the same workload completes on the baseline side.

How to repeat:
This issue appears timing-sensitive, but the key trigger sequence is:
low TimeBetweenGlobalCheckpoints + metadata SHOW + immediate concurrent post-action UPDATE workload.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd
- 4 x mysqld

Configuration to test:
[ndbd default]
TimeBetweenGlobalCheckpoints=20

For comparison, the baseline configuration that completed successfully was:
[ndbd default]
TimeBetweenGlobalCheckpoints=2000

Steps:

1. Start the cluster and wait until all data nodes and SQL/API nodes are healthy:
   ndb_mgm -e "SHOW"
   mysql -udepstate -pdepstate -h 127.0.0.1 -e "SELECT 1"

2. Create the database and tables:

   CREATE DATABASE IF NOT EXISTS depstatepp_bughunt;
   USE depstatepp_bughunt;

   CREATE TABLE IF NOT EXISTS trial_case_000000_depstate_canary (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000000_0 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000000_1 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000000_2 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

3. Insert initial data:
- 32 rows into each of trial_case_000000_0 / 1 / 2
- use case_id='case_000000'
- keep the same seed for repeated runs

4. Run a pre-action workload successfully.

5. Immediately after the pre-action workload, run on the management node:
   ndb_mgm -e "SHOW"

   In the captured failing run, this command succeeded and showed:
   - 4 data nodes connected
   - 4 mysqld(API) nodes connected

6. Immediately after step 5, start 4 concurrent SQL clients.
   In each client, run a post-action script shaped like:

   USE depstatepp_bughunt;

   INSERT INTO trial_case_000000_depstate_canary
   (case_id,table_idx,k,node_hint,payload,v)
   VALUES('case_000000',-1,<client_id>,<client_id>,'depstate_canary:case_000000:post_action:<client_id>',900000+<client_id>)
   ON DUPLICATE KEY UPDATE
     node_hint=VALUES(node_hint),
     payload=VALUES(payload),
     v=VALUES(v);

   CREATE TABLE IF NOT EXISTS trial_case_000000_0 (...) ENGINE=NDBCLUSTER;
   UPDATE trial_case_000000_0 SET v=v+7 WHERE case_id='case_000000';
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000000_0;

   CREATE TABLE IF NOT EXISTS trial_case_000000_1 (...) ENGINE=NDBCLUSTER;
   UPDATE trial_case_000000_1 SET v=v+7 WHERE case_id='case_000000';
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000000_1;

   CREATE TABLE IF NOT EXISTS trial_case_000000_2 (...) ENGINE=NDBCLUSTER;
   UPDATE trial_case_000000_2 SET v=v+7 WHERE case_id='case_000000';
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000000_2;

7. Observe whether some of the concurrent clients fail with:
   ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Observed failing run:
- mutated side (TimeBetweenGlobalCheckpoints=20):
  3 of 4 clients failed with ERROR 1205
- baseline side (TimeBetweenGlobalCheckpoints=2000):
  same workload completed successfully
- cluster health remained normal before and after the failure

The key thing to test is whether aggressively lowering TimeBetweenGlobalCheckpoints makes the immediate post-SHOW concurrent update workload hit lock wait timeout on NDB tables.

Suggested fix:
Please investigate whether aggressively lowering TimeBetweenGlobalCheckpoints can expose unstable lock behavior for concurrent post-action UPDATE workloads immediately after a successful ndb_mgm -e SHOW metadata command.

If this is an expected limitation, it would help to document it more clearly. Otherwise, lock handling in this timing window may need investigation, because the same workload completes on the baseline configuration while the mutated configuration hits ERROR 1205.