MySQL Bugs: #120646: Lock wait timeout with high MaxSendDelay

Bug #120646	Lock wait timeout with high MaxSendDelay
Submitted:	9 Jun 14:50
Reporter:	cundi fang
Status:	Open
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	9.3.0	OS:	Ubuntu (22.04)
Assigned to:		CPU Architecture:	Any
Tags:	lock-wait-timeout, maxsenddelay, metadata-show, ndb, NDB_MGM, timing
Description:
I observed a lock wait timeout in MySQL Cluster Community Server 9.3.0-cluster after increasing MaxSendDelay and running a metadata SHOW command between workload phases.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- Per cluster:
  - 1 management node (ndb_mgmd)
  - 4 data nodes (ndbmtd)
  - 4 SQL/API nodes (mysqld)

I first found this in a side-by-side A/B comparison of two clusters (a baseline side and a mutated side), but it reduces to one concrete NDB issue.

Configuration:
- baseline side:
  [ndbd default]
  MaxSendDelay=0
- mutated side:
  [ndbd default]
  MaxSendDelay=11000

What I did:
1. Started healthy clusters and verified both NDB and SQL health.
2. Created database depstatepp_bughunt and 4 NDB tables:
   - trial_case_000008_depstate_canary
   - trial_case_000008_0
   - trial_case_000008_1
   - trial_case_000008_2
3. Loaded the same initial data into the 3 main tables.
4. Ran a pre-action workload successfully on both sides.
5. Executed the same management metadata command on both sides:
   ndb_mgm -e SHOW
6. Immediately after that, ran a post-action workload using 4 concurrent SQL clients.

What I expected:
I expected the post-action workload either to complete on both sides or at least to fail symmetrically if the cluster was not ready.

What actually happened:
- The SHOW command succeeded on both sides.
- The baseline side completed the post-action workload successfully.
- The mutated side failed in 3 of 4 concurrent SQL clients with:
  ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Failing clients on the mutated side:
- sql_client_0 on depstate-b-ndb1
- sql_client_1 on depstate-b-ndb2
- sql_client_3 on depstate-b-ndb4

The only mutated-side client that completed successfully was:
- sql_client_2 on depstate-b-ndb3

The workload shape was:
- 4 concurrent SQL clients
- one canary UPSERT per client
- then UPDATE / COUNT-SUM style statements on NDB tables
- the failure occurs in the post-action phase immediately after the metadata SHOW command

Health status remained normal according to the bug ledger:
- before the failing phase: both sides healthy
- after the failing phase: both sides still healthy
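
For anyone who wants to make the "healthy" check concrete, a minimal sketch follows. It assumes the usual text layout of `ndb_mgm -e "SHOW"` output (one `id=N` line per node, `@host` for connected nodes, `not connected` or `starting` otherwise). The function name `unhealthy_nodes` is hypothetical and is not part of the harness used for this report.

```python
import re

# One line per node in SHOW output, e.g. "id=2  @10.0.0.2  (mysql-9.3.0 ndb-9.3.0, Nodegroup: 0, *)"
NODE_LINE = re.compile(r"^id=(\d+)\s*(.*)$")


def unhealthy_nodes(show_output: str) -> list:
    """Return the IDs of nodes that SHOW lists as not connected or still starting.

    Assumption: a connected node line continues with "@<host>", and a node
    that has not finished starting carries the word "starting".
    """
    bad = []
    for line in show_output.splitlines():
        match = NODE_LINE.match(line.strip())
        if not match:
            continue  # section headers, blank lines, banner text
        node_id, rest = int(match.group(1)), match.group(2)
        if not rest.startswith("@") or "starting" in rest:
            bad.append(node_id)
    return bad
```

An empty list before and after the failing phase would correspond to the "both sides healthy" observations above; the output text itself would come from running `ndb_mgm -e "SHOW"` on each side.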

So the visible symptom is:
With MaxSendDelay raised to 11000 and ndb_mgm -e SHOW issued between workload phases, the concurrent update/scan workload that follows immediately can hit an unexpected ERROR 1205 on the mutated side only. The same workload completes on the baseline side.

How to repeat:
This issue appears timing-sensitive, but the key trigger sequence is:
high MaxSendDelay + metadata SHOW + immediate concurrent post-action update/scan workload.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd
- 4 x mysqld

Configuration to test:
[ndbd default]
MaxSendDelay=11000

For comparison, the baseline configuration that completed successfully was:
[ndbd default]
MaxSendDelay=0

Steps:

1. Start the cluster and wait until all data nodes and SQL/API nodes are healthy:
   ndb_mgm -e "SHOW"
   mysql -udepstate -pdepstate -h 127.0.0.1 -e "SELECT 1"

2. Create the database and tables:

   CREATE DATABASE IF NOT EXISTS depstatepp_bughunt;
   USE depstatepp_bughunt;

   CREATE TABLE IF NOT EXISTS trial_case_000008_depstate_canary (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000008_0 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000008_1 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000008_2 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

3. Insert the same initial data into trial_case_000008_0, trial_case_000008_1, and trial_case_000008_2.
   Use the same case_id and seed for repeated runs.

4. Run a pre-action workload and confirm that it completes without errors.

5. Immediately after the pre-action workload, run on the management node:
   ndb_mgm -e "SHOW"

6. Immediately after step 5, start 4 concurrent SQL clients.
   In each client, run a post-action script shaped like:

   USE depstatepp_bughunt;

   INSERT INTO trial_case_000008_depstate_canary
   (case_id,table_idx,k,node_hint,payload,v)
   VALUES('case_000008',-1,<client_id>,<client_id>,'depstate_canary:case_000008:post_action:<client_id>',900000+<client_id>)
   ON DUPLICATE KEY UPDATE
     node_hint=VALUES(node_hint),
     payload=VALUES(payload),
     v=VALUES(v);

   UPDATE trial_case_000008_0
   SET v=v+7
   WHERE case_id='case_000008';

   SELECT COUNT(*), COALESCE(SUM(v),0)
   FROM trial_case_000008_0;

   UPDATE trial_case_000008_1
   SET v=v+7
   WHERE case_id='case_000008';

   SELECT COUNT(*), COALESCE(SUM(v),0)
   FROM trial_case_000008_1;

   UPDATE trial_case_000008_2
   SET v=v+7
   WHERE case_id='case_000008';

   SELECT COUNT(*), COALESCE(SUM(v),0)
   FROM trial_case_000008_2;

7. Check whether any of the concurrent clients fail with:
   ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
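
Steps 6 and 7 can be driven from a small script. The following is a minimal sketch, not the harness used for this report: it fills in `<client_id>` in the step 6 template and starts one `mysql` command-line client per SQL node at roughly the same time. The function names, the default port 3306, and the reuse of the `depstate` credentials from step 1 are assumptions.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

CASE_ID = "case_000008"
TABLES = ["trial_case_000008_0", "trial_case_000008_1", "trial_case_000008_2"]


def post_action_sql(client_id: int) -> str:
    """Build the step 6 script for one client, substituting <client_id>."""
    stmts = [
        "USE depstatepp_bughunt;",
        "INSERT INTO trial_case_000008_depstate_canary "
        "(case_id,table_idx,k,node_hint,payload,v) "
        f"VALUES('{CASE_ID}',-1,{client_id},{client_id},"
        f"'depstate_canary:{CASE_ID}:post_action:{client_id}',900000+{client_id}) "
        "ON DUPLICATE KEY UPDATE node_hint=VALUES(node_hint), "
        "payload=VALUES(payload), v=VALUES(v);",
    ]
    for table in TABLES:
        stmts.append(f"UPDATE {table} SET v=v+7 WHERE case_id='{CASE_ID}';")
        stmts.append(f"SELECT COUNT(*), COALESCE(SUM(v),0) FROM {table};")
    return "\n".join(stmts)


def run_client(client_id: int, host: str, port: int = 3306):
    """Feed one client's script to the mysql CLI on stdin.

    Returns (client_id, returncode, stderr). Port and credentials are
    assumptions; adjust them to the environment under test.
    """
    proc = subprocess.run(
        ["mysql", "-udepstate", "-pdepstate", "-h", host, "-P", str(port)],
        input=post_action_sql(client_id),
        capture_output=True,
        text=True,
    )
    return client_id, proc.returncode, proc.stderr


def run_all(hosts):
    """Start one client per SQL node concurrently and collect the results."""
    with ThreadPoolExecutor(max_workers=len(hosts)) as pool:
        futures = [pool.submit(run_client, i, h) for i, h in enumerate(hosts)]
        return [f.result() for f in futures]
```

On the mutated side this would be called as `run_all(["depstate-b-ndb1", "depstate-b-ndb2", "depstate-b-ndb3", "depstate-b-ndb4"])`, right after the `ndb_mgm -e "SHOW"` in step 5 returns. Because the `mysql` client stops at the first error in batch mode, a client that hits ERROR 1205 returns a non-zero exit code and the message on stderr.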

Observed failing run:
- mutated side (MaxSendDelay=11000):
  3 of 4 clients failed with ERROR 1205
- baseline side (MaxSendDelay=0):
  the same workload completed successfully
- the SHOW command had succeeded on both sides before the failure

The key thing to test is whether increasing MaxSendDelay makes the immediate post-SHOW concurrent update/scan workload hit lock wait timeout on NDB tables.
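
To turn that question into a pass/fail check, the per-client error output from both sides can be compared. This is a minimal sketch under the assumption that each client's stderr has been captured as text and keyed by client number; `failed_clients` and `is_asymmetric` are hypothetical helper names.

```python
# Prefix shared by the interactive and batch-mode forms of the message.
LOCK_WAIT = "ERROR 1205 (HY000)"


def failed_clients(stderr_by_client: dict) -> list:
    """Return the sorted client numbers whose stderr contains ERROR 1205."""
    return sorted(c for c, err in stderr_by_client.items() if LOCK_WAIT in err)


def is_asymmetric(baseline: dict, mutated: dict) -> bool:
    """True when the baseline side is clean but the mutated side hit ERROR 1205."""
    return not failed_clients(baseline) and bool(failed_clients(mutated))
```

For the run reported here, `failed_clients` on the mutated side would give `[0, 1, 3]` and `is_asymmetric` would be true, matching the 3 of 4 failures listed above.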

Suggested fix:
Please investigate whether increasing MaxSendDelay can expose unstable lock behavior for concurrent post-action update/scan workloads immediately after a successful ndb_mgm -e SHOW command.

If this is expected, it would help to document it more clearly. Otherwise, lock handling in this timing window may need investigation, because the same workload succeeds on the baseline configuration while the higher MaxSendDelay configuration hits ERROR 1205.