MySQL Bugs: #120645: Lock wait timeout with larger EventLogBufferSize

Bug #120645	Lock wait timeout with larger EventLogBufferSize
Submitted:	9 Jun 14:49
Reporter:	cundi fang
Status:	Open
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	9.3.0	OS:	Ubuntu (22.04)
Assigned to:		CPU Architecture:	Any
Tags:	DDL, eventlogbuffersize, lock-wait-timeout, metadata, ndb, NDB_MGM
Description:
I observed a lock wait timeout in MySQL Cluster Community Server 9.3.0-cluster after increasing EventLogBufferSize from 8192 to 32768 and running ndb_mgm -e SHOW between workload phases.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- Per cluster:
  - 1 management node (ndb_mgmd)
  - 4 data nodes (ndbmtd)
  - 4 SQL/API nodes (mysqld)

I first found this in a side-by-side A/B comparison, but the problem can be stated as one concrete NDB issue.

Configuration:
- baseline side:
  [ndbd default]
  EventLogBufferSize=8192
- mutated side:
  [ndbd default]
  EventLogBufferSize=32768
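
To confirm which value each side actually carries, the setting can be read back from the cluster configuration file. This is a minimal sketch, assuming an INI-style config.ini as read by ndb_mgmd; the function name event_log_buffer_size and the file path passed to it are hypothetical and were not part of the captured runs.

```shell
# Print the EventLogBufferSize set in the [ndbd default] section of a
# cluster config file. Prints nothing if the parameter is not set there.
# Assumption: INI-style file; the caller supplies the path.
event_log_buffer_size() {
  awk '
    /^[[:space:]]*\[/ {
      # Remember the current section header, lowercased and without spaces.
      section = tolower($0)
      gsub(/[[:space:]]/, "", section)
      next
    }
    section == "[ndbddefault]" {
      line = $0
      sub(/#.*$/, "", line)               # drop trailing comments
      n = split(line, kv, "=")
      key = kv[1]
      gsub(/[[:space:]]/, "", key)
      if (n >= 2 && tolower(key) == "eventlogbuffersize") {
        val = kv[2]
        gsub(/[[:space:]]/, "", val)
        print val
      }
    }
  ' "$1"
}
```

Example use: event_log_buffer_size /path/to/config.ini should print 8192 on the baseline side and 32768 on the mutated side.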

What I did:
1. Started healthy clusters and verified both NDB and SQL health.
2. Created database depstatepp_bughunt and 4 NDB tables:
   - trial_case_000004_depstate_canary
   - trial_case_000004_0
   - trial_case_000004_1
   - trial_case_000004_2
3. Loaded the same initial data into the 3 main tables (32 rows per table).
4. Ran a pre-action workload successfully on both sides.
5. Ran a read-only mid-action phase (metadata and table contents), which succeeded on both sides.
6. Executed the same management metadata command on both sides:
   ndb_mgm -e SHOW
7. Immediately after that, ran a post-action workload using 4 concurrent SQL clients.

What I expected:
I expected the post-action workload either to complete on both sides or at least to fail symmetrically if the cluster was not ready.

What actually happened:
- The SHOW command succeeded on both sides and reported all 4 data nodes and all 4 API nodes connected.
- The baseline side completed the post-action workload successfully.
- The mutated side failed in all 4 concurrent SQL clients with:
  ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction

Failing clients on the mutated side:
- sql_client_0 on depstate-b-ndb1:
  ERROR 1205 at line 5
- sql_client_1 on depstate-b-ndb2:
  ERROR 1205 at line 4
- sql_client_2 on depstate-b-ndb3:
  ERROR 1205 at line 4
- sql_client_3 on depstate-b-ndb4:
  ERROR 1205 at line 4

The workload shape was:
- 4 concurrent SQL clients
- one canary UPSERT per client
- CREATE TABLE IF NOT EXISTS on the same NDB tables
- then UPDATE ... SET v=v+7 on the same case_id rows
- followed by COUNT/SUM reads

Health status remained normal during the captured run according to the main bug ledger:
- before the failing phase: both sides healthy
- after the failing phase: both sides still operational

So the visible symptom is:
with EventLogBufferSize raised from 8192 to 32768, issuing ndb_mgm -e SHOW between workload phases can be followed by unexpected ERROR 1205 in the immediate metadata/DDL-light post-action workload on the mutated side only. The same workload completes on the baseline side.

How to repeat:
This issue appears timing-sensitive, but the key trigger sequence is:
larger EventLogBufferSize + metadata SHOW + immediate concurrent metadata/DDL-light post workload.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd
- 4 x mysqld

Configuration to test:
[ndbd default]
EventLogBufferSize=32768

For comparison, the baseline configuration that completed successfully was:
[ndbd default]
EventLogBufferSize=8192

Steps:

1. Start the cluster and wait until all data nodes and SQL/API nodes are healthy:
   ndb_mgm -e "SHOW"
   mysql -udepstate -pdepstate -h 127.0.0.1 -e "SELECT 1"
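
The health wait in step 1 can be scripted by polling the SHOW output. This is a rough sketch, assuming that matching the substrings "not connected" and "starting" in the ndb_mgm SHOW text is a good enough readiness signal; the helper names, the attempt count, and the sleep interval are hypothetical.

```shell
# Succeeds only if the SHOW output on stdin lists at least one node and no
# node is reported as "not connected" or still "starting".
# Assumption: substring matching on the SHOW text is sufficient here.
show_reports_healthy() {
  out=$(cat)
  printf '%s\n' "$out" | grep -q 'id=' || return 1
  if printf '%s\n' "$out" | grep -Eq 'not connected|starting'; then
    return 1
  fi
  return 0
}

# Poll ndb_mgm until SHOW looks healthy or the attempts run out.
# Usage: wait_for_cluster [ATTEMPTS]   (default 30, 2 seconds apart)
wait_for_cluster() {
  attempts=${1:-30}
  while [ "$attempts" -gt 0 ]; do
    if ndb_mgm -e "SHOW" 2>/dev/null | show_reports_healthy; then
      return 0
    fi
    attempts=$((attempts - 1))
    sleep 2
  done
  return 1
}
```

The SQL-side check (SELECT 1 through each mysqld) still has to be run separately; this sketch only covers the management view.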

2. Create the database and tables:

   CREATE DATABASE IF NOT EXISTS depstatepp_bughunt;
   USE depstatepp_bughunt;

   CREATE TABLE IF NOT EXISTS trial_case_000004_depstate_canary (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000004_0 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000004_1 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

   CREATE TABLE IF NOT EXISTS trial_case_000004_2 (
     case_id VARCHAR(64) NOT NULL,
     table_idx INT NOT NULL,
     k INT NOT NULL,
     node_hint INT NOT NULL,
     payload VARCHAR(192) NOT NULL,
     v BIGINT NOT NULL,
     PRIMARY KEY(case_id,table_idx,k),
     KEY idx_node_hint(node_hint)
   ) ENGINE=NDBCLUSTER;

3. Insert initial data:
   - 32 rows into each of trial_case_000004_0, trial_case_000004_1, and trial_case_000004_2
   - use case_id='case_000004'
   - keep the same seed for repeated runs
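
One way to get a repeatable load is to generate the INSERT statements deterministically. This is a sketch only: the report does not include the original seed data, so the k, node_hint, payload, and v values below are placeholders, and the function name gen_seed_insert is hypothetical. Only the table names, column list, row count, and case_id come from the steps above.

```shell
# Emit one multi-row INSERT that loads 32 rows into trial_case_000004_<table_idx>.
# Assumption: k, node_hint, payload and v are placeholder values; the
# original seed data is not part of the report.
gen_seed_insert() {
  table_idx=$1
  printf 'INSERT INTO trial_case_000004_%s\n' "$table_idx"
  printf '  (case_id,table_idx,k,node_hint,payload,v)\nVALUES\n'
  k=0
  while [ "$k" -lt 32 ]; do
    sep=','
    [ "$k" -eq 31 ] && sep=';'          # last row closes the statement
    printf "  ('case_000004',%s,%s,%s,'seed:%s:%s',%s)%s\n" \
      "$table_idx" "$k" "$((k % 4))" "$table_idx" "$k" "$k" "$sep"
    k=$((k + 1))
  done
}
```

For example, the output of gen_seed_insert 0, gen_seed_insert 1, and gen_seed_insert 2 can be piped into the mysql client with depstatepp_bughunt as the default database. Because the values depend only on table_idx and k, repeated runs load identical rows.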

4. Run a pre-action phase successfully.

5. Run a light read-only phase (metadata / table contents) and confirm it succeeds, for example:
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000004_0;
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000004_1;
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000004_2;

6. Immediately after that, run on the management node:
   ndb_mgm -e "SHOW"

   In the captured failing run, this command succeeded and showed:
   - 4 data nodes connected
   - 4 mysqld(API) nodes connected

7. Immediately after step 6, start 4 concurrent SQL clients.
   In each client, run a post-action script shaped like:

   USE depstatepp_bughunt;

   INSERT INTO trial_case_000004_depstate_canary
   (case_id,table_idx,k,node_hint,payload,v)
   VALUES('case_000004',-1,<client_id>,<client_id>,
          'depstate_canary:case_000004:post_action:<client_id>',
          900000+<client_id>)
   ON DUPLICATE KEY UPDATE
     node_hint=VALUES(node_hint),
     payload=VALUES(payload),
     v=VALUES(v);

   CREATE TABLE IF NOT EXISTS trial_case_000004_0 (...) ENGINE=NDBCLUSTER;
   UPDATE trial_case_000004_0 SET v=v+7 WHERE case_id='case_000004';
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000004_0;

   CREATE TABLE IF NOT EXISTS trial_case_000004_1 (...) ENGINE=NDBCLUSTER;
   UPDATE trial_case_000004_1 SET v=v+7 WHERE case_id='case_000004';
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000004_1;

   CREATE TABLE IF NOT EXISTS trial_case_000004_2 (...) ENGINE=NDBCLUSTER;
   UPDATE trial_case_000004_2 SET v=v+7 WHERE case_id='case_000004';
   SELECT COUNT(*), COALESCE(SUM(v),0) FROM trial_case_000004_2;
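
The concurrent start in step 7 can be driven from one shell. This is a sketch under stated assumptions: run_clients and run_post_action are hypothetical helpers, the per-client script files post_action_<id>.sql and log files sql_client_<id>.log are hypothetical names, and the host names and credentials are modeled on those mentioned in this report rather than taken from the original harness.

```shell
# Run N concurrent copies of a command, one per client id (0..N-1), wait
# for all of them, and print how many exited with a non-zero status.
# Usage: run_clients N CMD [ARGS...]   (the client id is appended as the last argument)
run_clients() {
  n=$1
  shift
  pids=''
  i=0
  while [ "$i" -lt "$n" ]; do
    "$@" "$i" &
    pids="$pids $!"
    i=$((i + 1))
  done
  failed=0
  for pid in $pids; do
    wait "$pid" || failed=$((failed + 1))
  done
  echo "$failed"
}

# Hypothetical per-client command: feed one post-action script to one mysqld
# and keep its output. Host names, credentials and file names are assumptions.
run_post_action() {
  id=$1
  mysql -udepstate -pdepstate -h "depstate-b-ndb$((id + 1))" \
    < "post_action_${id}.sql" > "sql_client_${id}.log" 2>&1
}
```

With these in place, run_clients 4 run_post_action starts all four clients at nearly the same time and prints the number that failed. In batch mode the mysql client stops at the first error and exits non-zero, so a client that hits ERROR 1205 is counted as failed.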

8. Observe whether the concurrent clients fail with:
   ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
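
The check in step 8 can be reduced to a count per side. This is a small sketch, assuming each client's output was saved to its own log file; the function name and the file names in the usage note are hypothetical.

```shell
# Print how many of the given log files contain "ERROR 1205".
count_lock_timeouts() {
  count=0
  for f in "$@"; do
    if grep -q 'ERROR 1205' "$f"; then
      count=$((count + 1))
    fi
  done
  echo "$count"
}
```

For example, count_lock_timeouts sql_client_0.log sql_client_1.log sql_client_2.log sql_client_3.log would print 4 for a run matching the failing mutated side and 0 for a run matching the baseline side.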

Observed failing run:
- mutated side (EventLogBufferSize=32768):
  all 4 clients failed with ERROR 1205
- baseline side (EventLogBufferSize=8192):
  the same workload completed successfully
- the metadata SHOW command had succeeded on both sides before the failure

The key thing to test is whether increasing EventLogBufferSize makes the immediate post-SHOW metadata/DDL-light workload hit a lock wait timeout on NDB tables.

Suggested fix:
Please investigate whether increasing EventLogBufferSize can expose unstable lock behavior for metadata/DDL-light post-action workloads that run immediately after a successful ndb_mgm -e SHOW command.

If this is expected, it would help to document it more clearly. Otherwise, lock handling around this timing window may need investigation, because the same workload succeeds on the baseline configuration while the larger EventLogBufferSize configuration hits ERROR 1205 in all concurrent clients.