Description:
I observed an NDB lock wait timeout that appears during the post-rejoin window after restarting a single data node in MySQL Cluster Community Server 9.3.0-cluster.
Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- Per cluster:
  - 1 management node
  - 4 data nodes (ndbmtd)
  - 4 SQL/API nodes (mysqld)
I first found this with two side-by-side clusters (A/B) under the same workload. The important part is:
- same initial schema
- same initial data
- same workload shape
- same action timing
- only config difference:
  [ndbd default] TimeBetweenGlobalCheckpoints (milliseconds)
  baseline = 2000 (the default)
  mutated = 20
What I did:
1. Created 3 NDB tables plus 1 canary table in database depstatepp_bughunt.
2. Loaded the same initial data into all 3 main tables (128 rows per table).
3. Used 4 concurrent SQL clients, one logical shard per node_hint value (0,1,2,3).
4. Performed a single-node rejoin action on data node 2.
5. In the post-action window, ran concurrent UPDATE statements from the 4 SQL clients.
The post-action pattern for each client was:
- UPSERT one canary row
- UPDATE trial_case_000007_0 SET v=v+7 WHERE case_id='case_000007' AND node_hint=<shard>
- SELECT COUNT(*), SUM(v) from that shard
- UPDATE trial_case_000007_1 ...
- SELECT COUNT(*), SUM(v) ...
- UPDATE trial_case_000007_2 ...
- SELECT COUNT(*), SUM(v) ...
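The per-client pattern above can be rendered mechanically for each node_hint value. The following is a minimal Python sketch, not the harness that produced the captured run. The function name client_script is hypothetical, and the canary k, payload suffix, and v for shards other than 1 are assumptions extrapolated from the client 1 script shown under "How to repeat". Line numbering in the generated script will not necessarily match the original client scripts.

```python
def client_script(shard: int, case_id: str = "case_000007", tables: int = 3) -> str:
    """Render the post-action SQL for one client (one node_hint value)."""
    db = "depstatepp_bughunt"
    base = f"trial_{case_id}"  # 'case_000007' -> trial_case_000007
    stmts = [
        f"USE {db};",
        # Canary upsert. For shard 1 this reproduces the values in the report
        # (k=1, v=900101); for other shards k, payload and v are ASSUMED to
        # follow the same scheme.
        f"INSERT INTO {base}_depstate_canary (case_id,table_idx,k,node_hint,payload,v) "
        f"VALUES('{case_id}',-1,{shard},{shard},"
        f"'depstate_canary:{case_id}:post_action:sql_client_{shard}:{shard}',{900100 + shard}) "
        "ON DUPLICATE KEY UPDATE node_hint=VALUES(node_hint), "
        "payload=VALUES(payload), v=VALUES(v);",
    ]
    for t in range(tables):
        # Shard-local UPDATE followed by the verification SELECT on the same shard.
        stmts.append(
            f"UPDATE {base}_{t} SET v=v+7 "
            f"WHERE case_id='{case_id}' AND node_hint={shard};"
        )
        stmts.append(
            f"SELECT COUNT(*), COALESCE(SUM(v),0) FROM {base}_{t} "
            f"WHERE node_hint={shard};"
        )
    return "\n".join(stmts) + "\n"


if __name__ == "__main__":
    for shard in range(4):
        print(f"-- client {shard}")
        print(client_script(shard))
```

Each rendered script can be written to a file and fed to one mysql client session per shard, so that all four sessions start at about the same time.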
What I expected:
I expected the post-rejoin workload to either:
1) succeed, or
2) fail symmetrically and predictably on both sides if the cluster was not yet ready.
What actually happened:
Only the mutated side hit SQL runtime errors.
Two SQL clients on the mutated side failed with:
ERROR 1205 (HY000) at line 4: Lock wait timeout exceeded; try restarting transaction
The affected clients were:
- sql_client_1 (node_hint=1)
- sql_client_2 (node_hint=2)
In both client scripts, line 4 is the first post-action UPDATE, for example:
UPDATE trial_case_000007_0
SET v=v+7
WHERE case_id='case_000007' AND node_hint=1
and
UPDATE trial_case_000007_0
SET v=v+7
WHERE case_id='case_000007' AND node_hint=2
At the same time, the management status snapshot on the failing side still showed:
Node 2: starting (Last completed phase 100)
while the baseline side showed all 4 data nodes started.
Relevant captured timing from the failing run:
- B-side data node 2 failure became visible in SQL/API logs at:
2026-06-07T13:28:23.760Z
- B-side data node 2 subscribe/rejoin activity became visible at:
2026-06-07T13:28:39.431Z
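For reference, the gap between the two log timestamps above can be computed directly. This is only arithmetic on the two captured values; it does not say how long node 2 remained in the "starting" state afterwards.

```python
from datetime import datetime


def parse_utc(ts: str) -> datetime:
    # Replace the trailing 'Z' so that older Python versions can parse it too.
    return datetime.fromisoformat(ts.replace("Z", "+00:00"))


failure_seen = parse_utc("2026-06-07T13:28:23.760Z")  # node 2 failure visible
rejoin_seen = parse_utc("2026-06-07T13:28:39.431Z")   # subscribe/rejoin visible

gap = (rejoin_seen - failure_seen).total_seconds()
print(f"{gap:.3f} s")  # prints "15.671 s"
```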
The visible symptom is:
after a single data node rejoin, while the restarted node is still in the recovery/start window, concurrent shard-local UPDATE statements may hit ERROR 1205 on the mutated configuration, even though the same workload completes on the baseline side.
How to repeat:
This issue appears to be timing-sensitive. The key is to run concurrent shard-local UPDATE statements in the immediate post-rejoin window after restarting one data node.
Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd
- 4 x mysqld
Configuration used in the captured failing run:
[ndbd default]
TimeBetweenGlobalCheckpoints=20
Schema:
CREATE DATABASE IF NOT EXISTS depstatepp_bughunt;
USE depstatepp_bughunt;
CREATE TABLE IF NOT EXISTS trial_case_000007_depstate_canary (
case_id VARCHAR(64) NOT NULL,
table_idx INT NOT NULL,
k INT NOT NULL,
node_hint INT NOT NULL,
payload VARCHAR(192) NOT NULL,
v BIGINT NOT NULL,
PRIMARY KEY(case_id, table_idx, k)
) ENGINE=NDBCLUSTER;
CREATE TABLE IF NOT EXISTS trial_case_000007_0 (
case_id VARCHAR(64) NOT NULL,
table_idx INT NOT NULL,
k INT NOT NULL,
node_hint INT NOT NULL,
payload VARCHAR(192) NOT NULL,
v BIGINT NOT NULL,
PRIMARY KEY(case_id, table_idx, k),
KEY idx_node_hint(node_hint)
) ENGINE=NDBCLUSTER;
CREATE TABLE IF NOT EXISTS trial_case_000007_1 (
case_id VARCHAR(64) NOT NULL,
table_idx INT NOT NULL,
k INT NOT NULL,
node_hint INT NOT NULL,
payload VARCHAR(192) NOT NULL,
v BIGINT NOT NULL,
PRIMARY KEY(case_id, table_idx, k),
KEY idx_node_hint(node_hint)
) ENGINE=NDBCLUSTER;
CREATE TABLE IF NOT EXISTS trial_case_000007_2 (
case_id VARCHAR(64) NOT NULL,
table_idx INT NOT NULL,
k INT NOT NULL,
node_hint INT NOT NULL,
payload VARCHAR(192) NOT NULL,
v BIGINT NOT NULL,
PRIMARY KEY(case_id, table_idx, k),
KEY idx_node_hint(node_hint)
) ENGINE=NDBCLUSTER;
Initial data:
- insert 128 rows into each of trial_case_000007_0 / 1 / 2
- use case_id='case_000007'
- distribute node_hint across 0,1,2,3
- for example, k % 4 can determine node_hint
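The seed rows described above can be generated with a short script. This is a sketch under assumptions: the function name seed_sql is hypothetical, and the payload text and the initial value of v are placeholders, since the original seed values are not part of this report. Only the row count, case_id, and the k % 4 distribution come from the description.

```python
def seed_sql(table_idx: int, rows: int = 128, case_id: str = "case_000007") -> str:
    """Build one multi-row INSERT for trial_<case_id>_<table_idx>."""
    table = f"trial_{case_id}_{table_idx}"
    values = []
    for k in range(rows):
        node_hint = k % 4                             # spreads rows over 0,1,2,3
        payload = f"seed:{case_id}:{table_idx}:{k}"   # ASSUMED placeholder payload
        v = k                                         # ASSUMED initial value
        values.append(
            f"('{case_id}',{table_idx},{k},{node_hint},'{payload}',{v})"
        )
    return (
        f"INSERT INTO {table} (case_id,table_idx,k,node_hint,payload,v) VALUES\n"
        + ",\n".join(values)
        + ";\n"
    )


if __name__ == "__main__":
    print("USE depstatepp_bughunt;")
    for idx in range(3):
        print(seed_sql(idx))
```

With 128 rows and k % 4, each node_hint value receives 32 rows per table.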
Steps:
1. Start the cluster and wait until:
ndb_mgm -e "ALL STATUS"
shows all 4 data nodes started.
2. Load the schema and seed data above.
3. Open 4 concurrent SQL sessions.
Map them logically as:
- client 0 -> node_hint=0
- client 1 -> node_hint=1
- client 2 -> node_hint=2
- client 3 -> node_hint=3
4. From the management node, restart a single data node:
ndb_mgm -e "2 RESTART"
5. Do not wait for the cluster to fully stabilize.
Instead, monitor:
- ndb_mgm -e "ALL STATUS"
- mysqld error log
- especially whether node 2 is still "starting"
6. In the immediate post-rejoin window, while node 2 is still rejoining / starting, run these concurrent client scripts.
For client 1:
USE depstatepp_bughunt;
INSERT INTO trial_case_000007_depstate_canary
(case_id,table_idx,k,node_hint,payload,v)
VALUES('case_000007',-1,1,1,'depstate_canary:case_000007:post_action:sql_client_1:1',900101)
ON DUPLICATE KEY UPDATE
node_hint=VALUES(node_hint),
payload=VALUES(payload),
v=VALUES(v);
UPDATE trial_case_000007_0
SET v=v+7
WHERE case_id='case_000007' AND node_hint=1;
SELECT COUNT(*), COALESCE(SUM(v),0)
FROM trial_case_000007_0
WHERE node_hint=1;
UPDATE trial_case_000007_1
SET v=v+7
WHERE case_id='case_000007' AND node_hint=1;
SELECT COUNT(*), COALESCE(SUM(v),0)
FROM trial_case_000007_1
WHERE node_hint=1;
UPDATE trial_case_000007_2
SET v=v+7
WHERE case_id='case_000007' AND node_hint=1;
SELECT COUNT(*), COALESCE(SUM(v),0)
FROM trial_case_000007_2
WHERE node_hint=1;
For client 2, use the same pattern with node_hint=2.
Run analogous scripts for clients 0 and 3 with node_hint=0 and 3.
7. While the 4 clients are running, keep polling:
ndb_mgm -e "ALL STATUS"
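To decide whether the clients are still inside the rejoin window, the status polling in steps 5 and 7 can be checked programmatically. The sketch below only parses text that has already been captured from ndb_mgm -e "ALL STATUS"; it assumes each data node is reported on a line of the form "Node <id>: <state> (<detail>)", as in the snapshot line quoted in this report. The function names are hypothetical.

```python
import re

# One line per data node, e.g. "Node 2: starting (Last completed phase 100)".
NODE_LINE = re.compile(r"^Node (\d+): (\w+)(?: \((.*)\))?", re.MULTILINE)


def parse_all_status(text: str) -> dict[int, tuple[str, str]]:
    """Map node id -> (state, detail) for every 'Node N: ...' line found."""
    return {
        int(node): (state, detail or "")
        for node, state, detail in NODE_LINE.findall(text)
    }


def nodes_not_started(text: str) -> list[int]:
    """Return the ids of nodes whose first state word is not 'started'."""
    return sorted(
        node
        for node, (state, _detail) in parse_all_status(text).items()
        if state != "started"
    )
```

Only the first word of the state is captured, so any multi-word state is still reported as not started. A driver could launch the four client scripts as long as nodes_not_started() still returns [2], and record the last snapshot next to each client's exit status.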
Observed failing symptom:
- client 1 and client 2 may fail on the first UPDATE with:
ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
- in the captured run, the failing side still showed:
Node 2: starting (Last completed phase 100)
Relevant captured timing:
- B-side node 2 failure visible in mysqld logs:
2026-06-07T13:28:23.760Z
- B-side node 2 subscribe/rejoin visible:
2026-06-07T13:28:39.431Z
So the thing to test is:
concurrent shard-local UPDATE during the short post-rejoin recovery window of one data node, especially with TimeBetweenGlobalCheckpoints set aggressively low.
Suggested fix:
Please investigate whether the post-rejoin recovery window after a single data node restart can expose unstable lock behavior for concurrent shard-local UPDATE statements, especially when TimeBetweenGlobalCheckpoints is set very low.
If the cluster is not yet ready for this kind of post-rejoin write workload, it would be better to either:
1) block the workload more clearly until the node is fully ready, or
2) ensure that lock handling during the rejoin window does not produce unexpected ERROR 1205 for otherwise ordinary shard-local UPDATE statements.