Bug #120626 NDB data diverges after SHOW and concurrent updates
Submitted: 8 Jun 7:54
Reporter: cundi fang Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:9.3.0 OS:Ubuntu (22.04)
Assigned to: CPU Architecture:Any
Tags: data-divergence, lock-wait-timeout, ndb, NDB_MGM, TimeBetweenGlobalCheckpoints

[8 Jun 7:54] cundi fang
Description:
I found a reproducible NDB data divergence candidate in MySQL Cluster Community Server 9.3.0-cluster.

I used two otherwise identical NDB clusters side by side (A and B) to detect behavioral differences under the same data seed and the same SQL workload. The only intentional difference was:

- Cluster A: default / baseline setting
- Cluster B: [ndbd default] TimeBetweenGlobalCheckpoints changed from 2000 to 20

Topology on each side:
- 1 management node
- 4 data nodes (ndbmtd)
- 4 SQL/API nodes (mysqld)

Both sides were healthy before the failing phase:
- all data nodes were started
- all SQL/API nodes were connected
- local SQL health checks succeeded
- metadata / table definitions matched on both sides

Test shape:
- database: depstatepp_bughunt
- 4 NDB tables involved:
  - trial_case_000000_depstate_canary
  - trial_case_000000_0
  - trial_case_000000_1
  - trial_case_000000_2
- all tables use ENGINE=NDBCLUSTER
- the 3 main trial tables have the same schema pattern:
  (case_id, table_idx, k, node_hint, payload, v)
  with PRIMARY KEY(case_id, table_idx, k)
  and KEY idx_node_hint(node_hint)

What I did:
1. Seeded both clusters with the same initial data.
2. Ran the same pre-action SQL on both sides using 4 concurrent SQL clients.
3. Executed a management metadata command on both sides:
   ndb_mgm -e SHOW
4. Immediately after that, ran the same post-action SQL on both sides, again with 4 concurrent SQL clients.

The post-action SQL pattern was:
- one canary INSERT/UPSERT per client
- then, for each client shard, 3 UPDATE statements:
  UPDATE trial_case_000000_X
  SET v = v + 7
  WHERE case_id='case_000000' AND node_hint=<shard>
- each UPDATE was followed by a COUNT/SUM query

What I expected:
Since the two sides had the same seed data, the same table definitions, the same client-sharded SQL, and all updates were logically shard-local by node_hint, I expected both sides either:
1) to complete successfully and end in the same final data state, or
2) to fail in the same way.

What actually happened:
- On cluster A, SQL client 3 (running on depstate-a-ndb4) failed in the post-action phase with:
  ERROR 1205 (HY000): Lock wait timeout exceeded; try restarting transaction
- The failure happened at line 4 of that client script, i.e. on the first post-action UPDATE:
  UPDATE trial_case_000000_0
  SET v = v + 7
  WHERE case_id='case_000000' AND node_hint=3
- On cluster B, the corresponding client completed successfully.
- After the run, both sides were still healthy and metadata remained identical.
- Row counts were also identical:
  101 rows on A and 101 rows on B
- Table counts were identical:
  5 tables on A and 5 tables on B
- But the final data-state hashes diverged:
  A: 09f576c34eeacff9340d0406c435c354e245ff1e1b952d8b2a4605d37157835d
  B: 22e1afacae64c65fd32b00a146133be871eaec4b9556d770d66186abed2ce100

The total value sum also diverged:
- A total v sum: 6038602
- B total v sum: 6038770

The difference is exactly 168, which matches 24 rows not receiving the expected +7 update
(8 rows per table × 3 tables × 7 = 168).

So the visible symptom is:
one side gets a lock wait timeout in a shard-local post-action UPDATE, the other side succeeds, and the final NDB data state diverges even though metadata and row counts remain identical.

How to repeat:
This issue is easiest to observe with two side-by-side NDB clusters, because the bug is a final state divergence under the same workload.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker
- Per cluster:
  - 1 x ndb_mgmd
  - 4 x ndbmtd
  - 4 x mysqld (API / SQL)
- Same schema and same seed data on both sides

Config:
- Cluster A:
  [ndbd default]
  TimeBetweenGlobalCheckpoints=2000
- Cluster B:
  [ndbd default]
  TimeBetweenGlobalCheckpoints=20

Timeline / sequence:

T0. Start both clusters and wait until both are healthy.
On both sides, verify:
- ndb_mgm -e SHOW shows 4 data nodes and 4 mysqld(API) nodes connected
- local SQL health check succeeds (for example SELECT 1)

T1. On both sides, create the same schema:

CREATE DATABASE IF NOT EXISTS depstatepp_bughunt;
USE depstatepp_bughunt;

CREATE TABLE IF NOT EXISTS trial_case_000000_depstate_canary (
  case_id VARCHAR(64) NOT NULL,
  table_idx INT NOT NULL,
  k INT NOT NULL,
  node_hint INT NOT NULL,
  payload VARCHAR(192) NOT NULL,
  v BIGINT NOT NULL,
  PRIMARY KEY(case_id, table_idx, k)
) ENGINE=NDBCLUSTER;

CREATE TABLE IF NOT EXISTS trial_case_000000_0 (
  case_id VARCHAR(64) NOT NULL,
  table_idx INT NOT NULL,
  k INT NOT NULL,
  node_hint INT NOT NULL,
  payload VARCHAR(192) NOT NULL,
  v BIGINT NOT NULL,
  PRIMARY KEY(case_id, table_idx, k),
  KEY idx_node_hint(node_hint)
) ENGINE=NDBCLUSTER;

CREATE TABLE IF NOT EXISTS trial_case_000000_1 (
  case_id VARCHAR(64) NOT NULL,
  table_idx INT NOT NULL,
  k INT NOT NULL,
  node_hint INT NOT NULL,
  payload VARCHAR(192) NOT NULL,
  v BIGINT NOT NULL,
  PRIMARY KEY(case_id, table_idx, k),
  KEY idx_node_hint(node_hint)
) ENGINE=NDBCLUSTER;

CREATE TABLE IF NOT EXISTS trial_case_000000_2 (
  case_id VARCHAR(64) NOT NULL,
  table_idx INT NOT NULL,
  k INT NOT NULL,
  node_hint INT NOT NULL,
  payload VARCHAR(192) NOT NULL,
  v BIGINT NOT NULL,
  PRIMARY KEY(case_id, table_idx, k),
  KEY idx_node_hint(node_hint)
) ENGINE=NDBCLUSTER;

T2. Seed both sides with the same initial data.
Important properties of the seed:
- same case_id = 'case_000000'
- same rows on A and B
- 32 rows in each of the 3 main trial tables
- node_hint distributed across 0,1,2,3
- 5 canary rows in the canary table

T3. Run the same pre-action workload on both clusters using 4 concurrent SQL clients.
Each client is responsible for one shard/node_hint:
- client 0 -> node_hint=0
- client 1 -> node_hint=1
- client 2 -> node_hint=2
- client 3 -> node_hint=3

T4. Immediately after pre-action completes, run on both management nodes:
  ndb_mgm -e SHOW

T5. Immediately after T4, run the following post-action workload concurrently on both clusters, again with 4 SQL clients.

For client 3 (the shard that failed in the captured run), the SQL is:

USE depstatepp_bughunt;

INSERT INTO trial_case_000000_depstate_canary
(case_id,table_idx,k,node_hint,payload,v)
VALUES('case_000000',-1,3,3,'depstate_canary:case_000000:post_action:sql_client_3:3',900303)
ON DUPLICATE KEY UPDATE
  node_hint=VALUES(node_hint),
  payload=VALUES(payload),
  v=VALUES(v);

UPDATE trial_case_000000_0
SET v=v+7
WHERE case_id='case_000000' AND node_hint=3;

SELECT COUNT(*), COALESCE(SUM(v),0)
FROM trial_case_000000_0
WHERE node_hint=3;

UPDATE trial_case_000000_1
SET v=v+7
WHERE case_id='case_000000' AND node_hint=3;

SELECT COUNT(*), COALESCE(SUM(v),0)
FROM trial_case_000000_1
WHERE node_hint=3;

UPDATE trial_case_000000_2
SET v=v+7
WHERE case_id='case_000000' AND node_hint=3;

SELECT COUNT(*), COALESCE(SUM(v),0)
FROM trial_case_000000_2
WHERE node_hint=3;

At the same time, run the analogous node_hint=0/1/2 scripts in the other 3 clients.

T6. Compare the results on both sides.
In the captured failing run:
- A/client 3 failed with:
  ERROR 1205 (HY000) at line 4: Lock wait timeout exceeded; try restarting transaction
- B/client 3 completed successfully
- all other visible health checks remained OK

T7. Compare final state, not only metadata.
A simple whole-state check is:
- total row count across the test tables
- total SUM(v)
- ordered hash over all rows in the 4 test tables

Observed failing result:
- same row count on both sides
- same table metadata on both sides
- different final data-state hash
- total SUM(v) differed by 168

This suggests a real final-state divergence, not just a transient client-side timeout.

Suggested fix:
Please investigate whether a management metadata command (ndb_mgm -e SHOW) executed between two phases of a concurrent NDB workload can interact badly with transaction / lock handling, especially when TimeBetweenGlobalCheckpoints is changed aggressively.

The unexpected part is not only the lock wait timeout on one side, but the fact that two otherwise identical runs end in different final NDB data states even though metadata and row counts remain identical.

At minimum, it would be helpful to understand:
1) why a shard-local UPDATE on node_hint=3 can hit ERROR 1205 here,
2) why the corresponding run on the other side succeeds,
3) whether the failed transaction path can lead to a partially divergent final state under this workload pattern.