Bug #120229 NDB 9.3.0-cluster shows inconsistent post-restart readiness
Submitted: 8 Apr 3:08
Reporter: CunDi Fang
Status: Open
Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: MySQL NDB Cluster 9.3.0-cluster
OS: Linux
Assigned to:
CPU Architecture: Any
Tags: arbitrationtimeout, convergence, mysqld, ndb, NDB_MGM, readiness, restart, socket

[8 Apr 3:08] CunDi Fang
Description:
I observed an NDB Cluster readiness/convergence inconsistency in MySQL Cluster Community Server 9.3.0-cluster.

Environment:
- Two independent 9-node NDB clusters used side by side for comparison
- Each cluster has 1 management node, 4 data nodes (ndbmtd), and 4 SQL/API nodes (mysqld)
- Server version reported by SQL nodes: 9.3.0-cluster
- Linux / Docker-based environment
- Test database: depstate_ab

What I did:
1. Started two identical clusters (A and B).
2. Verified that both clusters were healthy and that SQL queries worked on both sides.
3. Ran a baseline SQL sequence successfully on both clusters.
4. Between baseline and perturbation, changed only cluster B config:
   [ndbd default]
   ArbitrationTimeout = 9375
   (old value was 7500)
5. Restarted cluster B and waited until it converged.
6. Immediately after that, before sending the first perturbation SQL, I checked cluster state and SQL readiness.

What I expected:
- Since only cluster B was modified, cluster A should remain fully connected and stable.
- If a node is still not connected, I would expect mysqld/API readiness to reflect that consistently.
- Management status and SQL/API readiness should agree.

What actually happened:
- Baseline completed successfully on both clusters.
- The perturbation phase failed before the first perturbation SQL was dispatched.
- On cluster A, management status showed one data node still not connected:
  Node 5: not connected
- However, at the same time, all four SQL/API candidates on cluster A had a live local socket
  (/var/run/mysqld/mysqld.sock), and local socket-based SELECT 1 succeeded on all of them.
- In addition, the mysqld log on the SQL node corresponding to the reconnecting data node reported:
  "connection[0], NodeID: 9, all storage nodes connected"
  followed by
  "ready for connections"
- So the management view and the SQL/API readiness view were inconsistent in the same restart/recovery window.

This looks like a restart/readiness/convergence inconsistency in NDB 9.3.0-cluster, not just a normal SQL error.
I am attaching the management/status evidence, SQL node logs, and the exact baseline sequence used.

How to repeat:
The issue was captured in a differential test setup, but the core observable behavior can be checked manually without the test harness.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- Two independent clusters A and B
- Each cluster:
  - 1 x ndb_mgmd
  - 4 x ndbmtd data nodes
  - 4 x mysqld SQL/API nodes
- Local MySQL socket in each SQL node:
  /var/run/mysqld/mysqld.sock
- Database:
  depstate_ab

Baseline sequence (executed successfully on both clusters before the failure):
0. CREATE TABLE IF NOT EXISTS `depstate_ab`.dep_pair_kv (id INT PRIMARY KEY, value INT, note VARCHAR(64)) ENGINE=NDBCLUSTER;
1. DELETE FROM `depstate_ab`.dep_pair_kv;
2. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (1,280,'vis_0') ON DUPLICATE KEY UPDATE value=280, note='vis_0';
3. SELECT id,value,note FROM `depstate_ab`.dep_pair_kv WHERE id=2;
4. SELECT COUNT(*), COALESCE(MAX(value),0) FROM `depstate_ab`.dep_pair_kv;
5. UPDATE `depstate_ab`.dep_pair_kv SET note='meta_3' WHERE id=4;
6. SELECT COALESCE(SUM(value),0) FROM `depstate_ab`.dep_pair_kv WHERE id BETWEEN 1 AND 9;
7. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (6,168,'vis_5') ON DUPLICATE KEY UPDATE value=168, note='vis_5';
8. SELECT id,value,note FROM `depstate_ab`.dep_pair_kv WHERE id=7;
9. SELECT COUNT(*), COALESCE(MAX(value),0) FROM `depstate_ab`.dep_pair_kv;
10. UPDATE `depstate_ab`.dep_pair_kv SET note='meta_8' WHERE id=9;
11. SELECT COALESCE(SUM(value),0) FROM `depstate_ab`.dep_pair_kv WHERE id BETWEEN 6 AND 14;
12. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (11,130,'vis_10') ON DUPLICATE KEY UPDATE value=130, note='vis_10';
13. SELECT id,value,note FROM `depstate_ab`.dep_pair_kv WHERE id=12;
14. SELECT COUNT(*), COALESCE(MAX(value),0) FROM `depstate_ab`.dep_pair_kv;
15. UPDATE `depstate_ab`.dep_pair_kv SET note='meta_13' WHERE id=14;
16. SELECT COALESCE(SUM(value),0) FROM `depstate_ab`.dep_pair_kv WHERE id BETWEEN 11 AND 19;
17. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (16,575,'vis_15') ON DUPLICATE KEY UPDATE value=575, note='vis_15';

Then:

1. Keep cluster A unchanged.
2. On cluster B only, change:
   [ndbd default]
   ArbitrationTimeout=9375
   (previous value was 7500)
3. Restart cluster B and wait until it appears converged / ready.
4. Immediately after that (before sending any new SQL workload), check cluster A in this order:

   a) Run on cluster A management node:
      ndb_mgm -e "SHOW"
      ndb_mgm -e "ALL STATUS"

   b) On each SQL/API node of cluster A (ndb1, ndb2, ndb3, ndb4), check:
      ls -l /var/run/mysqld/mysqld.sock

   c) On each SQL/API node of cluster A, run:
      mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -e "SELECT @@hostname, @@port, @@version; USE depstate_ab; SELECT 1;"
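To automate the cross-check in steps a) and c), the `ndb_mgm -e "SHOW"` output can be parsed and compared against the mysqld-side claim. A minimal sketch, assuming the "id=N (not connected, ...)" line format shown in this report (function names are illustrative; a live run would feed it the captured SHOW output):

```python
import re

def not_connected_nodes(show_output: str) -> set[int]:
    """Extract node ids that ndb_mgm 'SHOW' reports as not connected."""
    pattern = re.compile(r"id=(\d+)[^\n]*not connected", re.IGNORECASE)
    return {int(m.group(1)) for m in pattern.finditer(show_output)}

def views_agree(show_output: str, sql_reports_all_connected: bool) -> bool:
    """True only if the management view and the mysqld-side claim match:
    a gap in the management view must coincide with mysqld NOT claiming
    that all storage nodes are connected."""
    mgm_sees_gap = bool(not_connected_nodes(show_output))
    return mgm_sees_gap == (not sql_reports_all_connected)

# Sample fragment modeled on the failing run described in this report:
sample = """
[ndbd(NDB)]     4 node(s)
id=3    @172.17.0.4  (mysql-9.3.0 ndb-9.3.0, Nodegroup: 0)
id=5 (not connected, accepting connect from ndb4)
"""
print(not_connected_nodes(sample))            # {5}
print(views_agree(sample, True))              # False: the reported inconsistency
```

In the captured failing window, the management view showed node 5 not connected while mysqld claimed full storage connectivity, so `views_agree(..., True)` returning False is exactly the symptom to detect.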

Observed in the captured failing run:
- Cluster A candidate containers had start timestamps around:
  2026-04-07T13:35:58Z to 2026-04-07T13:36:01Z
- On the SQL/API node corresponding to cluster A ndb4, mysqld log showed:
  2026-04-07T13:36:52.160057Z  connection[0], NodeID: 9, all storage nodes connected
  2026-04-07T13:36:52.474090Z  ready for connections
- At the same time, management/status output on cluster A repeatedly still showed:
  Node 5: not connected (accepting connect from ndb4)
- Also at the same time, all four SQL/API candidates on cluster A had a live socket and local SELECT 1 succeeded.

So the repeatable symptom to check is:
management status says one data node is not connected, while SQL/API nodes report all storage nodes connected and are queryable via local socket in the same post-restart window.

Suggested fix:
Please check whether post-restart readiness / convergence state can be reported inconsistently between:
1) ndb_mgm management status,
2) mysqld/API-side "all storage nodes connected" state, and
3) local SQL socket readiness.

If a data node is still not connected according to the management server, the SQL/API node should not yet report full storage connectivity or readiness; alternatively, the state transitions should be serialized so that the two views cannot disagree within the same post-restart window.
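Until the views are reconciled server-side, a test harness can defend against this window by gating on all three views at once (management status, mysqld API state, and local socket). A minimal sketch under that assumption; the three flags correspond to checks 4a-4c above, and the names are illustrative:

```python
import time

def cluster_ready(mgm_all_connected: bool,
                  api_all_connected: bool,
                  local_socket_ok: bool) -> bool:
    """Conservative readiness gate: ready only when all three views agree.
    In the window this report describes, mgm_all_connected is False while
    the other two are True, so the gate correctly stays closed."""
    return mgm_all_connected and api_all_connected and local_socket_ok

def wait_until_consistent(probe, timeout_s: float = 120.0,
                          interval_s: float = 2.0) -> bool:
    """Poll `probe` (a zero-arg callable returning the three flags) until
    the gate opens, or give up after timeout_s."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if cluster_ready(*probe()):
            return True
        time.sleep(interval_s)
    return False
```

A real `probe` would shell out to `ndb_mgm -e "SHOW"`, scan the mysqld log for "all storage nodes connected", and run the socket-based `SELECT 1` from step 4c.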