Description:
I observed an NDB Cluster readiness/convergence inconsistency in MySQL Cluster Community Server 9.3.0-cluster.
Environment:
- Two independent 9-node NDB clusters used side by side for comparison
- Each cluster has 1 management node, 4 data nodes (ndbmtd), and 4 SQL/API nodes (mysqld)
- Server version reported by SQL nodes: 9.3.0-cluster
- Linux / Docker-based environment
- Test database: depstate_ab
What I did:
1. Started two identical clusters (A and B).
2. Verified that both clusters were healthy and that SQL queries worked on both sides.
3. Ran a baseline SQL sequence successfully on both clusters.
4. Between baseline and perturbation, changed only the cluster B config (shown in fuller context after this list):
[ndbd default]
ArbitrationTimeout = 9375
(old value was 7500)
5. Restarted cluster B and waited until it converged.
6. Immediately after that, before sending the first perturbation SQL, I checked cluster state and SQL readiness.
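For context, the fragment below sketches where the changed parameter sits in cluster B's config.ini. Only the ArbitrationTimeout line is taken from the actual test setup; the surrounding line is an illustrative placeholder, not a recorded value.

  [ndbd default]
  # Placeholder; only ArbitrationTimeout was changed between runs.
  NoOfReplicas = 2
  # Changed on cluster B only: 7500 -> 9375
  ArbitrationTimeout = 9375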
What I expected:
- Since only cluster B was modified, cluster A should remain fully connected and stable.
- If a node is still not connected, I would expect mysqld/API readiness to reflect that consistently.
- Management status and SQL/API readiness should agree.
What actually happened:
- Baseline completed successfully on both clusters.
- The perturbation phase failed before the first perturbation SQL was dispatched.
- On cluster A, management status showed one data node still not connected:
Node 5: not connected
- However, at the same time, all four SQL/API nodes on cluster A had a live local socket
(/var/run/mysqld/mysqld.sock), and local socket-based SELECT 1 succeeded on all of them.
- In addition, the log of the mysqld running on the same host (ndb4) as the reconnecting data node reported:
"connection[0], NodeID: 9, all storage nodes connected"
and then
"ready for connections"
- So the management view and the SQL/API readiness view were inconsistent in the same restart/recovery window.
This looks like a restart/readiness/convergence inconsistency in NDB 9.3.0-cluster, not just a normal SQL error.
I am attaching the management/status evidence, SQL node logs, and the exact baseline sequence used.
How to repeat:
The issue was captured in a differential test setup, but the core observable behavior can be checked manually without the test harness.
Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- Two independent clusters A and B
- Each cluster:
- 1 x ndb_mgmd
- 4 x ndbmtd data nodes
- 4 x mysqld SQL/API nodes
- Local MySQL socket on each SQL node:
/var/run/mysqld/mysqld.sock
- Database:
depstate_ab
Baseline sequence (executed successfully on both clusters before the failure; a replay sketch follows the list):
0. CREATE TABLE IF NOT EXISTS `depstate_ab`.dep_pair_kv (id INT PRIMARY KEY, value INT, note VARCHAR(64)) ENGINE=NDBCLUSTER;
1. DELETE FROM `depstate_ab`.dep_pair_kv;
2. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (1,280,'vis_0') ON DUPLICATE KEY UPDATE value=280, note='vis_0';
3. SELECT id,value,note FROM `depstate_ab`.dep_pair_kv WHERE id=2;
4. SELECT COUNT(*), COALESCE(MAX(value),0) FROM `depstate_ab`.dep_pair_kv;
5. UPDATE `depstate_ab`.dep_pair_kv SET note='meta_3' WHERE id=4;
6. SELECT COALESCE(SUM(value),0) FROM `depstate_ab`.dep_pair_kv WHERE id BETWEEN 1 AND 9;
7. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (6,168,'vis_5') ON DUPLICATE KEY UPDATE value=168, note='vis_5';
8. SELECT id,value,note FROM `depstate_ab`.dep_pair_kv WHERE id=7;
9. SELECT COUNT(*), COALESCE(MAX(value),0) FROM `depstate_ab`.dep_pair_kv;
10. UPDATE `depstate_ab`.dep_pair_kv SET note='meta_8' WHERE id=9;
11. SELECT COALESCE(SUM(value),0) FROM `depstate_ab`.dep_pair_kv WHERE id BETWEEN 6 AND 14;
12. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (11,130,'vis_10') ON DUPLICATE KEY UPDATE value=130, note='vis_10';
13. SELECT id,value,note FROM `depstate_ab`.dep_pair_kv WHERE id=12;
14. SELECT COUNT(*), COALESCE(MAX(value),0) FROM `depstate_ab`.dep_pair_kv;
15. UPDATE `depstate_ab`.dep_pair_kv SET note='meta_13' WHERE id=14;
16. SELECT COALESCE(SUM(value),0) FROM `depstate_ab`.dep_pair_kv WHERE id BETWEEN 11 AND 19;
17. INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (16,575,'vis_15') ON DUPLICATE KEY UPDATE value=575, note='vis_15';
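For convenience, a minimal shell sketch of how the sequence can be replayed against one SQL node. The socket path and database name match the setup above; running as root over the local socket without a password is an assumption of this sketch.

  #!/bin/sh
  SOCK=/var/run/mysqld/mysqld.sock
  mysql --protocol=SOCKET -uroot -S "$SOCK" <<'EOF'
  CREATE TABLE IF NOT EXISTS `depstate_ab`.dep_pair_kv
    (id INT PRIMARY KEY, value INT, note VARCHAR(64)) ENGINE=NDBCLUSTER;
  DELETE FROM `depstate_ab`.dep_pair_kv;
  INSERT INTO `depstate_ab`.dep_pair_kv(id,value,note) VALUES (1,280,'vis_0')
    ON DUPLICATE KEY UPDATE value=280, note='vis_0';
  -- ... statements 3 through 17 exactly as listed above ...
  EOF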
Then:
1. Keep cluster A unchanged.
2. On cluster B only, change:
[ndbd default]
ArbitrationTimeout=9375
(previous value was 7500)
3. Restart cluster B and wait until it appears converged / ready.
4. Immediately after that (before sending any new SQL workload), check cluster A in this order (a combined polling sketch follows these steps):
a) Run on cluster A management node:
ndb_mgm -e "SHOW"
ndb_mgm -e "ALL STATUS"
b) On each SQL/API node of cluster A (ndb1, ndb2, ndb3, ndb4), check:
ls -l /var/run/mysqld/mysqld.sock
c) On each SQL/API node of cluster A, run:
mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -e "SELECT @@hostname, @@port, @@version; USE depstate_ab; SELECT 1;"
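The check above can be automated. The sketch below polls both views once per second and flags the inconsistent window. The management host name (mgm_a) and the use of ssh to reach each SQL node are assumptions of this sketch; in a container setup, docker exec works equally well.

  #!/bin/sh
  MGM_HOST=mgm_a   # assumption: cluster A management node host
  SOCK=/var/run/mysqld/mysqld.sock
  while true; do
    # View 1: management status
    show=$(ndb_mgm -c "$MGM_HOST" -e "SHOW" 2>/dev/null)
    # View 2: local-socket SQL readiness on every SQL/API node
    sql_ok=1
    for h in ndb1 ndb2 ndb3 ndb4; do
      ssh "$h" "mysql --protocol=SOCKET -uroot -S $SOCK -e 'SELECT 1'" \
        >/dev/null 2>&1 || sql_ok=0
    done
    # Flag the window where the two views disagree
    if printf '%s\n' "$show" | grep -q "not connected" && [ "$sql_ok" -eq 1 ]; then
      echo "$(date -u +%FT%TZ) INCONSISTENT: mgm reports a node not connected, all SQL nodes answer SELECT 1"
      printf '%s\n' "$show" | grep "not connected"
    fi
    sleep 1
  done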
Observed in the captured failing run:
- Cluster A containers had start timestamps around:
2026-04-07T13:35:58Z to 2026-04-07T13:36:01Z
- On the SQL/API node corresponding to cluster A ndb4, mysqld log showed:
2026-04-07T13:36:52.160057Z connection[0], NodeID: 9, all storage nodes connected
2026-04-07T13:36:52.474090Z ready for connections
- At the same time, management/status output on cluster A repeatedly still showed:
Node 5: not connected (accepting connect from ndb4)
- Also at the same time, all four SQL/API nodes on cluster A had a live socket and local SELECT 1 succeeded.
So the repeatable symptom to check is:
management status says one data node is not connected, while SQL/API nodes report all storage nodes connected and are queryable via local socket in the same post-restart window.
Suggested fix:
Please check whether post-restart readiness / convergence state can be reported inconsistently between:
1) ndb_mgm management status,
2) mysqld/API-side "all storage nodes connected" state, and
3) local SQL socket readiness.
If one data node is still not connected, the SQL/API node should probably not yet report full storage connectivity / readiness, or the two views should transition through a clearly defined order.
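As an interim workaround in a test harness, readiness can additionally be gated on the management view, roughly as in this sketch (the connect string mgm_a is an assumption):

  #!/bin/sh
  # Block until ndb_mgm no longer reports any "not connected" node.
  while ndb_mgm -c mgm_a -e "SHOW" | grep -q "not connected"; do
    sleep 1
  done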