Description:
I observed a post-rejoin SQL availability problem in MySQL Cluster Community Server 9.3.0-cluster.
Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- Per cluster:
  - 1 management node
  - 4 data nodes (ndbmtd)
  - 4 SQL/API nodes (mysqld)
I first noticed this while running two side-by-side clusters, but the core symptom is simpler:
after restarting a single NDB data node, mysqld may still be unreachable on 127.0.0.1:3306 during the post-rejoin window.
Configuration used in the captured run:
- baseline side:
[ndbd default] TimeBetweenGlobalCheckpoints=2000
- mutated side:
[ndbd default] TimeBetweenGlobalCheckpoints=20
Cluster action:
- single data node rejoin
- management command:
ndb_mgm -e "2 RESTART"
What I did:
1. Started healthy NDB clusters.
2. Waited until SQL/API nodes were being used normally.
3. Triggered a single-node rejoin on data node 2.
4. In the post-action window, I attempted to run SQL clients against mysqld using:
mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306
5. All SQL clients on both captured sides failed immediately with:
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)
What I expected:
After a single data-node rejoin, SQL/API nodes should either:
- remain reachable on 127.0.0.1:3306, or
- fail in a clearly signaled "not ready yet" state with a more explicit readiness indication.
What actually happened:
All 8 captured SQL clients failed with the same connection-refused error:
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)
At freeze time, management status still showed incomplete data-node startup:
A side:
- Node 2: starting (Last completed phase 100)
- Nodes 3,4,5: started
B side:
- Node 2: starting (Last completed phase 1)
- Nodes 3,4,5: starting (Last completed phase 8)
So the visible symptom is:
after a single data-node rejoin, SQL traffic may be handed to mysqld before the local TCP endpoint 127.0.0.1:3306 is actually reachable.
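To make the management-status observation easier to compare across runs, the status output can be reduced to a single number. The following is a minimal sketch, not part of the captured run: the function name count_not_started is hypothetical, and it assumes that ndb_mgm -e "ALL STATUS" prints one line per data node in the form "Node <id>: <state> ...", as in the status lines quoted above.

```shell
# count_not_started (hypothetical helper): reads `ndb_mgm -e "ALL STATUS"`
# output on stdin and prints how many per-node lines report a state other
# than "started".
# Assumption: each data node appears on its own line as
#   Node <id>: <state> ...
count_not_started() {
  awk '/^[ \t]*Node [0-9]+:/ { if ($3 != "started") n++ } END { print n + 0 }'
}

# Example use against a live cluster:
#   ndb_mgm -e "ALL STATUS" | count_not_started
```

A result of 0 would mean every reported data node is in the started state; any other value means at least one node is still in a startup phase.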
How to repeat:
This issue appears to be timing-sensitive. The important part is to test SQL reachability immediately after a single data-node restart, during the rejoin/startup window.
Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd
- 4 x mysqld
Optional configuration used in the captured run:
[ndbd default]
TimeBetweenGlobalCheckpoints=20
Steps:
1. Start the cluster and wait until all data nodes are started:
ndb_mgm -e "ALL STATUS"
2. Confirm mysqld is reachable through TCP:
mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1"
3. From the management node, restart one data node:
ndb_mgm -e "2 RESTART"
4. Do not wait for full cluster stabilization.
Instead, immediately start polling both:
- management status:
ndb_mgm -e "ALL STATUS"
- SQL reachability:
mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1"
5. Repeat the SQL reachability check every 200 ms for 10-20 seconds.
For example:
for i in $(seq 1 100); do
  # UTC timestamp with milliseconds (%3N requires GNU date)
  date -u +"%FT%T.%3NZ"
  ndb_mgm -e "ALL STATUS"
  mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1" || true
  sleep 0.2
done
Observed failing symptom in the captured run:
- mysqld traffic attempted in the post-rejoin window failed with:
ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)
- at approximately the same time, management status still showed node startup in progress rather than all nodes fully started
The thing to check is whether mysqld is put into use or health-checked by external clients too early, before TCP 127.0.0.1:3306 is actually ready after a single-node rejoin.
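When reproducing, it may help to separate "connection refused" from other client failures. Errno 111 is ECONNREFUSED on Linux, which means nothing was accepting connections on the port at that moment, as opposed to a timeout or an authentication error. The sketch below is illustrative only: the function name classify_connect_error is hypothetical, and it matches only on the error text quoted in this report.

```shell
# classify_connect_error (hypothetical helper): reads mysql client error
# text on stdin and prints one of:
#   refused     ERROR 2003 with errno 111 (ECONNREFUSED on Linux)
#   other-2003  ERROR 2003 with a different errno
#   other       anything else
classify_connect_error() {
  err=$(cat)
  case $err in
    *"ERROR 2003"*"(111)"*) echo refused ;;
    *"ERROR 2003"*)         echo other-2003 ;;
    *)                      echo other ;;
  esac
}

# Example use inside the polling loop (classify only when the client fails):
#   if ! err=$(mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 \
#                -e "SELECT 1" 2>&1 >/dev/null); then
#     printf '%s\n' "$err" | classify_connect_error
#   fi
```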
Suggested fix:
Please investigate whether SQL/API readiness is being exposed too early during the post-rejoin startup window after restarting a single NDB data node.
If mysqld is not yet reachable on 127.0.0.1:3306, it would be better either:
1) to keep the node clearly marked as not ready for SQL traffic, or
2) to delay workload handoff / health success until TCP readiness is actually established.
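As an illustration of option 2, a harness-side gate could hold back workload handoff until a probe command succeeds. This is a minimal sketch under stated assumptions, not a proposed server change: the function name wait_until_ready is hypothetical, the 200 ms retry interval mirrors the polling interval used above, and the probe is whatever command the caller supplies.

```shell
# wait_until_ready TIMEOUT_SECONDS COMMAND [ARGS...]
# Hypothetical helper: re-runs COMMAND every 200 ms until it exits 0
# (returns 0) or until TIMEOUT_SECONDS have elapsed (returns 1).
# The probe's own output is discarded.
wait_until_ready() {
  wur_timeout=$1
  shift
  wur_deadline=$(( $(date +%s) + wur_timeout ))
  until "$@" >/dev/null 2>&1; do
    if [ "$(date +%s)" -ge "$wur_deadline" ]; then
      return 1
    fi
    sleep 0.2
  done
  return 0
}

# Example use: only hand traffic to mysqld once SELECT 1 succeeds over TCP.
#   wait_until_ready 30 mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1" \
#     && echo "mysqld ready for SQL traffic"
```

Using a real SQL round trip as the probe checks both that the TCP endpoint accepts connections and that mysqld can answer a query, which is a stricter condition than the port merely being open.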