Bug #120629 MySQL not reachable after NDB node rejoin
Submitted: 8 Jun 8:12
Reporter: cundi fang
Status: Open
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 9.3.0
OS: Any
Assigned to:
CPU Architecture: Any
Tags: connection-refused, error-2003, mysql-startup, ndb, node-rejoin, readiness

[8 Jun 8:12] cundi fang
Description:
I observed a post-rejoin SQL availability problem in MySQL Cluster Community Server 9.3.0-cluster.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- Per cluster:
  - 1 management node
  - 4 data nodes (ndbmtd)
  - 4 SQL/API nodes (mysqld)

I first noticed this while running two clusters side by side, but the core symptom is simpler:
after a single NDB data node is restarted, mysqld may be unreachable on 127.0.0.1:3306 during the post-rejoin window.

Configuration used in the captured run (TimeBetweenGlobalCheckpoints is in milliseconds):
- baseline side:
  [ndbd default] TimeBetweenGlobalCheckpoints=2000
- mutated side:
  [ndbd default] TimeBetweenGlobalCheckpoints=20

Cluster action:
- single data node rejoin
- management command:
  ndb_mgm -e "2 RESTART"

What I did:
1. Started healthy NDB clusters.
2. Waited until SQL/API nodes were being used normally.
3. Triggered a single-node rejoin on data node 2.
4. In the post-action window, I attempted to run SQL clients against mysqld using:
   mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306
5. All SQL clients on both captured sides failed immediately with:
   ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)

What I expected:
After a single data-node rejoin, SQL/API nodes should either:
- remain reachable on 127.0.0.1:3306, or
- fail in a clearly signaled "not ready yet" state with a more explicit readiness indication.

What actually happened:
All 8 captured SQL clients failed with the same connection-refused error:
  ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)

At freeze time (when the run was captured), management status still showed incomplete data-node startup:

A side:
- Node 2: starting (Last completed phase 100)
- Nodes 3,4,5: started

B side:
- Node 2: starting (Last completed phase 1)
- Nodes 3,4,5: starting (Last completed phase 8)
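
To make the status snapshots above easier to compare between runs, the nodes that are not yet started can be extracted mechanically. The following is only a sketch: `not_started` is a hypothetical helper, and it assumes that the raw output of `ndb_mgm -e "ALL STATUS"` contains one line per node in the form `Node <id>: <state> (...)`, as in the summaries above.

```shell
# not_started: read ALL STATUS-style text on stdin and print the IDs of
# nodes whose state is anything other than "started".
# Assumption: each node is reported on its own line as "Node <id>: <state> ...".
not_started() {
  awk '/^Node [0-9]+:/ {
    id = $2
    sub(/:$/, "", id)          # strip the trailing colon from "2:"
    if ($3 != "started") print id
  }'
}
```

Usage, assuming ndb_mgm is on PATH: `ndb_mgm -e "ALL STATUS" | not_started`. An empty result means every reported node is in the started state.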

So the visible symptom is:
after a single data-node rejoin, SQL traffic may be handed to mysqld before the local TCP endpoint 127.0.0.1:3306 is actually reachable.

How to repeat:
This issue appears to be timing-sensitive. The important part is to test SQL reachability immediately after a single data-node restart, during the rejoin/startup window.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd
- 4 x mysqld

Optional configuration used in the captured run:
[ndbd default]
TimeBetweenGlobalCheckpoints=20

Steps:

1. Start the cluster and wait until all data nodes are started:
   ndb_mgm -e "ALL STATUS"

2. Confirm mysqld is reachable through TCP:
   mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1"

3. From the management node, restart one data node:
   ndb_mgm -e "2 RESTART"

4. Do not wait for full cluster stabilization.
   Instead, immediately start polling both:
   - management status:
     ndb_mgm -e "ALL STATUS"
   - SQL reachability:
     mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1"

5. Repeat the SQL reachability check every 200 ms for 10-20 seconds.

For example:

for i in $(seq 1 100); do
  date -u +"%FT%T.%3NZ"
  ndb_mgm -e "ALL STATUS"
  mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1" || true
  sleep 0.2
done

Observed failing symptom in the captured run:
- mysqld traffic attempted in the post-rejoin window failed with:
  ERROR 2003 (HY000): Can't connect to MySQL server on '127.0.0.1:3306' (111)
- at approximately the same time, management status still showed node startup in progress rather than all nodes fully started

The thing to check is whether mysqld is used or health-checked by external clients too early, before TCP 127.0.0.1:3306 is actually accepting connections after a single-node rejoin.
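
When checking this, it may help to separate "nothing is listening on the port" from "the port is open but the SQL layer fails", since ERROR 2003 with (111) only indicates the former. A minimal sketch, assuming bash (it relies on bash's /dev/tcp redirection) and the same depstate credentials as above; `tcp_open` and `classify` are hypothetical helper names:

```shell
# tcp_open HOST PORT: succeed only if a TCP connection can be opened.
# Uses bash's /dev/tcp pseudo-device; the subshell closes fd 3 on exit.
tcp_open() {
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# classify HOST PORT: print
#   "tcp-closed" when no TCP connection can be opened,
#   "sql-error"  when the port is open but a trivial query fails,
#   "ok"         when the query succeeds.
classify() {
  if ! tcp_open "$1" "$2"; then
    echo "tcp-closed"
  elif ! mysql -udepstate -pdepstate -h "$1" -P "$2" -e "SELECT 1" >/dev/null 2>&1; then
    echo "sql-error"
  else
    echo "ok"
  fi
}
```

Calling `classify 127.0.0.1 3306` inside the polling loop gives one word per iteration that can be lined up against the ALL STATUS output.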

Suggested fix:
Please investigate whether SQL/API readiness is being exposed too early during the post-rejoin startup window after restarting a single NDB data node.

If mysqld is not yet reachable on 127.0.0.1:3306, it would be better to either:
1) keep the node clearly marked as not ready for SQL traffic, or
2) delay workload handoff / health success until TCP readiness is actually established.
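
As an illustration of option 2 on the client side, a harness could gate workload handoff on a successful query rather than on process start. This is a sketch of a workaround under stated assumptions, not a proposal for how the server should implement readiness; `wait_until` and `sql_ready` are hypothetical names.

```shell
# wait_until TRIES INTERVAL CMD...: run CMD repeatedly until it succeeds,
# giving up after TRIES attempts with INTERVAL seconds between attempts.
wait_until() {
  tries=$1; interval=$2; shift 2
  i=0
  while [ "$i" -lt "$tries" ]; do
    if "$@" >/dev/null 2>&1; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

# Hypothetical readiness check: mysqld answers a trivial query over TCP.
sql_ready() {
  mysql -udepstate -pdepstate -h 127.0.0.1 -P 3306 -e "SELECT 1"
}
```

For example, `wait_until 100 0.2 sql_ready && echo "handoff allowed"` polls at the same 200 ms interval used in the steps above and starts the workload only once a query has succeeded.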