Bug #120232 NDB 9.3.0-cluster may briefly lose stable SQL/API socket readiness after full-cluster restart
Submitted: 8 Apr 7:07
Reporter: CunDi Fang
Status: Open
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: MySQL NDB Cluster 9.3.0-cluster    OS: Linux
Assigned to:    CPU Architecture: Any
Tags: checkpoint, ndb, readiness, restart, socket, TimeBetweenGlobalCheckpoints

[8 Apr 7:07] CunDi Fang
Description:
I observed brief post-restart SQL/API readiness instability in MySQL Cluster Community Server 9.3.0-cluster after changing a checkpoint-related NDB parameter.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- 1 management node, 4 data nodes (ndbmtd), 4 SQL/API nodes (mysqld)
- Local SQL socket on each SQL/API node:
  /var/run/mysqld/mysqld.sock
- Test database:
  depstate_ab

What I did:
1. Started a healthy NDB cluster.
2. Verified that all data nodes were started and that socket-based SQL queries worked on each SQL/API node.
3. Ran a short baseline workload successfully.
4. Changed this config in [ndbd default]:
   TimeBetweenGlobalCheckpoints = 1500
   (old value was 2000)
5. Restarted the cluster.
6. Waited for the SQL/API nodes to report:
   - "all storage nodes connected"
   - "ready for connections"
7. Immediately after restart, before sending the next workload, I checked short-interval readiness on the SQL/API nodes.

What I expected:
After all SQL/API nodes report "all storage nodes connected" and "ready for connections", local socket-based SQL readiness should remain stable.

What I observed:
All 4 SQL/API nodes on the restarted cluster reported successful startup in the same 1-second window:

- ndb3:
  2026-04-02T23:36:59.290510Z  all storage nodes connected
  2026-04-02T23:36:59.518929Z  ready for connections

- ndb4:
  2026-04-02T23:36:59.338113Z  all storage nodes connected
  2026-04-02T23:36:59.583944Z  ready for connections

- ndb2:
  2026-04-02T23:36:59.341932Z  all storage nodes connected
  2026-04-02T23:36:59.571135Z  ready for connections

- ndb1:
  2026-04-02T23:36:59.361833Z  all storage nodes connected
  2026-04-02T23:36:59.602983Z  ready for connections

However, an immediate 3-sample post-restart stability check still failed:
- 2 consecutive samples were OK
- the final sample failed with:
  ndb4:no-live-mysql-socket-found

This aborted the next workload dispatch before any new SQL was sent.
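The stability check can be sketched as follows (a minimal sketch: `check_stable`, the pluggable probe command, and the fixed node list are illustrative stand-ins for my harness; the real per-node probe was a `docker exec` running `mysql` over the local socket):

```shell
#!/bin/sh
# Sketch of the 3-sample stability check. The per-node probe command is
# passed in as arguments; in the real run it was roughly:
#   docker exec <container-for-$n> mysql --protocol=SOCKET -uroot \
#     -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"
check_stable() {   # $1 = sample count, $2 = delay between samples (s), rest = probe
  samples=$1; delay=$2; shift 2
  s=1
  while [ "$s" -le "$samples" ]; do
    for n in ndb1 ndb2 ndb3 ndb4; do
      if ! "$@" "$n" >/dev/null 2>&1; then
        # same failure string as in the captured run
        echo "$n:no-live-mysql-socket-found"
        return 1   # abort before dispatching the next workload
      fi
    done
    s=$((s + 1))
    sleep "$delay"
  done
  return 0
}
```

Invoked as, e.g., `check_stable 3 1 probe_node`, where `probe_node` wraps the docker exec + mysql probe above; any sample in which a node is not queryable aborts the run.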

A later diagnostic snapshot already showed the cluster healthy again:
- ndb_mgm ALL STATUS showed all data nodes started
- all 4 SQL/API nodes had a live local socket
- local SELECT 1 succeeded on all 4 SQL/API nodes

So this does not look like a permanent outage. It looks like a short readiness flap immediately after restart: the SQL/API nodes already report ready, but local socket/query readiness is still not stable enough for immediate use.

How to repeat:
This issue appears to be timing-sensitive. The key is to probe the SQL/API nodes immediately after restart.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd data nodes
- 4 x mysqld SQL/API nodes
- Each SQL/API node has local socket:
  /var/run/mysqld/mysqld.sock

Initial config:
[ndbd default]
TimeBetweenGlobalCheckpoints=2000

1. Start the cluster and wait until:
   ndb_mgm -e "ALL STATUS"
   shows all data nodes as started.

2. On each SQL/API node, verify:
   mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -e "SELECT 1;"

3. Run a small NDB workload, for example:
   CREATE DATABASE IF NOT EXISTS depstate_ab;
   USE depstate_ab;
   CREATE TABLE dep_pair_kv (
     id INT PRIMARY KEY,
     value INT,
     note VARCHAR(64)
   ) ENGINE=NDBCLUSTER;
   DELETE FROM dep_pair_kv;
   INSERT INTO dep_pair_kv VALUES (1,538,'seed_0'),(7,946,'seed_6'),(13,420,'seed_12');
   SELECT COUNT(*) FROM dep_pair_kv;

4. Stop the cluster cleanly.

5. Change config:
   [ndbd default]
   TimeBetweenGlobalCheckpoints=1500

6. Start the cluster again.

7. Watch the mysqld logs on all 4 SQL/API nodes. As soon as they begin printing:
   "all storage nodes connected"
   and
   "ready for connections"
   start polling immediately for 5-15 seconds.

8. During that post-restart window, repeatedly run:
   a) on management node:
      ndb_mgm -e "ALL STATUS"
   b) on each SQL/API node:
      ls -l /var/run/mysqld/mysqld.sock
   c) on each SQL/API node:
      mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"

A practical way is to poll every 200 ms:

for i in $(seq 1 75); do
  date -u +"%FT%T.%3NZ"
  ndb_mgm -e "ALL STATUS"
  for n in ndb1 ndb2 ndb3 ndb4; do
    docker exec <container-for-$n> bash -lc '
      ts=$(date -u +"%FT%T.%3NZ")
      sock=/var/run/mysqld/mysqld.sock
      if [ -S "$sock" ]; then
        mysql --protocol=SOCKET -uroot -S "$sock" -Nse "SELECT 1" >/tmp/out.$$ 2>/tmp/err.$$; rc=$?
        echo "$ts $(hostname) socket=present rc=$rc out=$(cat /tmp/out.$$ 2>/dev/null) err=$(tr "\n" " " </tmp/err.$$ 2>/dev/null)"
        rm -f /tmp/out.$$ /tmp/err.$$
      else
        echo "$ts $(hostname) socket=missing"
      fi'
  done
  sleep 0.2
done
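To pull the flap out of the captured poll output afterwards, a simple filter over the lines in the loop's output format works (the sample lines below are illustrative, not from the real run):

```shell
#!/bin/sh
# Keep only failing samples from the poll loop's output:
# either the socket was missing, or mysql exited nonzero.
filter_flaps() { grep -E 'socket=missing|rc=[1-9]'; }

# Illustrative lines in the poll loop's output format:
printf '%s\n' \
  '2026-04-02T23:37:00.100Z ndb1 socket=present rc=0 out=1 err=' \
  '2026-04-02T23:37:00.120Z ndb4 socket=missing' | filter_flaps
```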

Observed failing run:
- All 4 SQL/API nodes logged startup success around:
  2026-04-02T23:36:59.290Z to 2026-04-02T23:36:59.603Z
- In that immediate post-restart window, one SQL/API node (ndb4 in the captured run) briefly failed the local socket readiness check:
  no-live-mysql-socket-found
- A later snapshot already showed the node healthy again.

So the symptom to look for is:
a very short post-restart window in which a SQL/API node is not yet stably queryable even though all nodes have already logged "all storage nodes connected" and "ready for connections".

Suggested fix:
Please check whether SQL/API readiness is being reported slightly too early after a full cluster restart when TimeBetweenGlobalCheckpoints is changed.

The SQL/API node appears to log "all storage nodes connected" and "ready for connections" before local socket/query readiness is fully stable. A stricter readiness transition, or a short stabilization barrier before reporting ready, may help.
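On the client side, such a barrier could be approximated by requiring N consecutive successful probes before treating a node as ready (a sketch; `wait_stable` and its parameters are illustrative, not an NDB API):

```shell
#!/bin/sh
# Stabilization barrier: require N consecutive successful probes before
# treating a SQL/API node as ready. The probe command is passed in as
# arguments; in practice it would be the local-socket "SELECT 1" probe.
wait_stable() {    # $1 = consecutive successes needed, $2 = max attempts,
                   # $3 = delay between probes (s), rest = probe command
  need=$1; max=$2; delay=$3; shift 3
  ok=0; i=0
  while [ "$i" -lt "$max" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
      if [ "$ok" -ge "$need" ]; then return 0; fi
    else
      ok=0           # any failure resets the consecutive counter
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

For example, `wait_stable 5 75 0.2 mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"` would have held back the workload until the node answered 5 probes in a row.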