Bug #120230 brief post-restart SQL/API readiness instability after TimeBetweenEpochs change
Submitted: 8 Apr 4:35
Reporter: CunDi Fang
Status: Open
Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: MySQL NDB Cluster 9.3.0-cluster
OS: Any
Assigned to:
CPU Architecture: Any
Tags: availability, checkpoint, ndb, readiness, restart, socket, timebetweenepochs

[8 Apr 4:35] CunDi Fang
Description:
I observed a brief post-restart readiness/availability instability on the SQL/API nodes of MySQL Cluster Community Server 9.3.0-cluster.

Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- 1 management node, 4 data nodes (ndbmtd), 4 SQL/API nodes (mysqld)
- Local SQL socket path on each SQL/API node:
  /var/run/mysqld/mysqld.sock
- Test database:
  depstate_ab

I originally captured this while comparing two independent clusters side by side, but the core symptom concerns a single restarted cluster and does not depend on the differential harness.

What I did:
1. Started a healthy NDB cluster and verified that all data nodes were started.
2. Verified that all 4 SQL/API nodes could execute local socket-based SQL successfully.
3. Ran a short baseline workload successfully.
4. Changed this config in [ndbd default]:
   TimeBetweenEpochs = 75
   (old value was 100)
5. Restarted the cluster.
6. After restart, I waited until the SQL/API nodes began reporting:
   - "all storage nodes connected"
   - "ready for connections"
7. Immediately in that post-restart window, I checked socket/query availability on all SQL/API nodes.
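
For reference, the per-node availability check in step 7 was essentially of the following shape (a minimal sketch of my harness's probe, run on each SQL/API node; the exact wrapper around it is specific to my setup):

# probe the local mysqld socket and a trivial query on one SQL/API node
SOCK=/var/run/mysqld/mysqld.sock
if [ -S "$SOCK" ] && mysql --protocol=SOCKET -uroot -S "$SOCK" -Nse "SELECT 1" >/dev/null 2>&1; then
  echo "$(hostname): sql-available"
else
  echo "$(hostname): no-live-mysql-socket-found"
fi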

What I expected:
Once the cluster reports that all storage nodes are connected and each SQL/API node reports "ready for connections", local socket-based SQL availability should remain stable.

What I observed:
In the captured failing run, all 4 SQL/API nodes on the restarted cluster reported successful startup in a very tight time window:

- ndb2:
  2026-04-07T13:06:19.265764Z  all storage nodes connected
  2026-04-07T13:06:19.728736Z  ready for connections

- ndb4:
  2026-04-07T13:06:19.823562Z  all storage nodes connected
  2026-04-07T13:06:20.085269Z  ready for connections

- ndb1:
  2026-04-07T13:06:20.130341Z  all storage nodes connected
  2026-04-07T13:06:20.601291Z  ready for connections

- ndb3:
  2026-04-07T13:06:20.214853Z  all storage nodes connected
  2026-04-07T13:06:20.630818Z  ready for connections

However, immediately afterward, a 3-sample stability check still failed because one SQL/API node briefly lost local socket availability:

- last failing sample:
  depstate-b-ndb4:no-live-mysql-socket-found
- consecutive successful samples before failure:
  2 out of 3

This caused the next workload dispatch to abort before any new SQL was sent.
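
For context, the stability check is plain harness logic layered on top of the per-node probe above, not anything shipped with NDB: it requires 3 consecutive samples in which every SQL/API node answers over its local socket before the next workload batch is dispatched. A minimal sketch (check_all_sql_nodes is a hypothetical helper that runs the per-node probe on all 4 SQL/API nodes):

# stability gate from my test harness (illustrative only)
# require 3 consecutive all-nodes-available samples before dispatching SQL
required=3
good=0
for sample in 1 2 3; do
  if check_all_sql_nodes; then   # hypothetical helper wrapping the per-node probe
    good=$((good + 1))
  else
    echo "stability check failed at sample $sample ($good/$required good samples)"
    exit 1                       # this is where the workload dispatch aborted
  fi
  sleep 1
done
echo "stable: $good/$required consecutive good samples"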

Interestingly, a later diagnostic snapshot already showed the node healthy again:
- local socket present
- local SELECT 1 successful
- ndb_mgm ALL STATUS showed all data nodes started

So this appears to be a short post-restart readiness / availability flap, not a permanent outage.

The problem is that the cluster appears ready according to startup logs, but local SQL/API availability is not yet stable enough for immediate use.

How to repeat:
This is timing-sensitive, so the important part is to probe the SQL/API nodes immediately after restart.

Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd data nodes
- 4 x mysqld SQL/API nodes
- Each SQL/API node has a local socket at:
  /var/run/mysqld/mysqld.sock

Initial config:
[ndbd default]
TimeBetweenEpochs=100

1. Start the cluster and wait until:
   ndb_mgm -e "ALL STATUS"
   shows all data nodes as started.

2. On each SQL/API node, verify:
   mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -e "SELECT 1;"

3. Optionally run a small baseline NDB workload, for example:
   CREATE DATABASE IF NOT EXISTS depstate_ab;
   CREATE TABLE depstate_ab.dep_pair_kv (
     id INT PRIMARY KEY,
     value INT,
     note VARCHAR(64)
   ) ENGINE=NDBCLUSTER;
   DELETE FROM depstate_ab.dep_pair_kv;
   INSERT INTO depstate_ab.dep_pair_kv VALUES (1,736,'seed_0');
   SELECT * FROM depstate_ab.dep_pair_kv;

4. Stop the cluster cleanly.

5. Change config:
   [ndbd default]
   TimeBetweenEpochs=75

6. Start the cluster again.

7. As soon as the SQL/API nodes begin to print startup success messages such as:
   "all storage nodes connected"
   and
   "ready for connections"
   start polling immediately for 5-15 seconds.

8. During that 5-15 second post-restart window, repeatedly check:
   a) management status:
      ndb_mgm -e "ALL STATUS"
   b) on each SQL/API node:
      ls -l /var/run/mysqld/mysqld.sock
   c) on each SQL/API node:
      mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"

A simple way to probe the post-restart window is to poll every 200 ms:

# poll management status and per-node socket/query availability every 200 ms for ~15 s
for i in $(seq 1 75); do
  date -u +"%FT%T.%3NZ"
  ndb_mgm -e "ALL STATUS"
  for n in ndb1 ndb2 ndb3 ndb4; do
    docker exec <container-for-$n> bash -lc '
      ts=$(date -u +"%FT%T.%3NZ")
      sock=/var/run/mysqld/mysqld.sock
      if [ -S "$sock" ]; then
        # socket exists: try a trivial query and report its result and any error text
        mysql --protocol=SOCKET -uroot -S "$sock" -Nse "SELECT 1" >/tmp/out.$$ 2>/tmp/err.$$; rc=$?
        echo "$ts $(hostname) socket=present rc=$rc out=$(cat /tmp/out.$$ 2>/dev/null) err=$(tr "\n" " " </tmp/err.$$ 2>/dev/null)"
        rm -f /tmp/out.$$ /tmp/err.$$
      else
        echo "$ts $(hostname) socket=missing"
      fi'
  done
  sleep 0.2
done

Observed symptom:
Even after nodes have already logged "all storage nodes connected" and "ready for connections", one SQL/API node may briefly become non-queryable or lose visible local socket availability in the immediate post-restart window.

In the captured run, the affected node was the SQL/API node corresponding to ndb4.

Suggested fix:
Please check whether post-restart readiness is being reported too early for SQL/API nodes after a full cluster restart following a TimeBetweenEpochs change.

If a SQL/API node is not yet stably queryable via its local socket, it may be better to delay the point at which the node reports fully ready / fully connected, or to serialize the readiness transition more strictly.
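
As a possible interim workaround on the client/orchestration side (not a fix for the readiness reporting itself), workload dispatch could be gated on several consecutive successful local-socket probes rather than on the "ready for connections" log line. A minimal sketch, assuming a 1-second probe interval and a threshold of 5 consecutive successes (both values arbitrary):

# wait until the local mysqld has answered SELECT 1 over its socket
# 5 times in a row before treating this SQL/API node as stably ready
# (illustrative workaround only; interval and threshold are arbitrary)
SOCK=/var/run/mysqld/mysqld.sock
needed=5
good=0
while [ "$good" -lt "$needed" ]; do
  if mysql --protocol=SOCKET -uroot -S "$SOCK" -Nse "SELECT 1" >/dev/null 2>&1; then
    good=$((good + 1))
  else
    good=0   # reset on any failed probe, so only an uninterrupted run counts
  fi
  sleep 1
done
echo "$(hostname): stably ready"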