Description:
I observed a brief period of SQL/API readiness instability immediately after a cluster restart in MySQL Cluster Community Server 9.3.0-cluster, following a change to a checkpoint-related NDB parameter.
Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- 1 management node, 4 data nodes (ndbmtd), 4 SQL/API nodes (mysqld)
- Local SQL socket on each SQL/API node:
/var/run/mysqld/mysqld.sock
- Test database:
depstate_ab
What I did:
1. Started a healthy NDB cluster.
2. Verified that all data nodes were started and local socket SQL worked.
3. Ran a short baseline workload successfully.
4. Changed this config in [ndbd default]:
TimeBetweenGlobalCheckpoints = 1500
(old value was 2000)
5. Restarted the cluster.
6. Waited for the SQL/API nodes to report:
- "all storage nodes connected"
- "ready for connections"
7. Immediately after restart, before sending the next workload, I checked short-interval readiness on the SQL/API nodes.
What I expected:
After all SQL/API nodes report "all storage nodes connected" and "ready for connections", local socket-based SQL readiness should remain stable.
What I observed:
All 4 SQL/API nodes on the restarted cluster reported successful startup in the same 1-second window:
- ndb3:
2026-04-02T23:36:59.290510Z all storage nodes connected
2026-04-02T23:36:59.518929Z ready for connections
- ndb4:
2026-04-02T23:36:59.338113Z all storage nodes connected
2026-04-02T23:36:59.583944Z ready for connections
- ndb2:
2026-04-02T23:36:59.341932Z all storage nodes connected
2026-04-02T23:36:59.571135Z ready for connections
- ndb1:
2026-04-02T23:36:59.361833Z all storage nodes connected
2026-04-02T23:36:59.602983Z ready for connections
However, an immediate 3-sample post-restart stability check still failed:
- 2 consecutive samples were OK
- the last failing sample was:
ndb4:no-live-mysql-socket-found
This aborted the next workload dispatch before any new SQL was sent.
A later diagnostic snapshot already showed the cluster healthy again:
- ndb_mgm ALL STATUS showed all data nodes started
- all 4 SQL/API nodes had a live local socket
- local SELECT 1 succeeded on all 4 SQL/API nodes
So this does not look like a permanent outage. It looks like a short readiness flap immediately after restart: the SQL/API nodes already report ready, but local socket/query readiness is still not stable enough for immediate use.
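For reference, the 3-sample stability check described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the actual checker: the function name `stability_check` and the probe-as-argument shape are my assumptions.

```shell
# Hypothetical sketch of the 3-sample stability check (not the real checker):
# take N samples of a probe command and abort on the first failing sample,
# which is what halted workload dispatch in the run above.
stability_check() {
    samples=$1; shift            # number of samples to take
    i=1
    while [ "$i" -le "$samples" ]; do
        if ! "$@"; then          # probe command, e.g. the socket + SELECT 1 check
            echo "sample $i failed" >&2
            return 1             # first failure aborts, as in the captured run
        fi
        i=$((i + 1))
        sleep 0.2                # short interval between samples
    done
    return 0
}

# Example with a stand-in probe; the real probe would be the local
# socket existence + SELECT 1 check shown in "How to repeat".
stability_check 3 true && echo "all samples OK"
```

In the failing run, the equivalent of this check saw two OK samples and then "ndb4:no-live-mysql-socket-found" on the third.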
How to repeat:
This issue appears to be timing-sensitive. The key is to probe the SQL/API nodes immediately after restart.
Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd data nodes
- 4 x mysqld SQL/API nodes
- Each SQL/API node has local socket:
/var/run/mysqld/mysqld.sock
Initial config:
[ndbd default]
TimeBetweenGlobalCheckpoints=2000
1. Start the cluster and wait until:
ndb_mgm -e "ALL STATUS"
shows all data nodes as started.
2. On each SQL/API node, verify:
mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -e "SELECT 1;"
3. Run a small NDB workload, for example:
CREATE DATABASE IF NOT EXISTS depstate_ab;
USE depstate_ab;
CREATE TABLE dep_pair_kv (
id INT PRIMARY KEY,
value INT,
note VARCHAR(64)
) ENGINE=NDBCLUSTER;
DELETE FROM dep_pair_kv;
INSERT INTO dep_pair_kv VALUES (1,538,'seed_0'),(7,946,'seed_6'),(13,420,'seed_12');
SELECT COUNT(*) FROM dep_pair_kv;
4. Stop the cluster cleanly.
5. Change config:
[ndbd default]
TimeBetweenGlobalCheckpoints=1500
6. Start the cluster again.
7. Watch the mysqld logs on all 4 SQL/API nodes. As soon as they begin printing:
"all storage nodes connected"
and
"ready for connections"
start polling immediately for 5-15 seconds.
8. During that post-restart window, repeatedly run:
a) on management node:
ndb_mgm -e "ALL STATUS"
b) on each SQL/API node:
ls -l /var/run/mysqld/mysqld.sock
c) on each SQL/API node:
mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"
A practical way is to poll every 200 ms:
for i in $(seq 1 75); do
  date -u +"%FT%T.%3NZ"
  ndb_mgm -e "ALL STATUS"
  for n in ndb1 ndb2 ndb3 ndb4; do
    docker exec <container-for-$n> bash -lc '
      ts=$(date -u +"%FT%T.%3NZ")
      sock=/var/run/mysqld/mysqld.sock
      if [ -S "$sock" ]; then
        mysql --protocol=SOCKET -uroot -S "$sock" -Nse "SELECT 1" >/tmp/out.$$ 2>/tmp/err.$$; rc=$?
        echo "$ts $(hostname) socket=present rc=$rc out=$(cat /tmp/out.$$ 2>/dev/null) err=$(tr "\n" " " </tmp/err.$$ 2>/dev/null)"
        rm -f /tmp/out.$$ /tmp/err.$$
      else
        echo "$ts $(hostname) socket=missing"
      fi'
  done
  sleep 0.2
done
Note: the original format string "%FT%TZ.%3N" put the Z before the milliseconds; "%FT%T.%3NZ" matches the mysqld log timestamps shown above (%N requires GNU date, which the Linux/Docker environment provides).
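If the poll output is captured to a file (poll.log is an assumed name), the not-ready samples can be pulled out with a simple filter. The sample lines below are illustrative stand-ins, not from the captured run:

```shell
# Illustrative sample of captured poll output (not real data from the run).
cat > poll.log <<'EOF'
2026-04-02T23:37:00.100Z ndb4 socket=missing
2026-04-02T23:37:00.300Z ndb4 socket=present rc=1 out= err=ERROR 2002
2026-04-02T23:37:00.500Z ndb4 socket=present rc=0 out=1 err=
EOF

# Keep only samples where a node was not fully ready:
# socket absent, or mysql exited non-zero.
grep -E 'socket=missing|rc=[1-9]' poll.log
```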
Observed failing run:
- All 4 SQL/API nodes logged startup success around:
2026-04-02T23:36:59.290Z to 2026-04-02T23:36:59.603Z
- In that immediate post-restart window, one SQL/API node (ndb4 in the captured run) briefly failed the local socket readiness check:
no-live-mysql-socket-found
- A later snapshot already showed the node healthy again.
So the symptom to look for is:
a very short post-restart window in which a SQL/API node is not yet stably queryable even though all nodes have already logged "all storage nodes connected" and "ready for connections".
Suggested fix:
Please check whether SQL/API readiness is being reported slightly too early after a full cluster restart when TimeBetweenGlobalCheckpoints is changed.
The SQL/API node appears to log "all storage nodes connected" and "ready for connections" before local socket/query readiness is fully stable. A stricter readiness transition, or a short stabilization barrier before reporting ready, may help.
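Until then, a client-side stabilization barrier can approximate this: require K consecutive successful probes (with a retry budget) before dispatching the workload, instead of aborting on the first failed sample. A minimal sketch, with `wait_until_stable` as a hypothetical helper name:

```shell
# Hypothetical client-side stabilization barrier: succeed only after
# `needed` consecutive OK probes; give up after `max_tries` probes.
wait_until_stable() {
    needed=$1; max_tries=$2; shift 2
    ok=0; tries=0
    while [ "$ok" -lt "$needed" ]; do
        tries=$((tries + 1))
        [ "$tries" -gt "$max_tries" ] && return 1   # retry budget exhausted
        if "$@"; then
            ok=$((ok + 1))       # one more consecutive success
        else
            ok=0                 # any failure resets the streak
        fi
        sleep 0.2
    done
    return 0
}

# Example with a stand-in probe; the real probe would be the
# socket + SELECT 1 check on each SQL/API node.
wait_until_stable 3 25 true && echo "stable, safe to dispatch workload"
```

With a 200 ms interval, `wait_until_stable 3 25 <probe>` tolerates flaps within roughly the 5-second window observed here while still bounding the wait.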