Description:
I observed post-restart readiness/availability instability in MySQL Cluster Community Server 9.3.0-cluster.
Environment:
- MySQL Cluster Community Server 9.3.0-cluster
- Linux / Docker-based environment
- 1 management node, 4 data nodes (ndbmtd), 4 SQL/API nodes (mysqld)
- Local SQL socket path on each SQL/API node:
/var/run/mysqld/mysqld.sock
- Test database:
depstate_ab
I originally captured this while comparing two independent clusters side by side, but the core symptom concerns a single restarted cluster and does not depend on the differential harness.
What I did:
1. Started a healthy NDB cluster and verified that all data nodes were started.
2. Verified that all 4 SQL/API nodes could execute local socket-based SQL successfully.
3. Ran a short baseline workload successfully.
4. Changed this config in [ndbd default]:
TimeBetweenEpochs = 75
(old value was 100)
5. Restarted the cluster.
6. After restart, I waited until the SQL/API nodes began reporting:
- "all storage nodes connected"
- "ready for connections"
7. Immediately in that post-restart window, I checked socket/query availability on all SQL/API nodes.
What I expected:
Once the cluster reports that all storage nodes are connected and each SQL/API node reports "ready for connections", local socket-based SQL availability should remain stable.
What I observed:
In the captured failing run, all 4 SQL/API nodes on the restarted cluster reported successful startup in a very tight time window:
- ndb2:
2026-04-07T13:06:19.265764Z all storage nodes connected
2026-04-07T13:06:19.728736Z ready for connections
- ndb4:
2026-04-07T13:06:19.823562Z all storage nodes connected
2026-04-07T13:06:20.085269Z ready for connections
- ndb1:
2026-04-07T13:06:20.130341Z all storage nodes connected
2026-04-07T13:06:20.601291Z ready for connections
- ndb3:
2026-04-07T13:06:20.214853Z all storage nodes connected
2026-04-07T13:06:20.630818Z ready for connections
However, immediately afterward, a 3-sample stability check still failed because one SQL/API node briefly lost local socket availability:
- last failing sample:
depstate-b-ndb4:no-live-mysql-socket-found
- consecutive successful samples before failure:
2 out of 3
This caused the next workload dispatch to abort before any new SQL was sent.
Interestingly, a later diagnostic snapshot already showed the node healthy again:
- local socket present
- local SELECT 1 successful
- ndb_mgm ALL STATUS showed all data nodes started
So this appears to be a short post-restart readiness / availability flap, not a permanent outage.
The problem is that the cluster appears ready according to startup logs, but local SQL/API availability is not yet stable enough for immediate use.
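For clarity, the 3-sample stability gate that aborted the dispatch behaves roughly like the sketch below. The harness internals are not part of this report, so the probe body, counts, and variable names are illustrative; probe is a stand-in that always succeeds so the sketch runs standalone, with the real per-node check shown in a comment.

```shell
# Hedged sketch of the 3-sample stability gate described above: the
# dispatcher requires N consecutive successful samples before sending
# new SQL, and a single failing sample resets the streak (the captured
# run aborted at 2 out of 3).
probe() {
  # Real check, per SQL/API node, would be something like:
  #   [ -S /var/run/mysqld/mysqld.sock ] &&
  #   mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"
  true
}

required=3        # consecutive successful samples needed
streak=0
attempts=0
max_attempts=10

while [ "$streak" -lt "$required" ] && [ "$attempts" -lt "$max_attempts" ]; do
  attempts=$((attempts + 1))
  if probe; then
    streak=$((streak + 1))
  else
    streak=0      # one failing sample wipes out prior successes
  fi
done

echo "streak=$streak attempts=$attempts"
```

With a probe that flaps even once near the end of the window, the streak resets and the dispatcher aborts exactly as observed.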
How to repeat:
This is timing-sensitive, so the important part is to probe the SQL/API nodes immediately after restart.
Setup:
- MySQL Cluster Community Server 9.3.0-cluster
- 1 x ndb_mgmd
- 4 x ndbmtd data nodes
- 4 x mysqld SQL/API nodes
- Each SQL/API node has a local socket at:
/var/run/mysqld/mysqld.sock
Initial config:
[ndbd default]
TimeBetweenEpochs=100
1. Start the cluster and wait until:
ndb_mgm -e "ALL STATUS"
shows all data nodes as started.
2. On each SQL/API node, verify:
mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -e "SELECT 1;"
3. Optionally run a small baseline NDB workload, for example:
CREATE DATABASE IF NOT EXISTS depstate_ab;
CREATE TABLE depstate_ab.dep_pair_kv (
id INT PRIMARY KEY,
value INT,
note VARCHAR(64)
) ENGINE=NDBCLUSTER;
DELETE FROM depstate_ab.dep_pair_kv;
INSERT INTO depstate_ab.dep_pair_kv VALUES (1,736,'seed_0');
SELECT * FROM depstate_ab.dep_pair_kv;
4. Stop the cluster cleanly.
5. Change config:
[ndbd default]
TimeBetweenEpochs=75
6. Start the cluster again.
7. As soon as the SQL/API nodes begin to print startup success messages such as:
"all storage nodes connected"
and
"ready for connections"
start polling immediately for 5-15 seconds.
8. During that 5-15 second post-restart window, repeatedly check:
a) management status:
ndb_mgm -e "ALL STATUS"
b) on each SQL/API node:
ls -l /var/run/mysqld/mysqld.sock
c) on each SQL/API node:
mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"
A simple way to probe the post-restart window is to poll every 200 ms:
for i in $(seq 1 75); do    # 75 samples x 200 ms ~= 15 s window
  date -u +"%FT%T.%3NZ"     # milliseconds go before the trailing Z
  ndb_mgm -e "ALL STATUS"
  for n in ndb1 ndb2 ndb3 ndb4; do
    docker exec <container-for-$n> bash -lc '
      ts=$(date -u +"%FT%T.%3NZ")
      sock=/var/run/mysqld/mysqld.sock
      if [ -S "$sock" ]; then
        mysql --protocol=SOCKET -uroot -S "$sock" -Nse "SELECT 1" >/tmp/out.$$ 2>/tmp/err.$$; rc=$?
        echo "$ts $(hostname) socket=present rc=$rc out=$(cat /tmp/out.$$ 2>/dev/null) err=$(tr \"\\n\" \" \" </tmp/err.$$ 2>/dev/null)"
        rm -f /tmp/out.$$ /tmp/err.$$
      else
        echo "$ts $(hostname) socket=missing"
      fi'
  done
  sleep 0.2
done
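For steps 1 and 8a, the "wait until all data nodes are started" condition can be sketched as below. ndb_mgm_status is a local stub so the sketch runs standalone; replace it with the real `ndb_mgm -e "ALL STATUS"` call. The "started" grep pattern is an assumption; adjust it to the exact ALL STATUS output of your version.

```shell
# Hedged sketch: poll management status until every data node reports
# started. ndb_mgm_status is a stub standing in for:
#   ndb_mgm -e "ALL STATUS"
ndb_mgm_status() {
  cat <<'EOF'
Node 2: started
Node 3: started
Node 4: started
Node 5: started
EOF
}

expected=4                 # number of ndbmtd data nodes in this setup
started=0
for attempt in 1 2 3 4 5; do
  started=$(ndb_mgm_status | grep -ci "started")
  [ "$started" -ge "$expected" ] && break
  sleep 1
done
echo "data nodes started: $started"
```

Note that, per this report, this condition alone is not sufficient: ALL STATUS can show all data nodes started while an SQL/API node's local socket is still flapping.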
Observed symptom:
Even after nodes have already logged "all storage nodes connected" and "ready for connections", one SQL/API node may briefly become non-queryable or lose visible local socket availability in the immediate post-restart window.
In the captured run, the affected node was the SQL/API node corresponding to ndb4.
Suggested fix:
Please check whether post-restart readiness is being reported too early for SQL/API nodes after a full cluster restart following a TimeBetweenEpochs change.
If a SQL/API node is not yet stably queryable via its local socket, it may be better to delay the point at which the node reports fully ready / fully connected, or to serialize the readiness transition more strictly.
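Until the readiness transition is tightened server-side, a client-side mitigation is to retry the local-socket probe with a short backoff before dispatching workload, so a brief flap does not abort the run. In the sketch below, run_probe is a stub that fails twice and then succeeds, simulating the observed flap; the real probe command is shown in a comment, and the attempt budget mirrors the report's ~15 s / 200 ms polling window.

```shell
# Hedged client-side mitigation sketch (a workaround, not a server fix):
# retry the per-node socket probe before aborting workload dispatch.
count=0
run_probe() {
  # Real probe would be something like:
  #   [ -S /var/run/mysqld/mysqld.sock ] &&
  #   mysql --protocol=SOCKET -uroot -S /var/run/mysqld/mysqld.sock -Nse "SELECT 1"
  count=$((count + 1))
  [ "$count" -ge 3 ]   # stub: fail twice, then succeed (simulated flap)
}

attempts=0
max_attempts=75        # 75 x 200 ms = ~15 s, matching the report's window
ok=0
while [ "$attempts" -lt "$max_attempts" ]; do
  attempts=$((attempts + 1))
  if run_probe; then
    ok=1
    break
  fi
  sleep 0.2
done
echo "ok=$ok attempts=$attempts"
```

This only masks the symptom for clients; the underlying question of when a node should report "ready for connections" remains for the server side.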