Bug #20859 ndbd node fails to recover
Submitted: 5 Jul 2006 6:10 Modified: 7 Aug 2006 14:18
Reporter: David Abbott
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 5.0.22 OS: Linux (Red Hat Linux ES4)
Assigned to: CPU Architecture: Any

[5 Jul 2006 6:10] David Abbott
A MySQL 5.0.22 SQL Server node started reporting 241 and 1204 errors, e.g.
"Got temporary error 1204 'Temporary failure, distribution changed'"

Restarting the ndbd server that also runs on this node failed and generated
stack traces (attached).

How to repeat:
Unable to repeat; the ndbd node is still down.

Suggested fix:
Will probably re-build with ndbd --initial.
[5 Jul 2006 6:11] David Abbott
ndbd and mysql logs

Attachment: ndblogs.tgz (application/octet-stream, text), 78.62 KiB.

[5 Jul 2006 6:13] David Abbott
cluster config.ini

Attachment: config.ini (application/octet-stream, text), 767 bytes.

[5 Jul 2006 6:29] David Abbott
cluster log file excerpt

Attachment: ndb_1_cluster.log (application/octet-stream, text), 22.57 KiB.

[5 Jul 2006 12:50] Miguel Solorzano
Changing Category to Cluster.
[6 Jul 2006 13:37] Jonas Oreland
The cluster log (and error log) indicates heartbeat failures.

This often indicates very high load on cpu/disk/mem/network.

Can you check whether
1) there is any swapping going on (using vmstat or similar)
2) ndbd host machines have very high load (using top/vmstat or similar)
3) there has been any peak in load on the machine, for example a weekly fs-backup
   which might consume lots of memory/disk bandwidth and might have locked ndbd out.
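
Checks 1) and 2) can be sketched roughly as below. This is only a minimal illustration, not an official MySQL tool: it assumes the standard procps `vmstat` column layout (a header row containing `si`, `so` and `wa`), and the function names and the 20% I/O-wait threshold are made up for the example.

```python
import subprocess


def parse_vmstat(text):
    """Parse plain `vmstat` output into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column header row is the one naming the swap-in/swap-out columns.
        if "si" in fields and "so" in fields:
            header = fields
            continue
        # Data rows are all-numeric and have one value per header column.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, (int(f) for f in fields))))
    return rows


def swap_activity(rows):
    """Return the samples in which memory was swapped in (si) or out (so)."""
    return [r for r in rows if r["si"] > 0 or r["so"] > 0]


def high_iowait(rows, threshold=20):
    """Return samples whose I/O wait percentage (wa) meets the threshold.

    The threshold is an arbitrary illustrative value, not a recommendation.
    """
    return [r for r in rows if r.get("wa", 0) >= threshold]


def sample_vmstat(interval=1, count=5):
    """Run `vmstat <interval> <count>` and return the parsed samples.

    Note: the first sample vmstat prints is an average since boot.
    """
    result = subprocess.run(
        ["vmstat", str(interval), str(count)],
        capture_output=True, text=True, check=True,
    )
    return parse_vmstat(result.stdout)
```

Running `sample_vmstat()` on each ndbd host and passing the result to `swap_activity()` and `high_iowait()` gives a quick first indication; non-empty results suggest the host is swapping or waiting on disk, either of which could delay heartbeats.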

Otherwise, can you identify any pattern in mysqld activity when this occurs? (Reading from the log, it seems almost idle...)
[6 Jul 2006 14:43] David Abbott
Lo and behold, further investigation has revealed a disk problem. There did seem to be slow-downs in disk I/O, and now that we've tested the server in more depth, disk errors are being generated.

Possibly a need for slightly better error messages?