MySQL Bugs: #20859: ndbd node fails to recover

Bug #20859	ndbd node fails to recover
Submitted:	5 Jul 2006 6:10	Modified:	7 Aug 2006 14:18
Reporter:	David Abbott
Status:	Not a Bug
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.22	OS:	Linux (Red Hat Linux ES4)
Assigned to:		CPU Architecture:	Any

Description:
A MySQL 5.0.22 SQL Server node started reporting 241 and 1204 errors, e.g.
"Got temporary error 1204 'Temporary failure, distribution changed'"

Restarting the ndbd server, which also runs on this node, failed and generated
stack traces (attached).

How to repeat:
Unable to repeat; the ndbd node is still down.

Suggested fix:
Will probably re-build with ndbd --initial.

ndbd and mysql logs

Attachment: ndblogs.tgz (application/octet-stream, text), 78.62 KiB.

cluster config.ini

Attachment: config.ini (application/octet-stream, text), 767 bytes.

cluster log file excerpt

Attachment: ndb_1_cluster.log (application/octet-stream, text), 22.57 KiB.

Changing Category to Cluster.

The cluster log (and error log) indicates heartbeat failures.

This often indicates very high load on CPU, disk, memory, or network.

Can you check whether:
1) there is any swapping going on (using vmstat or similar)
2) the ndbd host machines are under very high load (using top/vmstat or similar)
3) there has been any peak in load on the machine, for example a weekly filesystem backup
   that might consume enough memory or disk bandwidth to have locked ndbd out.
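
As a rough illustration of checks 1) and 2), the sketch below samples the Linux swap counters and load averages directly, as an alternative to watching vmstat and top by hand. It assumes a Linux host that exposes /proc/vmstat (with the pswpin and pswpout counters) and /proc/loadavg. The function names and the 5-second interval are hypothetical choices for this example, not part of any MySQL tool.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style 'name value' lines into a dict of ints."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Return (pages swapped in, pages swapped out) between two snapshots."""
    return (
        after.get("pswpin", 0) - before.get("pswpin", 0),
        after.get("pswpout", 0) - before.get("pswpout", 0),
    )


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg content."""
    fields = text.split()
    return tuple(float(f) for f in fields[:3])


def read_file(path):
    with open(path) as f:
        return f.read()


if __name__ == "__main__":
    interval = 5  # seconds between snapshots (arbitrary, for illustration)
    before = parse_vmstat(read_file("/proc/vmstat"))
    time.sleep(interval)
    after = parse_vmstat(read_file("/proc/vmstat"))

    swapped_in, swapped_out = swap_delta(before, after)
    print("pages swapped in/out over %ds: %d/%d" % (interval, swapped_in, swapped_out))
    print("load averages (1, 5, 15 min):", parse_loadavg(read_file("/proc/loadavg")))
```

Non-zero swap deltas while the data node is running would suggest that ndbd memory is being paged out. Running the sampling from cron around the time of a scheduled backup may help with check 3).
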

Otherwise, can you identify any pattern in mysqld activity when this occurs? (Reading from the log, though, it seems almost idle...)

Lo and behold, further investigation has revealed a disk problem. There did seem to be slowdowns in disk I/O, and now that we've tested the server in more depth, disk errors are being generated.

Possibly a need for slightly better error messages?