MySQL Bugs: #68719: LCP Frag watchdog crashes NDB data node

Bug #68719	LCP Frag watchdog crashes NDB data node
Submitted:	19 Mar 2013 14:37	Modified:	10 Jan 2014 1:13
Reporter:	Patrick Zoblisein
Status:	Not a Bug
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.5.29-7.2.10	OS:	Linux (Centos 5.6)
Assigned to:		CPU Architecture:	Any

Description:
LCP Frag watchdog : Checkpoint of table 55 fragment 9 too slow (no progress for > 60 s).
2013-03-19 10:04:44 [ndbd] INFO     -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
2013-03-19 10:04:56 [ndbd] ALERT    -- Node 6: Forced node shutdown completed. Caused by error 7200: 'LCP fragment scan watchdog detected a problem.  Please report a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
Unsure at this point. The system was idle when the LCP Frag watchdog shut down the data node.

ndb_error_report bz2 archive

Attachment: ndb_error_report_20130319102906.tar.bz2 (application/octet-stream, text), 906.51 KiB.

Thank you for taking the time to write to us, but this is not a bug. 

The (cut-down) configuration of the Cluster shows:
....
[NDBD]
NodeId=3
Hostname=ndb01

[NDBD]
NodeId=4
Hostname=ndb02

[NDBD]
NodeId=5
Hostname=ndb03

[NDBD]
NodeId=6
Hostname=ndb04
....
[api]
hostname=ndb01           # Hostname or IP address
nodeid=7

[api]
hostname=ndb01           # Hostname or IP address
nodeid=8

[api]
hostname=ndb01           # Hostname or IP address
nodeid=9                 
.....

This shows the data nodes spread over 4 different servers, but the API nodes run on those same hosts (only ndb01 is shown above). On some setups this co-location is known to generate excessive I/O load, which causes the LCP fragment scans to stall and produces the LCP watchdog timeouts that you see.

Please consider moving the API nodes onto other servers, so that their load does not fall on the data node hosts.
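
As a rough sketch of that change, the [api] sections could point at machines that do not run ndbd. The hostnames api01, api02 and api03 are hypothetical placeholders and are not part of the reported configuration. The node IDs are taken from the excerpt above, and the [NDBD] sections would stay unchanged.

```ini
# Assumption: api01..api03 are separate servers that host no data node.
[api]
hostname=api01           # was ndb01
nodeid=7

[api]
hostname=api02           # was ndb01
nodeid=8

[api]
hostname=api03           # was ndb01
nodeid=9
```

Whether one extra server or several is needed depends on the API workload. The point is only that no [api] hostname matches an [NDBD] hostname.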