Description:
Hello,
my cluster performed a complete restart (all 4 NDB nodes) twice in the last 2 days, for no apparent reason. Here is some log output:
2009-11-11 13:55:40 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:43 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 8 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:47 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
After that, all 4 NDB nodes restarted, which is really bad!
The network does not seem to be the bottleneck, because all NDB nodes are attached to a dedicated Gbit switch.
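To make the pattern in the log excerpt above easier to see, the warnings can be tallied per node pair. The following is only a minimal sketch: the regular expressions are assumptions derived from the message wording shown above (not from any official log format specification), and the function name `summarize` and the `SAMPLE` list are hypothetical names chosen for illustration.

```python
import re
from collections import Counter

# Patterns inferred from the log lines quoted in this report (assumption).
HEARTBEAT = re.compile(r"Node (\d+): Node (\d+) missed heartbeat (\d+)")
SEND_BUFFER = re.compile(
    r"Node (\d+): Transporter to node (\d+) reported error 0x16"
)

# A few lines copied verbatim from the excerpt above.
SAMPLE = [
    "2009-11-11 13:55:40 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2",
    "2009-11-11 13:55:42 [MgmtSrvr] WARNING -- Node 3: Node 4 missed heartbeat 2",
    "2009-11-11 13:55:43 [MgmtSrvr] WARNING -- Node 2: Transporter to node 5 "
    "reported error 0x16: The send buffer was full, but sleeping for a while solved",
    "2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 3: Node 11 missed heartbeat 2",
    "2009-11-11 13:55:44 [MgmtSrvr] WARNING -- Node 2: Transporter to node 3 "
    "reported error 0x16: The send buffer was full, but sleeping for a while solved",
]


def summarize(lines):
    """Count warnings per (reporting node, affected node) pair.

    Returns two Counters: missed-heartbeat warnings and
    send-buffer-full (error 0x16) warnings.
    """
    missed = Counter()
    send_buffer = Counter()
    for line in lines:
        m = HEARTBEAT.search(line)
        if m:
            missed[(int(m.group(1)), int(m.group(2)))] += 1
            continue
        m = SEND_BUFFER.search(line)
        if m:
            send_buffer[(int(m.group(1)), int(m.group(2)))] += 1
    return missed, send_buffer


if __name__ == "__main__":
    missed, send_buffer = summarize(SAMPLE)
    for (reporter, peer), count in sorted(missed.items()):
        print(f"node {reporter} missed heartbeats from node {peer}: {count}")
    for (reporter, peer), count in sorted(send_buffer.items()):
        print(f"node {reporter} send buffer full towards node {peer}: {count}")
```

Run against the full cluster log, a tally like this would show whether the missed heartbeats and the send-buffer warnings concentrate on particular node pairs before the forced shutdown.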
I uploaded the data from ndb_error_reporter here (23M):
http://85.25.144.101/files/ndb_error_report_20091111135937.tar.bz2
Regards,
Robert
How to repeat:
No idea yet