Bug #48689 Restart of all NDB-nodes
Submitted: 11 Nov 2009 13:40
Modified: 31 Dec 2009 18:46
Reporter: Robert Klikics
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0
OS: Any (Debian 5.0)
Assigned to: Andrew Hutchings
CPU Architecture: Any
Tags: 7.0.9b, cluster, crash, ndb, restart, telco

[11 Nov 2009 13:40] Robert Klikics
Description:
Hello,

My cluster restarted completely (all 4 NDB nodes) twice in the last 2 days for no apparent reason. Here is some log output:

2009-11-11 13:55:40 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:42 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:43 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 3: Node 8 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 3: Node 11 missed heartbeat 2
2009-11-11 13:55:44 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 5 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:45 [MgmtSrvr] WARNING  -- Node 2: Transporter to node 3 reported error 0x16: The send buffer was full, but sleeping for a while solved
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 4 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 6 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 7 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 9 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 10 missed heartbeat 2
2009-11-11 13:55:46 [MgmtSrvr] WARNING  -- Node 3: Node 28 missed heartbeat 2
2009-11-11 13:55:47 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

After that, all 4 NDB nodes restarted, which is really bad!
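
To see at a glance which nodes report which problem in an excerpt like the one above, the warnings can be tallied per reporting node and peer. The following is only a minimal sketch, assuming the cluster log lines have exactly the layout quoted here; the function name `summarize` and the regular expressions are illustrative and not part of any MySQL Cluster tool.

```python
import re
from collections import Counter

# Layout assumed from the excerpt above:
# "<date> <time> [MgmtSrvr] <LEVEL> -- Node <reporter>: <message>"
LINE_RE = re.compile(
    r"^\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[MgmtSrvr\] "
    r"(?P<level>\w+)\s+-- Node (?P<reporter>\d+): (?P<msg>.*)$"
)
HEARTBEAT_RE = re.compile(r"^Node (?P<peer>\d+) missed heartbeat \d+")
SENDBUF_RE = re.compile(r"^Transporter to node (?P<peer>\d+) reported error 0x16:")


def summarize(lines):
    """Count missed-heartbeat and send-buffer warnings per (reporter, peer)."""
    heartbeats = Counter()
    send_buffer = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skip anything that is not a cluster log line
        reporter = int(match.group("reporter"))
        msg = match.group("msg")
        hb = HEARTBEAT_RE.match(msg)
        if hb:
            heartbeats[(reporter, int(hb.group("peer")))] += 1
            continue
        sb = SENDBUF_RE.match(msg)
        if sb:
            send_buffer[(reporter, int(sb.group("peer")))] += 1
    return heartbeats, send_buffer
```

`summarize` accepts any iterable of lines, for example an open file handle on the cluster log, and returns two counters keyed by `(reporting node, peer node)`.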

The network does not seem to be the bottleneck, because all NDB nodes are attached to a dedicated Gbit switch.

I uploaded the data from ndb_error_reporter here (23M):
http://85.25.144.101/files/ndb_error_report_20091111135937.tar.bz2

Regards,
Robert

How to repeat:
No idea yet.

[31 Dec 2009 18:46] Andrew Hutchings
Hello Robert,

The main problem appears to be network-related and looks to be due to SendBufferMemory being set too small (see the "send buffer was full" warnings in your log). Please increase this.
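
As a rough sketch of where this setting lives: SendBufferMemory is a per-connection parameter in the cluster's config.ini, and it can be set for all TCP connections in a [tcp default] section. The 8M value below is purely illustrative (an assumption, not a figure derived from this report); size it for your own traffic.

```ini
[tcp default]
# Illustrative value only (assumption); pick a size that suits your workload.
SendBufferMemory=8M
```

A change to config.ini normally takes effect only after the management server has re-read the configuration and the nodes have been restarted.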

You then had GCP stop errors.  Please see the section called "Disk Data and GCP Stop errors" near the bottom of the following manual page:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html