MySQL Bugs: #55868: cluster restart through various node failures triggered by some missed heartbeats

Bug #55868	cluster restart through various node failures triggered by some missed heartbeats
Submitted:	10 Aug 2010 0:31	Modified:	19 Oct 2016 23:06
Reporter:	Robert Klikics	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.1	OS:	Linux (Debian 5.0)
Assigned to:		CPU Architecture:	Any
Tags:	error 2303 2305 2315 7.1.4b

Description:
About 40 minutes ago, all of our NDB data nodes restarted through a chain of node failures that started with a "missed heartbeat" failure. After the forced restart of this node, another node died with an "unpartitioned cluster" failure, which makes no sense to us (we thought that the data is split between the nodes in node groups?!):

2010-08-10 01:32:34 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2315: 'Node declared dead. See error log for details(Arbitration error). Temporary error, restart node'.
2010-08-10 01:32:34 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2010-08-10 01:32:37 [MgmtSrvr] ALERT -- Node 5: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-08-10 01:32:38 [MgmtSrvr] ALERT -- Node 1: Node 5 Disconnected
2010-08-10 01:32:40 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
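
To read the failure chain at a glance, the forced-shutdown lines above can be reduced to (time, node, error code) tuples. The following is only a minimal sketch in Python that assumes the management log format shown in this report; `parse_shutdowns` and `SHUTDOWN_RE` are hypothetical names and not part of any MySQL Cluster tooling.

```python
import re
from datetime import datetime

# Assumed line shape, taken from the log excerpt in this report:
# 2010-08-10 01:32:34 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2315: '...'
SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] ALERT\s+-- "
    r"Node (?P<node>\d+): Forced node shutdown completed\. "
    r"Caused by error (?P<code>\d+):"
)


def parse_shutdowns(lines):
    """Return (timestamp, node_id, error_code) for each forced shutdown, ordered by time."""
    events = []
    for line in lines:
        m = SHUTDOWN_RE.match(line.strip())
        if m is None:
            continue  # skip heartbeat warnings, disconnect notices, etc.
        ts = datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S")
        events.append((ts, int(m.group("node")), int(m.group("code"))))
    return sorted(events)
```

Calling `parse_shutdowns(open("ndb_1_cluster.log"))` would list the shutdowns in order; the file name here is an assumption and depends on the management node's configuration.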

We could not find any error messages that explain the failures.

Is it possible that this bug is fixed in 7.1.5? And is it possible to do an online downgrade from 7.1.4b to 7.0.13? 7.0.13 ran stably for about half a year in our configuration, so we are considering a downgrade if it is possible.

An ndb_error_reporter report is available at the following URL:
http://85.25.144.101/files/ndb_error_report_20100810014130.tar.bz2

thanks
martin p.

How to repeat:
No idea atm.

tags updated

About 20 minutes ago, one of our cluster nodes died again after a missed heartbeat, but this time the ndbd process on the data node was killed by a general protection fault:

mgm logs:
2010-08-27 01:34:01 [MgmtSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2
2010-08-27 01:34:06 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

ndb data node syslog:
Aug 27 01:35:06 ndb1 kernel: [2859380.533352] ndbd[13128] general protection ip:7f1fb354e23f sp:7fffffffd4d0 error:0 in libc-2.7.so[7f1fb34ea000+14a000]

An ndb_error_reporter report is available at the following URL:
http://85.25.144.101/files/ndb_error_report_20100827014513.tar.bz2

thanks
martin p.

After analyzing the logs, it seems that the general protection fault was not the reason the node died. The fault seemingly occurs because the node cannot allocate its node ID from the management server after it had killed itself:

2010-08-27 01:34:04 [ndbd] INFO     -- Node 2 killed this node because GCP stop was detected
2010-08-27 01:34:04 [ndbd] INFO     -- NDBCNTR (Line: 274) 0x00000008
2010-08-27 01:34:04 [ndbd] INFO     -- Error handler restarting system
2010-08-27 01:34:04 [ndbd] INFO     -- Error handler shutdown completed - exiting
2010-08-27 01:34:06 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-08-27 01:34:06 [ndbd] INFO     -- Ndb has terminated (pid 31247) restarting
2010-08-27 01:35:06 [ndbd] INFO     -- Unable to alloc node id
2010-08-27 01:35:06 [ndbd] INFO     -- Error : Could not alloc node id at 192.168.10.100 port 1186: Id 2 already allocated by another node.

About 30 minutes ago, the same behavior repeated. A node died because it missed some heartbeats, then segfaulted because it could not allocate its node ID, and the other nodes died through subsequent node failures.

This all seems to happen when a cron job runs that deletes a lot of entries (> 100k) in our cluster. Is this normal?
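
For illustration, a bulk delete like the one the cron job performs could be split into many small transactions instead of one large one. The following is only a sketch under stated assumptions: the table `sessions`, the column `expires_at`, and the function `delete_in_batches` are hypothetical, and `conn` is assumed to be any Python DB-API connection to a mysqld node whose driver uses the `%s` parameter style. Whether batching would avoid the missed heartbeats seen here is not established by this report.

```python
import time


def delete_in_batches(conn, batch_size=1000, pause_s=0.1):
    """Delete matching rows in small committed batches; return the total deleted."""
    total = 0
    while True:
        cur = conn.cursor()
        # Hypothetical table and predicate; DELETE ... LIMIT is MySQL syntax.
        cur.execute(
            "DELETE FROM sessions WHERE expires_at < NOW() LIMIT %s",
            (batch_size,),
        )
        deleted = cur.rowcount
        conn.commit()  # keep each transaction small
        cur.close()
        total += deleted
        if deleted < batch_size:
            return total  # fewer rows than requested: nothing left to delete
        time.sleep(pause_s)  # give the data nodes room between batches
```

The batch size and pause are placeholders and would need tuning against the actual cluster.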

An ndb_error_reporter report is available at the following URL:
http://85.25.144.101/files/ndb_error_report_20101013071945.tar.bz2

Cheers
Martin P.

Unfortunately, this comes 6 years too late and no logs are available any more. This does sound familiar (and solved), but without logs I can neither confirm nor reproduce it :( Setting as "can't reproduce", as I don't expect any feedback after 6 years.