Bug #55868 | cluster restart through various node failures triggered by some missed heartbeats | ||
---|---|---|---|
Submitted: | 10 Aug 2010 0:31 | Modified: | 19 Oct 2016 23:06 |
Reporter: | Robert Klikics | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-7.1 | OS: | Linux (Debian 5.0) |
Assigned to: | | CPU Architecture: | Any |
Tags: | error 2303 2305 2315 7.1.4b |
[10 Aug 2010 0:31]
Robert Klikics
[10 Aug 2010 0:36]
Robert Klikics
tags updated
[27 Aug 2010 0:03]
Robert Klikics
About 20 minutes ago, one of our cluster nodes died again after a missed heartbeat, but this time the ndbd process on the data node was killed by a general protection fault.

mgm logs:

2010-08-27 01:34:01 [MgmtSrvr] WARNING -- Node 2: Node 5 missed heartbeat 2

2010-08-27 01:34:06 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

ndb data node syslog:

Aug 27 01:35:06 ndb1 kernel: [2859380.533352] ndbd[13128] general protection ip:7f1fb354e23f sp:7fffffffd4d0 error:0 in libc-2.7.so[7f1fb34ea000+14a000]

An ndb_error_reporter report is available at the following URL: http://85.25.144.101/files/ndb_error_report_20100827014513.tar.bz2

Thanks,
Martin P.
[27 Aug 2010 0:30]
Robert Klikics
After analyzing the logs, it seems that the general protection fault was not the reason the node died. The fault apparently occurs because the node cannot allocate its node id from the mgm server after it has killed itself:

2010-08-27 01:34:04 [ndbd] INFO -- Node 2 killed this node because GCP stop was detected

2010-08-27 01:34:04 [ndbd] INFO -- NDBCNTR (Line: 274) 0x00000008

2010-08-27 01:34:04 [ndbd] INFO -- Error handler restarting system

2010-08-27 01:34:04 [ndbd] INFO -- Error handler shutdown completed - exiting

2010-08-27 01:34:06 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

2010-08-27 01:34:06 [ndbd] INFO -- Ndb has terminated (pid 31247) restarting

2010-08-27 01:35:06 [ndbd] INFO -- Unable to alloc node id

2010-08-27 01:35:06 [ndbd] INFO -- Error : Could not alloc node id at 192.168.10.100 port 1186: Id 2 already allocated by another node.
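To make the timing in the ndbd log above easier to check, here is a minimal sketch that parses three of the quoted log lines and measures how long the node spent between restarting and giving up on its node id. The regular expression and the helper names (`parse`, `seconds_between`) are illustrative assumptions based only on the lines shown in this report, not an official NDB log parser.

```python
import re
from datetime import datetime

# Three lines copied verbatim from the ndbd log quoted in this report.
LOG = """\
2010-08-27 01:34:04 [ndbd] INFO -- Node 2 killed this node because GCP stop was detected
2010-08-27 01:34:06 [ndbd] INFO -- Ndb has terminated (pid 31247) restarting
2010-08-27 01:35:06 [ndbd] INFO -- Unable to alloc node id
"""

# Assumed layout: "<timestamp> [<source>] <LEVEL> -- <message>".
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\w+)\s+-- (.*)$"
)


def parse(log):
    """Return a list of (timestamp, source, level, message) tuples."""
    events = []
    for line in log.splitlines():
        m = LINE_RE.match(line)
        if m:
            ts = datetime.strptime(m.group(1), "%Y-%m-%d %H:%M:%S")
            events.append((ts, m.group(2), m.group(3), m.group(4)))
    return events


def seconds_between(events, first_marker, second_marker):
    """Seconds from the first message containing first_marker
    to the first message containing second_marker."""
    first = next(ts for ts, _, _, msg in events if first_marker in msg)
    second = next(ts for ts, _, _, msg in events if second_marker in msg)
    return (second - first).total_seconds()


events = parse(LOG)
gap = seconds_between(events, "Ndb has terminated", "Unable to alloc node id")
print(gap)  # 60.0
```

The 60-second gap ends at 01:35:06, the same second as the kernel's general protection message in the syslog excerpt from the earlier comment, which fits the reading that the fault follows the failed node id allocation rather than causing the node failure.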
[13 Oct 2010 5:36]
Robert Klikics
About 30 minutes ago, the same behavior occurred again. A node died because it missed some heartbeats, then segfaulted because it could not allocate its node id, and the other nodes died as a result of the other node failures. This all seems to happen when a cron job runs that deletes a lot of entries (> 100k) in our cluster. Is this normal?

An ndb_error_reporter report is available at the following URL: http://85.25.144.101/files/ndb_error_report_20101013071945.tar.bz2

Cheers,
Martin P.
[19 Oct 2016 23:06]
MySQL Verification Team
Unfortunately, this is 6 years too late and no logs are available any more. This does sound familiar (and solved), but without logs I can't confirm or reproduce it :( Setting as "can't reproduce" as I don't expect any feedback after 6 years.