Bug #17609 Nodes crashes from time to time (ndb Nodes in Cluster)
Submitted: 21 Feb 2006 11:07 Modified: 12 Jun 2006 7:18
Reporter: Jörg Nowak
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.2_a-drop5p5
OS: Linux (Suse 9, 64 bit)
Assigned to:
CPU Architecture: Any

[21 Feb 2006 11:07] Jörg Nowak
Description:
After some time under load (a mixture of reads and updates), some nodes crash. It happens from time to time, and I could not find any pattern in it (different node IDs, different load mixtures).

Node 4: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

I'm not sure whether this bug was reported before by other users, but I didn't find it among the currently open bugs.

How to repeat:
Difficult, because it does not happen very often, but always when you don't expect it.

Suggested fix:
Bug fix
[21 Feb 2006 11:40] Hartmut Holzgraefe
Please upload config.ini, cluster log, error logs and tracefiles.
And we can see...
[21 Feb 2006 11:57] Jörg Nowak
Log files for Bug 17609

Attachment: bug17609.zip (application/x-zip-compressed, text), 94.32 KiB.

[14 Mar 2006 14:25] Jörg Nowak
The bug is still there in 5.1.2_a-drop5p9.

We have the impression that this sometimes occurs when a client connected via the NDB API has connection problems with the management server.

Maybe that is a hint for where to look.
[21 Mar 2006 8:41] Jörg Nowak
Last night we had a complete cluster crash because of this:

Forced node shutdown completed. Initiated by signal 0. Caused by error 2809: 'Temporary on access to file(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2809: 'Temporary on access to file(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 4: Forced node shutdown completed. Initiated by signal 0. Caused by error 2809: 'Temporary on access to file(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
Node 6: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
Node 8: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
Node 7: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
Node 9: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

Therefore I am changing the severity to S1 (Critical).
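
For triage, the shutdown messages above can be tallied by error code to see which nodes failed first and which only followed. The following is a minimal sketch in Python, assuming the log lines have exactly the format quoted in this report; the function name `summarize_shutdowns` is hypothetical and not part of any MySQL tool.

```python
import re
from collections import defaultdict

# Pattern for the shutdown lines quoted in this report. The "Node N: " prefix
# is optional because one of the quoted lines lacks it.
SHUTDOWN_RE = re.compile(
    r"(?:Node (?P<node>\d+): )?Forced node shutdown completed\. "
    r"Initiated by signal (?P<signal>\d+)\. "
    r"Caused by error (?P<error>\d+): '(?P<message>.*)'\.?$"
)


def summarize_shutdowns(lines):
    """Group node ids by error code; a missing node id is recorded as None."""
    by_error = defaultdict(list)
    for line in lines:
        match = SHUTDOWN_RE.search(line.strip())
        if match is None:
            continue  # not a forced-shutdown line
        node = match.group("node")
        by_error[int(match.group("error"))].append(int(node) if node else None)
    return dict(by_error)
```

Applied to the eight lines above, this groups the first line (no node id) and nodes 3 and 4 under error 2809, and nodes 2, 6, 8, 7 and 9 under error 2305.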
[21 Mar 2006 9:08] Jörg Nowak
I tried to restart the crashed cluster, but that fails too:

2006-03-21 09:58:28 [MgmSrvr] INFO     -- Node 8: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 5 sig->failNo =

This message appears on the majority of the nodes in the 8-node NDB cluster, so the cluster does not restart.
[12 May 2006 7:18] Valeriy Kravchuk
Please try to repeat this with the latest version in your branch, -drop5p13, and let us know the results.
[12 Jun 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".