Bug #55037 Seemingly random cluster crash with unrecoverable error during recovery
Submitted: 6 Jul 2010 21:47 Modified: 30 Sep 2010 12:48
Reporter: Hamid Badiozamani
Status: No Feedback
Category: MySQL Cluster: NDB API Severity: S1 (Critical)
Version: 5.0.90 OS: Linux
Assigned to: CPU Architecture: Any
Tags: ndb

[6 Jul 2010 21:47] Hamid Badiozamani
Description:
Hi guys,

    We got an unexpected crash with the following error message.

2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 3 sig->failNo =
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 4: Communication to Node 2 closed
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 4: Communication to Node 6 closed
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 3: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 3 sig->failNo =
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 closed
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 5: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 3 sig->failNo =
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 5: Communication to Node 2 closed
2010-07-01 21:18:22 [MgmSrvr] INFO     -- Node 5: Communication to Node 6 closed
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 5: Communication to Node 26 opened
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 26 opened
2010-07-01 21:18:23 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2010-07-01 21:18:23 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2010-07-01 21:18:23 [MgmSrvr] ALERT    -- Node 5: Node 4 Disconnected
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 3: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 4 sig->failNo =
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 closed
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 5: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 4 sig->failNo =
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 5: Communication to Node 2 closed
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 5: Communication to Node 4 closed
2010-07-01 21:18:23 [MgmSrvr] INFO     -- Node 5: Communication to Node 6 closed
2010-07-01 21:18:24 [MgmSrvr] INFO     -- Node 5: Communication to Node 6 opened
2010-07-01 21:18:24 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2010-07-01 21:18:25 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2010-07-01 21:18:25 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2010-07-01 21:18:25 [MgmSrvr] ALERT    -- Node 1: Node 5 Disconnected
2010-07-01 21:18:26 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2010-07-01 21:18:27 [MgmSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
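
For anyone triaging a similar excerpt, here is a minimal sketch (not part of the original report) that pulls the forced-shutdown events out of a cluster log and groups them by node with their error codes. It assumes the line layout seen in the log above (`<date> <time> [MgmSrvr] <LEVEL> -- Node <id>: <message>`); the function name `summarize_shutdowns` and the regular expressions are illustrative and are not part of any MySQL tool.

```python
import re
from collections import defaultdict

# Assumed layout, taken from the excerpt above:
# "<date> <time> [MgmSrvr] <LEVEL> -- Node <id>: <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] "
    r"(?P<level>\w+)\s+-- Node (?P<node>\d+): (?P<msg>.*)$"
)
# Error code as it appears in "Caused by error 2305: '...'"
ERROR_RE = re.compile(r"Caused by error (\d+)")


def summarize_shutdowns(lines):
    """Map each node id to a list of (timestamp, error code) for forced shutdowns.

    Lines that do not match the assumed layout, or that are not
    "Forced node shutdown completed" events, are skipped.
    """
    shutdowns = defaultdict(list)
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m or "Forced node shutdown completed" not in m["msg"]:
            continue
        err = ERROR_RE.search(m["msg"])
        code = int(err.group(1)) if err else None  # None if no code is present
        shutdowns[int(m["node"])].append((m["ts"], code))
    return dict(shutdowns)
```

Feeding it the lines of a saved cluster log (for example `summarize_shutdowns(open(path))`, where `path` is wherever the log was saved) gives a per-node list that makes it easier to see which nodes shut down, in what order, and with which error.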

What could the problem be? Thanks

-Hamid

How to repeat:
The original crash does not seem to be reproducible. However, every attempt to restart the database cluster afterwards consistently fails with:

2010-07-02 04:20:45 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2306: 'Pointer too large(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

We ended up restoring from a full backup.
[6 Jul 2010 21:52] Hamid Badiozamani
NDB Trace log for failed node

Attachment: ndb_4_trace.zip (application/zip, text), 53.27 KiB.

[31 Aug 2010 12:48] Martin Skold
Please try a newer version; 5.0.90 is very old.
[1 Oct 2010 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".