Bug #26931: Cluster Crashed
Submitted: 7 Mar 2007 19:32    Modified: 19 Jun 2009 12:51
Reporter: Jeremy Kusnetz
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: mysql-5.1    OS: Linux (Linux x86_64)
Assigned to: Assigned Account    CPU Architecture: Any
Tags: 5.1.16

[7 Mar 2007 19:32] Jeremy Kusnetz
Description:
New cluster setup, was performing some transactions when the application errored with "DBD::mysql::st execute failed: Got temporary error 4025 'Node failure caused abort of transaction' from NDBCLUSTER at ADMINTOOL/SESSION.pm line 46."
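Error 4025 is classed by NDB as a temporary error, so the standard client-side response is to retry the whole transaction rather than fail outright. A minimal retry sketch, in Python rather than the reporter's Perl, with a stand-in exception class; all names here are illustrative and not part of the original report:

```python
import time

class TemporaryNdbError(Exception):
    """Stand-in for a driver exception carrying a temporary NDB error
    such as 4025 ('Node failure caused abort of transaction')."""

def retry_transaction(run_txn, attempts=5, backoff=0.05):
    """Run the transaction body, retrying on temporary NDB errors.

    run_txn  -- callable executing the whole transaction (hypothetical)
    attempts -- maximum number of tries before giving up
    backoff  -- base delay in seconds, scaled linearly per attempt
    """
    for attempt in range(1, attempts + 1):
        try:
            return run_txn()
        except TemporaryNdbError:
            if attempt == attempts:
                raise  # exhausted retries; surface the error
            time.sleep(backoff * attempt)  # back off before retrying
```

Retrying only papers over transient node failures, of course; it does not help here, since the data nodes themselves went down.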

Found all ndbd processes on all the data nodes gone.

From the mgm cluster.log I see the following:

2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 2: Node 5 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 3: Node 5 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 4: Node 5 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 4: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 1: Node 5 Disconnected
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 7: Node 5 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 7: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 2: Node 6 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 2: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 8 sig->failNo =
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 2: Communication to Node 6 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 1: Node 6 Disconnected
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 2: Arbitration check won - node group majority
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=6]
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 3: Node 6 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 3: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 8 sig->failNo =
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 3: Communication to Node 6 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 4: Node 6 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 4: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 8 sig->failNo =
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 4: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 4: Communication to Node 6 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 7: Node 6 Disconnected
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 7: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 8 sig->failNo =
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 7: Communication to Node 5 closed
2007-03-07 19:16:23 [MgmSrvr] INFO     -- Node 7: Communication to Node 6 closed
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 1: Node 6 Disconnected
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2007-03-07 19:16:23 [MgmSrvr] ALERT    -- Node 6: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2007-03-07 19:16:24 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2007-03-07 19:16:24 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2007-03-07 19:16:24 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2007-03-07 19:16:24 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2007-03-07 19:16:24 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2007-03-07 19:16:24 [MgmSrvr] ALERT    -- Node 7: Node 4 Disconnected
2007-03-07 19:16:24 [MgmSrvr] INFO     -- Node 7: Communication to Node 4 closed
2007-03-07 19:16:24 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2007-03-07 19:16:25 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2007-03-07 19:16:25 [MgmSrvr] ALERT    -- Node 3: Node 2 Disconnected
2007-03-07 19:16:25 [MgmSrvr] INFO     -- Node 3: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 10 sig->failNo
2007-03-07 19:16:25 [MgmSrvr] INFO     -- Node 3: Communication to Node 2 closed
2007-03-07 19:16:25 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2007-03-07 19:16:25 [MgmSrvr] ALERT    -- Node 7: Node 2 Disconnected
2007-03-07 19:16:25 [MgmSrvr] INFO     -- Node 7: Possible bug in Dbdih::execBLOCK_COMMIT_ORD c_blockCommit = 1 c_blockCommitNo = 10 sig->failNo
2007-03-07 19:16:25 [MgmSrvr] INFO     -- Node 7: Communication to Node 2 closed
2007-03-07 19:16:25 [MgmSrvr] INFO     -- Node 7: Communication to Node 4 closed
2007-03-07 19:16:25 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2007-03-07 19:16:25 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2007-03-07 19:16:25 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2007-03-07 19:16:26 [MgmSrvr] ALERT    -- Node 1: Node 7 Disconnected
2007-03-07 19:16:26 [MgmSrvr] ALERT    -- Node 7: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
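The error 2305 ("Arbitration error") shutdowns at the end of the log mean the surviving nodes lost contact with their peers and could not win arbitration to form an unpartitioned cluster. In MySQL Cluster, arbitration is normally delegated to the management server via ArbitrationRank. A minimal config.ini sketch for reference; host names and layout are hypothetical, not taken from this report:

```ini
# Hypothetical config.ini fragment -- host names are illustrative only.
[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
HostName=mgm1
ArbitrationRank=1      ; management node acts as arbitrator

[ndbd]
HostName=data1

[ndbd]
HostName=data2
```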

How to repeat:
Just inserting, updating and querying data.  Nothing special.  One user, no load.
[12 Mar 2007 10:17] Hartmut Holzgraefe
Hi,

could you provide the full log files (cluster log, node error logs, node trace files, node output files) and the cluster config file for this incident?

You might find the ndb_error_reporter tool useful:
http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-utilities-ndb-error-reporter.html
[12 Mar 2007 22:21] Jeremy Kusnetz
ndb_error_report

Attachment: ndb_error_report_20070312221804.tar.bz2 (application/octet-stream, text), 409.21 KiB.

[14 Mar 2007 22:13] Jeremy Kusnetz
I'm getting more crashes the more I use it; I now see these errors:

2007-03-14 22:07:09 [ndbd] INFO     -- Error handler restarting system
2007-03-14 22:07:10 [ndbd] INFO     -- Error handler shutdown completed - exiting
2007-03-14 22:07:10 [ndbd] ALERT    -- Node 7: Forced node shutdown completed, restarting. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2007-03-14 22:07:10 [ndbd] INFO     -- Ndb has terminated (pid 19672) restarting
2007-03-14 22:07:13 [ndbd] INFO     -- Angel pid: 5457 ndb pid: 19074
2007-03-14 22:07:13 [ndbd] INFO     -- NDB Cluster -- DB node 7
2007-03-14 22:07:13 [ndbd] INFO     -- Version 5.1.16 (beta) --
2007-03-14 22:07:13 [ndbd] INFO     -- Configuration fetched at ndb_mgmd1 port 1186
2007-03-14 22:07:13 [ndbd] INFO     -- Start initiated (version 5.1.16)
2007-03-14 22:07:13 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 20Mb initial: 20Mb
WOPool::init(61, 9)
RWPool::init(82, 13)
RWPool::init(a2, 18)
RWPool::init(c2, 13)
RWPool::init(122, 18)
RWPool::init(142, 18)
WOPool::init(41, 12)
RWPool::init(e2, 12)
RWPool::init(102, 52)
WOPool::init(21, 10)
RESTORE table: 0 512 rows applied
RESTORE table: 0 265 rows applied
RESTORE table: 1 8 rows applied
RESTORE table: 1 4 rows applied
RESTORE table: 2 2 rows applied
RESTORE table: 2 1 rows applied
RESTORE table: 3 2 rows applied
RESTORE table: 3 1 rows applied
RESTORE table: 4 0 rows applied
RESTORE table: 4 0 rows applied
RESTORE table: 5 216 rows applied
RESTORE table: 5 183 rows applied
RESTORE table: 6 510 rows applied
RESTORE table: 6 466 rows applied
RESTORE table: 9 42678 rows applied
RESTORE table: 9 42732 rows applied
RESTORE table: 12 11779 rows applied
RESTORE table: 12 11713 rows applied
RESTORE table: 13 58 rows applied
RESTORE table: 13 55 rows applied
RESTORE table: 19 50582 rows applied
RESTORE table: 19 50501 rows applied
RESTORE table: 21 11529 rows applied
RESTORE table: 21 11867 rows applied
RESTORE table: 22 401 rows applied
RESTORE table: 22 417 rows applied
RESTORE table: 23 5705 rows applied
RESTORE table: 23 5829 rows applied
RESTORE table: 27 59 rows applied
RESTORE table: 27 52 rows applied
RESTORE table: 28 42711 rows applied
RESTORE table: 28 42767 rows applied
RESTORE table: 30 90 rows applied
RESTORE table: 30 90 rows applied
RESTORE table: 31 0 rows applied
RESTORE table: 31 0 rows applied
RESTORE table: 33 65 rows applied
RESTORE table: 33 60 rows applied
m_active_buckets.set(1)
table 2 options 3
table 3 options 1
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 1
sent SUBSCRIBE(11) to node 13, req_nodeid: 13  senderData: 72
table 4 options 0
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 2
sent SUBSCRIBE(11) to node 8, req_nodeid: 8  senderData: 72
sent SUBSCRIBE(11) to node 13, req_nodeid: 8  senderData: 72
sent SUBSCRIBE(11) to node 8, req_nodeid: 13  senderData: 72
table 9 options 0
table 12 options 0
table 13 options 1
table 33 options 1
table 22 options 1
table 28 options 0
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 3
sent SUBSCRIBE(11) to node 9, req_nodeid: 9  senderData: 72
sent SUBSCRIBE(11) to node 8, req_nodeid: 9  senderData: 72
sent SUBSCRIBE(11) to node 9, req_nodeid: 8  senderData: 72
sent SUBSCRIBE(11) to node 13, req_nodeid: 9  senderData: 72
sent SUBSCRIBE(11) to node 9, req_nodeid: 13  senderData: 72
table 27 options 1
table 23 options 0
table 21 options 1
table 5 options 0
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 4
sent SUBSCRIBE(11) to node 12, req_nodeid: 12  senderData: 72
sent SUBSCRIBE(11) to node 9, req_nodeid: 12  senderData: 72
sent SUBSCRIBE(11) to node 12, req_nodeid: 9  senderData: 72
sent SUBSCRIBE(11) to node 8, req_nodeid: 12  senderData: 72
sent SUBSCRIBE(11) to node 12, req_nodeid: 8  senderData: 72
sent SUBSCRIBE(11) to node 13, req_nodeid: 12  senderData: 72
sent SUBSCRIBE(11) to node 12, req_nodeid: 13  senderData: 72
table 6 options 1
table 19 options 0
table 30 options 0
table 31 options 1
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 5
sent SUBSCRIBE(11) to node 10, req_nodeid: 10  senderData: 72
sent SUBSCRIBE(11) to node 12, req_nodeid: 10  senderData: 72
sent SUBSCRIBE(11) to node 10, req_nodeid: 12  senderData: 72
sent SUBSCRIBE(11) to node 9, req_nodeid: 10  senderData: 72
sent SUBSCRIBE(11) to node 10, req_nodeid: 9  senderData: 72
sent SUBSCRIBE(11) to node 8, req_nodeid: 10  senderData: 72
sent SUBSCRIBE(11) to node 10, req_nodeid: 8  senderData: 72
sent SUBSCRIBE(11) to node 13, req_nodeid: 10  senderData: 72
sent SUBSCRIBE(11) to node 10, req_nodeid: 13  senderData: 72
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 6
sent SUBSCRIBE(11) to node 11, req_nodeid: 11  senderData: 72
sent SUBSCRIBE(11) to node 10, req_nodeid: 11  senderData: 72
sent SUBSCRIBE(11) to node 11, req_nodeid: 10  senderData: 72
sent SUBSCRIBE(11) to node 12, req_nodeid: 11  senderData: 72
sent SUBSCRIBE(11) to node 11, req_nodeid: 12  senderData: 72
sent SUBSCRIBE(11) to node 9, req_nodeid: 11  senderData: 72
sent SUBSCRIBE(11) to node 11, req_nodeid: 9  senderData: 72
sent SUBSCRIBE(11) to node 8, req_nodeid: 11  senderData: 72
sent SUBSCRIBE(11) to node 11, req_nodeid: 8  senderData: 72
sent SUBSCRIBE(11) to node 13, req_nodeid: 11  senderData: 72
sent SUBSCRIBE(11) to node 11, req_nodeid: 13  senderData: 72
[21 Mar 2007 18:25] Jeremy Kusnetz
We changed from 6 nodes in 3 node groups to 8 nodes in 4 node groups in hopes of improving stability. No luck; the cluster just crashed again. I will add the latest ndb_error_report.
[21 Mar 2007 18:33] Jeremy Kusnetz
The ndb_error_report was too big to upload through the website, so I've FTPed it to you.

bug-data-26931.ZIP
[29 Mar 2007 15:09] Jeremy Kusnetz
I haven't really heard anything on this bug.  I've gotten a lot more crashes recently.  Is it worth uploading more error logs?
[19 May 2009 12:51] Jonathan Miller
Is this bug still valid with latest version?
[19 Jun 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".