Bug #36759 MySQL Cluster Node Failure
Submitted: 16 May 2008 15:00
Modified: 16 Jun 2008 15:13
Reporter: Riyaaz Domingo
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.24
OS: Other (Debian)
Assigned to:
CPU Architecture: Any

[16 May 2008 15:00] Riyaaz Domingo
Description:

When reintroducing the second data node into a two-node cluster, the failure of the second node resulted in the first node crashing. I have extracted the log data from both the NDB log and the management node log.

2008-05-16 15:32:10 [MgmSrvr] INFO     -- Node 2: Node 3 is WAIT_LCP including in LCP
2008-05-16 15:38:09 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-05-16 15:38:09 [MgmSrvr] ALERT    -- Node 2: Node 3 Disconnected
2008-05-16 15:38:09 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed
2008-05-16 15:38:09 [MgmSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
2008-05-16 15:38:09 [MgmSrvr] INFO     -- Node 2: President restarts arbitration thread [state=7]
2008-05-16 15:38:10 [MgmSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 1
2008-05-16 15:38:10 [MgmSrvr] INFO     -- Node 2: DICT: remove lock by failed node 3 for NodeRestart
2008-05-16 15:38:10 [MgmSrvr] INFO     -- Node 2: DICT: lock bs: 0 ops: 0 poll: 0 cnt: 0 queue:
2008-05-16 15:38:11 [MgmSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=63430029f1f3b40a]
2008-05-16 15:39:15 [MgmSrvr] WARNING  -- Node 2: Failure handling of node 3 has not completed in 1 min. - state = 3
2008-05-16 15:40:20 [MgmSrvr] WARNING  -- Node 2: Failure handling of node 3 has not completed in 2 min. - state = 3
2008-05-16 15:41:26 [MgmSrvr] WARNING  -- Node 2: Failure handling of node 3 has not completed in 3 min. - state = 3
2008-05-16 15:42:08 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 opened
2008-05-16 15:43:49 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2008-05-16 15:43:49 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2008-05-16 15:43:53 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2008-05-16 15:43:54 [MgmSrvr] INFO     -- Mgmt server state: nodeid 4 reserved for ip 10.178.110.206, m_reserved_nodes 0000000000000012.
2008-05-16 15:43:54 [MgmSrvr] INFO     -- Node 4: mysqld --server-id=0
2008-05-16 15:43:55 [MgmSrvr] INFO     -- Node 2: Node 4 Connected
2008-05-16 15:43:55 [MgmSrvr] INFO     -- Node 2: Node 4: API version 5.1.24
2008-05-16 15:44:00 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-05-16 15:44:03 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
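
For anyone triaging a longer cluster log of this kind, the excerpt above can be reduced to its ALERT and WARNING entries with a short script. This is only a minimal sketch: the line layout (timestamp, bracketed source, severity, `--`, message) is inferred from the lines quoted in this report, and the names `parse_line` and `important_events` are hypothetical helpers, not part of any MySQL tool.

```python
import re
from datetime import datetime

# Layout inferred from the excerpt in this report (an assumption):
#   2008-05-16 15:38:09 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"\[(?P<source>\w+)\] (?P<severity>\w+)\s+-- (?P<message>.*)$"
)


def parse_line(line):
    """Split one cluster log line into fields; return None if it does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return {
        "time": datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S"),
        "source": match.group("source"),
        "severity": match.group("severity"),
        "message": match.group("message"),
    }


def important_events(lines, severities=("ALERT", "WARNING")):
    """Keep only the parsed entries whose severity is in `severities`."""
    events = []
    for line in lines:
        entry = parse_line(line)
        if entry is not None and entry["severity"] in severities:
            events.append(entry)
    return events


# Three lines copied from the log excerpt in this report.
SAMPLE = [
    "2008-05-16 15:38:09 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected",
    "2008-05-16 15:38:09 [MgmSrvr] INFO     -- Node 2: Communication to Node 3 closed",
    "2008-05-16 15:39:15 [MgmSrvr] WARNING  -- Node 2: Failure handling of node 3 "
    "has not completed in 1 min. - state = 3",
]

if __name__ == "__main__":
    for event in important_events(SAMPLE):
        print(event["time"], event["severity"], event["message"])
```

Filtering this way makes the sequence easier to follow: node 3 disconnects, its failure handling does not complete for several minutes, and node 2 is then shut down with error 2341.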

Please advise if you require any other information regarding the hardware or OS, or additional log information. For now I have only started the cluster with the working data node.

Regards,

How to repeat:
I tried this on a test setup and it worked fine; however, I cannot try to reproduce it on our production setup, as this would cause a service impact.
[16 May 2008 15:13] Hartmut Holzgraefe
Can you provide your config.ini and the log files from the data nodes, too?

(ndb_*_out.log, ndb_*_error.log, ndb_*_trace.log*)
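
One possible way to gather the requested files from a data node into a single archive for upload is sketched below. The file patterns are the ones listed above; the directory passed in is an assumption (it depends on where each node keeps its data directory), and `find_ndb_logs` and `bundle_logs` are hypothetical helper names. The config.ini would still need to be attached separately.

```python
import glob
import os
import tarfile

# File name patterns requested in this bug report.
PATTERNS = ("ndb_*_out.log", "ndb_*_error.log", "ndb_*_trace.log*")


def find_ndb_logs(data_dir):
    """Return the sorted paths in data_dir that match the requested patterns."""
    found = set()
    for pattern in PATTERNS:
        found.update(glob.glob(os.path.join(data_dir, pattern)))
    return sorted(found)


def bundle_logs(data_dir, archive_path):
    """Write the matching files to a gzip-compressed tar; return their base names."""
    files = find_ndb_logs(data_dir)
    with tarfile.open(archive_path, "w:gz") as tar:
        for path in files:
            # Store only the base name so the archive does not expose host paths.
            tar.add(path, arcname=os.path.basename(path))
    return [os.path.basename(path) for path in files]


if __name__ == "__main__":
    # "." is a placeholder; point this at the data node's data directory.
    print(bundle_logs(".", "ndb_logs.tar.gz"))
```

Running this once per data node, with a different archive name each time, keeps the logs of the two nodes separate.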
[16 Jun 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".