Bug #18610 Failure of failure handling causes Cluster-Servers to consume 99% of all CPU
Submitted: 29 Mar 2006 9:12 Modified: 5 Apr 2006 7:32
Reporter: Lars Torstensson Email Updates:
Status: Duplicate Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: 5.0.19 OS: Linux (RedHat ES 4)
Assigned to: CPU Architecture: Any

[29 Mar 2006 9:12] Lars Torstensson
Description:
Failure of failure handling causes Cluster-Servers to consume 99% of all CPU.
 
We have a 6-node, 3-replica cluster running on RedHat ES 4 (64-bit) and an ndb_mgmd running on RedHat ES 4 (32-bit).

Cluster-Server node 5 was started and entered start phase 2 (waiting for node 6 to complete its restart). Node 5 then gives the following error and stops.

Node 5: Forced node shutdown completed. Occured during startphase 2. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

The failure handling then bugs out, leaving this message every minute in the cluster log:
Mar 29 10:32:43 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 5 min. - state = 3

CPU usage on two of the cluster servers (nodes 2 and 3) goes to 99%.


How to repeat:
We have a 6-node, 3-replica cluster running on RedHat ES 4 (64-bit) and an ndb_mgmd running on RedHat ES 4 (32-bit).

1. All Cluster-Server nodes were running.
2. One Cluster-Client node (node 39) was added to the config.ini file.
3. Cluster-Server nodes 5 and 6 were shut down.
4. Cluster-Server node 6 was started and entered start phase 5.
5. Cluster-Server node 5 was started and entered start phase 2 (waiting for node 6 to complete).
6. Cluster-Server node 5 gives the following error and core dumps.

Node 5: Forced node shutdown completed. Occured during startphase 2. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

7. Node 6 enters started mode.
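The restart sequence in steps 3-5 can be sketched with the standard MySQL Cluster 5.0 tools; this is only an illustration of the reproduction steps, not output from the report, and the management-server host below is the one named in the logs:

```shell
# From the management client, stop data nodes 5 and 6 (steps 3)
ndb_mgm -e "5 STOP"
ndb_mgm -e "6 STOP"

# On node 6's host, restart the data node and let it proceed
# through its start phases (step 4)
ndbd --connect-string="nodeid=6,host=nl2-db4.i3.bredband.com:14500"

# On node 5's host, restart it while node 6 is still starting,
# so node 5 waits in start phase 2 (step 5)
ndbd --connect-string="nodeid=5,host=nl2-db4.i3.bredband.com:14500"

# Watch the start phases from the management client
ndb_mgm -e "ALL STATUS"
```

The `all status` output quoted below shows the state these commands should produce just before the crash: nodes 1-4 started, node 5 in phase 2, node 6 in phase 5.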

#################### From mgmd server ######################

ndb_mgm> all status
Node 1: started (Version 5.0.19)
Node 2: started (Version 5.0.19)
Node 3: started (Version 5.0.19)
Node 4: started (Version 5.0.19)
Node 5: starting (Phase 2) (Version 5.0.19)
Node 6: starting (Phase 5) (Version 5.0.19)

ndb_mgm> Node 6: Started (version 5.0.19)
Node 5: Forced node shutdown completed. Occured during startphase 2. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

ndb_mgm> all status
Node 1: started (Version 5.0.19)
Node 2: started (Version 5.0.19)
Node 3: started (Version 5.0.19)
Node 4: started (Version 5.0.19)
Node 5: not connected
Node 6: started (Version 5.0.19)

#######################
[29 Mar 2006 9:45] Lars Torstensson
The bug causes cluster node 3 (one of the 99% CPU nodes) to crash.
No trace file or core file was written.

#################### Cluster log ##########################  

Mar 29 11:10:56 nl2-db4 NDB[659]: [MgmSrvr] Node 3: Failure handling of node 5 has not completed in 43 min. - state = 3
Mar 29 11:11:55 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 missed heartbeat 2
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:57 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 missed heartbeat 3
Mar 29 11:11:57 nl2-db4 NDB[659]: [MgmSrvr] Node 7: Node 3 Connected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 missed heartbeat 4
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 declared dead due to missed heartbeat
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Network partitioning - arbitration required
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: President restarts arbitration thread [state=7]
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Arbitration won - positive reply from node 7
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 Disconnected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Node 3 Disconnected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Node 3 Disconnected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Node 3 Disconnected
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Prepare arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Prepare arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Prepare arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Started arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Communication to Node 3 opened
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Communication to Node 3 opened
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Communication to Node 3 opened
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Communication to Node 3 opened
Mar 29 11:12:09 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Local checkpoint 879 started. Keep GCI = 6730 oldest restorable GCI = 6439
Mar 29 11:12:56 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 45 min. - state = 3
Mar 29 11:12:56 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Failure handling of node 5 has not completed in 45 min. - state = 3
Mar 29 11:12:57 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Failure handling of node 5 has not completed in 45 min. - state = 3

####################  ndb_3_out.log #################

[root@nl3-db3 log]# more ndb_3_out.log 
2006-03-28 14:52:08 [ndbd] INFO     -- Angel pid: 5365 ndb pid: 5366
2006-03-28 14:52:09 [ndbd] INFO     -- NDB Cluster -- DB node 3
2006-03-28 14:52:09 [ndbd] INFO     -- Version 5.0.19 --
2006-03-28 14:52:09 [ndbd] INFO     -- Configuration fetched at nl2-db4.i3.bredband.com port 14500
2006-03-28 14:52:09 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2006-03-28 14:52:14 [ndbd] INFO     -- Start initiated (version 5.0.19)
[29 Mar 2006 16:18] Valeriy Kravchuk
Thank you for the detailed bug report. It looks similar to bug #18043. Please check.
[30 Mar 2006 7:43] Lars Torstensson
No, it's not the same: this bug leads to a failure of handling the error, as you can see in the cluster log (local0.log). I don't find this in the cluster log in the other case. The other problem with this bug is that the CPU load went up to 99% on two of the database-cluster nodes (and is still there 24 hours later on database node 2).

If you want, I can upload the core file.

Mar 30 09:45:22 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Failure handling of node 5 has not completed in 1390 min. - state = 3
Mar 30 09:45:26 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Failure handling of node 5 has not completed in 1390 min. - state = 3
Mar 30 09:46:16 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 1391 min. - state = 3
[30 Mar 2006 8:22] Lars Torstensson
You can find the Trace file and core files here
http://surf.macallan.nu/log.tar.gz
[31 Mar 2006 14:22] Lars Torstensson
A nice side effect is that it is impossible to start Cluster-Server node 5: it stalls in start phase 1. A total system shutdown is required to get the whole cluster up and running again.

I guess it is caused by this in the cluster log:
Mar 31 14:01:38 nl2-db4 NDB[4251]: [MgmSrvr] Node 4: Failure handling of node 5 has not completed in 3077 min. - state = 3

This is the output from node 5:
2006-03-31 13:47:21 [ndbd] INFO     -- NDB Cluster -- DB node 5
2006-03-31 13:47:21 [ndbd] INFO     -- Version 5.0.19 --
2006-03-31 13:47:21 [ndbd] INFO     -- Configuration fetched at nl2-db4.i3.bredband.com port 14500
2006-03-31 13:47:21 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2006-03-31 13:47:25 [ndbd] INFO     -- Start initiated (version 5.0.19)
[4 Apr 2006 9:04] Geert Vanderkelen
Hi Lars,

This is probably the same problem as found in bug #16772, which has been fixed.
I will set this bug to 'Need Feedback' so you can check on your side whether it matches.

Best regards,

Geert
[4 Apr 2006 14:41] Lars Torstensson
OK, it's a match.
[5 Apr 2006 7:32] Valeriy Kravchuk
Duplicate of bug #16772.