Bug #18610 | Failure of failure handling causes Cluster-Servers to consume 99% of all CPU | ||
---|---|---|---|
Submitted: | 29 Mar 2006 9:12 | Modified: | 5 Apr 2006 7:32 |
Reporter: | Lars Torstensson | Email Updates: | |
Status: | Duplicate | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 5.0.19 | OS: | Linux (RedHat ES 4) |
Assigned to: | | CPU Architecture: | Any
[29 Mar 2006 9:12]
Lars Torstensson
[29 Mar 2006 9:45]
Lars Torstensson
The bug causes cluster node 3 (one of the 99% CPU nodes) to crash. No trace file or core file was written.
#################### Cluster log ##########################
Mar 29 11:10:56 nl2-db4 NDB[659]: [MgmSrvr] Node 3: Failure handling of node 5 has not completed in 43 min. - state = 3
Mar 29 11:11:55 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 missed heartbeat 2
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:56 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Failure handling of node 5 has not completed in 44 min. - state = 3
Mar 29 11:11:57 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 missed heartbeat 3
Mar 29 11:11:57 nl2-db4 NDB[659]: [MgmSrvr] Node 7: Node 3 Connected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 missed heartbeat 4
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 declared dead due to missed heartbeat
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Communication to Node 3 closed
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Network partitioning - arbitration required
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: President restarts arbitration thread [state=7]
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Arbitration won - positive reply from node 7
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Node 3 Disconnected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Node 3 Disconnected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Node 3 Disconnected
Mar 29 11:11:58 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Node 3 Disconnected
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Prepare arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Prepare arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Prepare arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:11:59 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Started arbitrator node 7 [ticket=7b130003453ea795]
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Communication to Node 3 opened
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Communication to Node 3 opened
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Communication to Node 3 opened
Mar 29 11:12:02 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Communication to Node 3 opened
Mar 29 11:12:09 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Local checkpoint 879 started. Keep GCI = 6730 oldest restorable GCI = 6439
Mar 29 11:12:56 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 45 min. - state = 3
Mar 29 11:12:56 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Failure handling of node 5 has not completed in 45 min. - state = 3
Mar 29 11:12:57 nl2-db4 NDB[659]: [MgmSrvr] Node 2: Failure handling of node 5 has not completed in 45 min. - state = 3
#################### ndb_3_out.log #################
[root@nl3-db3 log]# more ndb_3_out.log
2006-03-28 14:52:08 [ndbd] INFO -- Angel pid: 5365 ndb pid: 5366
2006-03-28 14:52:09 [ndbd] INFO -- NDB Cluster -- DB node 3
2006-03-28 14:52:09 [ndbd] INFO -- Version 5.0.19 --
2006-03-28 14:52:09 [ndbd] INFO -- Configuration fetched at nl2-db4.i3.bredband.com port 14500
2006-03-28 14:52:09 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2006-03-28 14:52:14 [ndbd] INFO -- Start initiated (version 5.0.19)
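For reference, the node and failure-handling states reported above can also be checked interactively through the NDB management client; a minimal sketch, assuming the management server from the log (nl2-db4, port 14500) is reachable:
# List all cluster nodes and their connection state
ndb_mgm --ndb-connectstring=nl2-db4:14500 -e "SHOW"
# Report the start/failure-handling status of every data node
ndb_mgm --ndb-connectstring=nl2-db4:14500 -e "ALL STATUS"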
[29 Mar 2006 16:18]
Valeriy Kravchuk
Thank you for the detailed bug report. It looks similar to bug #18043. Please check.
[30 Mar 2006 7:43]
Lars Torstensson
No, it's not the same: this bug leads to a failure of handling the error, as you can see in the cluster log (local0.log). I don't find this in the cluster log in the other case. The other problem with this bug is that the CPU load went up to 99% on 2 of the database-cluster nodes (and still is, 24 h later, on database node 2). If you want I can upload the core file.
Mar 30 09:45:22 nl2-db4 NDB[659]: [MgmSrvr] Node 4: Failure handling of node 5 has not completed in 1390 min. - state = 3
Mar 30 09:45:26 nl2-db4 NDB[659]: [MgmSrvr] Node 6: Failure handling of node 5 has not completed in 1390 min. - state = 3
Mar 30 09:46:16 nl2-db4 NDB[659]: [MgmSrvr] Node 1: Failure handling of node 5 has not completed in 1391 min. - state = 3
[30 Mar 2006 8:22]
Lars Torstensson
You can find the trace file and core files here: http://surf.macallan.nu/log.tar.gz
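If it helps whoever picks this up, a backtrace can be pulled from the uploaded core files with gdb; a minimal sketch, assuming the data node binary is at /usr/sbin/ndbd and the core file is named core.5366 (both names are placeholders for whatever is actually in the archive):
# Print a full backtrace from the core dump against the matching ndbd binary
gdb -batch -ex "bt full" /usr/sbin/ndbd core.5366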
[31 Mar 2006 14:22]
Lars Torstensson
A nice side effect is that it is impossible to start Cluster-Server node 5. It stalls in start phase 1. A total system shutdown is required in order to get the whole cluster up and running. I guess it is caused by this in the cluster log:
Mar 31 14:01:38 nl2-db4 NDB[4251]: [MgmSrvr] Node 4: Failure handling of node 5 has not completed in 3077 min. - state = 3
This is the output from node 5:
2006-03-31 13:47:21 [ndbd] INFO -- NDB Cluster -- DB node 5
2006-03-31 13:47:21 [ndbd] INFO -- Version 5.0.19 --
2006-03-31 13:47:21 [ndbd] INFO -- Configuration fetched at nl2-db4.i3.bredband.com port 14500
2006-03-31 13:47:21 [ndbd] INFO -- WatchDog timer is set to 6000 ms
2006-03-31 13:47:25 [ndbd] INFO -- Start initiated (version 5.0.19)
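For completeness, the total system shutdown mentioned above amounts to something like the following; a rough sketch, assuming the connectstring nl2-db4:14500 from the logs and a config file at /var/lib/mysql-cluster/config.ini (the path is a placeholder):
# Stop all data nodes and the management server (SQL nodes keep running)
ndb_mgm --ndb-connectstring=nl2-db4:14500 -e "SHUTDOWN"
# Bring the management server back up with its configuration
ndb_mgmd -f /var/lib/mysql-cluster/config.ini
# Restart ndbd on each data node host
ndbd --ndb-connectstring=nl2-db4:14500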
[4 Apr 2006 9:04]
Geert Vanderkelen
Hi Lars, This is probably the same problem as found in bug #16772, which has been fixed. I will set this bug to 'Need Feedback' so you can check on your side whether it matches. Best regards, Geert
[4 Apr 2006 14:41]
Lars Torstensson
OK, it's a match.
[5 Apr 2006 7:32]
Valeriy Kravchuk
Duplicate of bug #16772.