Bug #19706 5.0.21 GCP Stop during COPY_GCI
Submitted: 11 May 2006 6:50 Modified: 16 Aug 2006 7:10
Reporter: Adam Dixon (Candidate Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:5.0.21 OS:Linux (RHEL4)
Assigned to: Jonas Oreland CPU Architecture:Any

[11 May 2006 6:50] Adam Dixon
Description:
A node, achieving a heartbeat failure (most likely due to heavy disk IO - rsync of BACKUP). Was forced to shutdown, but also looks to of crashed and created trace files;

Following from ndb_mgmd.log
2006-05-11 04:06:14 [MgmSrvr] INFO     -- Node 18: Local checkpoint 10602 started. Keep GCI = 1532073 oldest restorable GCI = 1532226
2006-05-11 04:12:44 [MgmSrvr] WARNING  -- Node 17: Node 18 missed heartbeat 2
2006-05-11 04:12:46 [MgmSrvr] WARNING  -- Node 17: Node 18 missed heartbeat 3
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 1: Node 18 Connected
2006-05-11 04:12:47 [MgmSrvr] WARNING  -- Node 17: Node 18 missed heartbeat 4
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 17: Node 18 declared dead due to missed heartbeat
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 17: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 11: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 12: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 13: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 14: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 15: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 16: Communication to Node 18 closed
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 17: Arbitration check won - node group majority
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 17: President restarts arbitration thread [state=6]
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 17: GCP Take over started
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 14: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 16: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 12: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 17: LCP Take over started
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 17: GCP Take over completed
2006-05-11 04:12:47 [MgmSrvr] INFO     -- Node 17: LCP Take over completed (state = 5)
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 17: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 11: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 15: Node 18 Disconnected
2006-05-11 04:12:47 [MgmSrvr] ALERT    -- Node 13: Node 18 Disconnected
2006-05-11 04:13:15 [MgmSrvr] INFO     -- Node 17: Communication to Node 18 opened
2006-05-11 04:13:15 [MgmSrvr] INFO     -- Node 12: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO     -- Node 13: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO     -- Node 11: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO     -- Node 14: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO     -- Node 16: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO     -- Node 15: Communication to Node 18 opened
2006-05-11 04:13:16 [MgmSrvr] INFO     -- Node 17: Local checkpoint 10603 started. Keep GCI = 1532271 oldest restorable GCI = 1532379
2006-05-11 04:13:28 [MgmSrvr] ALERT    -- Node 18: Forced node shutdown completed. Initiated by signal 0. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart n
2006-05-11 04:18:41 [MgmSrvr] INFO     -- Node 17: Local checkpoint 10604 started. Keep GCI = 1532397 oldest restorable GCI = 1532553
2006-05-11 04:24:03 [MgmSrvr] INFO     -- Node 17: Local checkpoint 10605 started. Keep GCI = 1532570 oldest restorable GCI = 153272

Time: Thursday 11 May 2006 - 04:13:26
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node
(Internal error, programming error or missing error message, please
report a bug)
Error: 2303
Error data: Node 18 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 197) 0x0000000e
Program: ndbd
Pid: 32168
Trace: /data/mysql-cluster/ndbd/ndb_18_trace.log.4
Version: Version 5.0.21
***EOM***

Missing heartbeats caused the node to fail (or be arbitrarily shutdown) which is fine, Just thought I would provide the data to potentially improve handling.

How to repeat:

The rsync of numerous backups of 1.3gb in size each accross a gigabit ethernet (same interface as cluster traffic) Which inturn caused the heartbeat misfires.
[11 May 2006 7:15] Valeriy Kravchuk
Changed category to a more appropriate one.
[9 Aug 2006 17:02] Jonas Oreland
Hi,

can you retest this bug with patch for http://bugs.mysql.com/bug.php?id=20904
 applied?

/Jonas
[16 Aug 2006 6:05] Adam Dixon
Hi Jonas,
I downloaded and installed from the bk tree, 5.0.25-log, and could not reproduce the problem. I ran up two instances of tiobench on the system on two different partitions. Which only caused delays between checkpoints (disk contention).

To ensure that this did reproduce the issue, I loaded back 5.0.21 and did the same thing, and it only took about 1 minute for it to be Arbitrarily shutdown and dump a trace etc.

Not sure if this would be a satisfactory reproduction, can you advise on what I need to do if this is the case.
[16 Aug 2006 7:10] Jonas Oreland
Hi

Thx a lot for the effort.

I think will qualify fine as "reproduce",
i'll close the bug report.

/Jonas