Bug #16308 | Storage node fails to restart after failure | |
---|---|---|---
Submitted: | 9 Jan 2006 18:32 | Modified: | 11 Apr 2006 19:42
Reporter: | Jon Anderson | Email Updates: |
Status: | Duplicate | Impact on me: |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious)
Version: | 5.0.16 (max) | OS: | Linux (Debian, 2.6.12/14 kernel)
Assigned to: | | CPU Architecture: | Any
[9 Jan 2006 18:32]
Jon Anderson
[17 Jan 2006 20:02]
Jon Anderson
Here is another instance of this issue, this time with 5.0.18. The ndbd process has been in startup phase 5 for just over a day now.

This time, the ndbd process was killed by the kernel (OOM) because of a runaway process run by a developer on January 16, 2006 at about 14:35 EST, and was restarted about 3 minutes later. It is still "starting" now - Jan 17, 14:57 EST. (There is no mysql error associated with the ndbd process termination, because it was not a mysql fault.)

Trying to start ndbd just doesn't work. Running either 'ndbd' or 'ndbd --initial' leads to the same thing:

id=4 @<ip> (Version: 5.0.18, starting, Nodegroup: 0)

The last lines of ndb_4_out.log show no useful information as to why it won't start:

--- ndb_4_out.log snippet ---
2006-01-16 14:35:28 [ndbd] INFO -- Angel pid: 26759 ndb pid: 21916
2006-01-16 14:35:28 [ndbd] INFO -- NDB Cluster -- DB node 4
2006-01-16 14:35:28 [ndbd] INFO -- Version 5.0.18 --
2006-01-16 14:35:28 [ndbd] INFO -- Configuration fetched at 192.168.1.11 port 1186
2006-01-16 14:35:28 [ndbd] INFO -- Start initiated (version 5.0.18)
--- end ndb_4_out.log ---

The last lines of the cluster log also do not seem to show useful information.

--- ndb_2_cluster.log snippet ---
2006-01-16 14:34:53 [MgmSrvr] INFO -- Node 3: Communication to Node 4 opened
2006-01-16 14:35:30 [MgmSrvr] INFO -- Node 2: Node 4 Connected
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 4: Start phase 0 completed
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 4: Communication to Node 3 opened
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 3: Node 4 Connected
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 4: Node 3 Connected
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 6
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 3: Node 4: API version 5.0.18
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 4: Node 3: API version 5.0.18
2006-01-16 14:37:51 [MgmSrvr] INFO -- Node 4: Start phase 1 completed
2006-01-16 14:37:52 [MgmSrvr] INFO -- Node 4: Start phase 2 completed (initial node restart)
2006-01-16 14:37:52 [MgmSrvr] INFO -- Node 4: Receive arbitrator node 1 [ticket=00590005d4b66637]
2006-01-16 14:38:47 [MgmSrvr] INFO -- Node 4: Start phase 3 completed (initial node restart)
2006-01-16 14:38:53 [MgmSrvr] INFO -- Node 4: Start phase 4 completed (initial node restart)
--- end ndb_2_cluster.log ---

I don't know what other information I can provide, because there isn't any "error" of any kind. It just doesn't start. Is there any way I can get more information from the cluster to send to you?
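For reference, this is roughly what I run when trying to bring the node back and watch its progress (node id 4, the log file name and the management host are specific to our setup; the connect string is assumed to already be configured for the NDB tools):

> ndbd --initial                # or plain 'ndbd'; both end up stuck the same way
> ndb_mgm -e "show"             # shows the node as 'starting'
> ndb_mgm -e "4 status"         # per-node status, should report which start phase node 4 is in
> tail -f ndb_4_out.log         # watch the node's own log while it sits there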
[22 Mar 2006 9:52]
Volker Voigt
I encountered a similar problem. I'm running 4.1.18 with 2 storage nodes, 2 SQL nodes and 2 management nodes. The storage nodes are running with 12 GB RAM each on a 64-bit RedHat ES. During my initial tests, killing a storage node and bringing it back was no problem. After filling the cluster with about 9 GB of data, I killed the nodegroup master again. Restarting it with "ndbd" shows the same symptoms described by the other posters (my logfile looks identical). I waited for several hours with my storage nodes idling. I killed it again and tried it with --initial. This time the nodes idled for 25 minutes until the restarted node began catching up. (During both restart tests I didn't induce any select or update load on the cluster from the SQL nodes!)

So I was wondering whether any data was sent between the nodes during the initial idle times - and found something interesting that might be a lead. The management nodes continuously bombard the restarting node with TCP connections, all saying the same:

ndbd
ndbd passwd
OK
31 1

Please find attached to this bug a pcap file with 2 seconds of captures (cleaned of other traffic that was traced). This capture was done during the above-mentioned 25-minute idle time (which luckily turned into a working catch-up).
[22 Mar 2006 10:15]
Volker Voigt
I just found out I'm not allowed to attach a file to this bug, as I'm neither the original poster nor a developer. So below is a plain-text Ethereal dump of the first connection I traced. If anybody is interested in the pcap file, please let me know and I'll send it to you.

While writing this I noticed that EVERY answer packet from the storage node containing the "ok" has a wrong TCP checksum! Jon, could you please tcpdump such an event and check whether you also see wrong TCP checksums? I know that the ndbd process is not responsible for wrong TCP checksums, but I find it worth mentioning that exactly one type of payload gets corrupted every time.

No. Time Source Destination Protocol Info
1 0.000000 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [SYN] Seq=0 Ack=0 Win=5840 Len=0 MSS=1460 TSV=136845986 TSER=0 WS=2
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 3c ac ae 40 00 40 06 8b d5 c0 a8 c0 64 c0 a8   .<..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 0c 00 00 00 00 a0 02   ................
0030  16 d0 e0 bc 00 00 02 04 05 b4 04 02 08 0a 08 28   ...............(
0040  1a a2 00 00 00 00 01 03 03 02                     ..........

No. Time Source Destination Protocol Info
2 0.000016 192.168.192.130 192.168.192.100 TCP 2202 > 33044 [SYN, ACK] Seq=0 Ack=1 Win=5792 Len=0 MSS=1460 TSV=419220897 TSER=136845986 WS=2
0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 3c 00 00 40 00 40 06 38 84 c0 a8 c0 82 c0 a8   .<..@.@.8.......
0020  c0 64 08 9a 81 14 62 e2 0c bd eb ba b6 0d a0 12   .d....b.........
0030  16 a0 8a 9e 00 00 02 04 05 b4 04 02 08 0a 18 fc   ................
0040  cd a1 08 28 1a a2 01 03 03 02                     ...(......

No. Time Source Destination Protocol Info
3 0.000501 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [ACK] Seq=1 Ack=1 Win=5840 Len=0 TSV=136845987 TSER=419220897
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 34 ac af 40 00 40 06 8b dc c0 a8 c0 64 c0 a8   .4..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 0d 62 e2 0c be 80 10   ..........b.....
0030  05 b4 ca 50 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...P.......(....
0040  cd a1                                             ..

No. Time Source Destination Protocol Info
4 0.000517 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [PSH, ACK] Seq=1 Ack=1 Win=5840 Len=5 TSV=136845987 TSER=419220897
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 39 ac b0 40 00 40 06 8b d6 c0 a8 c0 64 c0 a8   .9..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 0d 62 e2 0c be 80 18   ..........b.....
0030  05 b4 ef 7a 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...z.......(....
0040  cd a1 6e 64 62 64 0a                              ..ndbd.

No. Time Source Destination Protocol Info
5 0.000526 192.168.192.130 192.168.192.100 TCP 2202 > 33044 [ACK] Seq=1 Ack=6 Win=5792 Len=0 TSV=419220898 TSER=136845987
0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 34 c1 1c 40 00 40 06 77 6f c0 a8 c0 82 c0 a8   .4..@.@.wo......
0020  c0 64 08 9a 81 14 62 e2 0c be eb ba b6 12 80 10   .d....b.........
0030  05 a8 ca 56 00 00 01 01 08 0a 18 fc cd a2 08 28   ...V...........(
0040  1a a3                                             ..

No. Time Source Destination Protocol Info
6 0.001001 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [PSH, ACK] Seq=6 Ack=1 Win=5840 Len=12 TSV=136845987 TSER=419220898
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 40 ac b1 40 00 40 06 8b ce c0 a8 c0 64 c0 a8   .@..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 12 62 e2 0c be 80 18   ..........b.....
0030  05 b4 a0 08 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...........(....
0040  cd a2 6e 64 62 64 20 70 61 73 73 77 64 0a         ..ndbd passwd.

No. Time Source Destination Protocol Info
7 0.001006 192.168.192.130 192.168.192.100 TCP 2202 > 33044 [ACK] Seq=1 Ack=18 Win=5792 Len=0 TSV=419220898 TSER=136845987
0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 34 c1 1e 40 00 40 06 77 6d c0 a8 c0 82 c0 a8   .4..@.@.wm......
0020  c0 64 08 9a 81 14 62 e2 0c be eb ba b6 1e 80 10   .d....b.........
0030  05 a8 ca 4a 00 00 01 01 08 0a 18 fc cd a2 08 28   ...J...........(
0040  1a a3                                             ..

No. Time Source Destination Protocol Info
8 0.001062 192.168.192.130 192.168.192.100 TCP 2202 > 33044 [PSH, ACK] Seq=1 Ack=18 Win=5792 [TCP CHECKSUM INCORRECT] Len=3 TSV=419220898 TSER=136845987
0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 37 c1 20 40 00 40 06 77 68 c0 a8 c0 82 c0 a8   .7. @.@.wh......
0020  c0 64 08 9a 81 14 62 e2 0c be eb ba b6 1e 80 18   .d....b.........
0030  05 a8 02 62 00 00 01 01 08 0a 18 fc cd a2 08 28   ...b...........(
0040  1a a3 6f 6b 0a                                    ..ok.

No. Time Source Destination Protocol Info
9 0.001502 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [ACK] Seq=18 Ack=4 Win=5840 Len=0 TSV=136845987 TSER=419220898
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 34 ac b2 40 00 40 06 8b d9 c0 a8 c0 64 c0 a8   .4..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 1e 62 e2 0c c1 80 10   ..........b.....
0030  05 b4 ca 3b 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...;.......(....
0040  cd a2                                             ..

No. Time Source Destination Protocol Info
10 0.001507 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [PSH, ACK] Seq=18 Ack=4 Win=5840 Len=5 TSV=136845987 TSER=419220898
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 39 ac b3 40 00 40 06 8b d3 c0 a8 c0 64 c0 a8   .9..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 1e 62 e2 0c c1 80 18   ..........b.....
0030  05 b4 6c cc 00 00 01 01 08 0a 08 28 1a a3 18 fc   ..l........(....
0040  cd a2 33 31 20 31 0a                              ..31 1.

No. Time Source Destination Protocol Info
11 0.001544 192.168.192.130 192.168.192.100 TCP 2202 > 33044 [FIN, ACK] Seq=4 Ack=23 Win=5792 Len=0 TSV=419220899 TSER=136845987
0000  00 04 23 c6 f3 1c 00 04 23 c6 f4 0e 08 00 45 00   ..#.....#.....E.
0010  00 34 c1 22 40 00 40 06 77 69 c0 a8 c0 82 c0 a8   .4."@.@.wi......
0020  c0 64 08 9a 81 14 62 e2 0c c1 eb ba b6 23 80 11   .d....b......#..
0030  05 a8 ca 40 00 00 01 01 08 0a 18 fc cd a3 08 28   ...@...........(
0040  1a a3                                             ..

No. Time Source Destination Protocol Info
12 0.002001 192.168.192.100 192.168.192.130 TCP 33044 > 2202 [FIN, ACK] Seq=23 Ack=5 Win=5840 Len=0 TSV=136845987 TSER=419220899
0000  00 04 23 c6 f4 0e 00 04 23 c6 f3 1c 08 00 45 00   ..#.....#.....E.
0010  00 34 ac b4 40 00 40 06 8b d7 c0 a8 c0 64 c0 a8   .4..@.@......d..
0020  c0 82 81 14 08 9a eb ba b6 23 62 e2 0c c2 80 11   .........#b.....
0030  05 b4 ca 33 00 00 01 01 08 0a 08 28 1a a3 18 fc   ...3.......(....
0040  cd a3                                             ..
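In case it helps, the capture was made with something like the following (eth0 and the file name are specific to my machine; port 2202 is the port the management nodes keep connecting to, as seen above). Note that if the capture is taken on the host that sends a packet, TCP checksum offloading in the NIC can make outgoing checksums look wrong in the trace, so a capture taken on the receiving side or a third machine is more conclusive:

> tcpdump -i eth0 -s 0 -w ndb-restart.pcap port 2202    # full packets, saved for later analysis in Ethereal
> tcpdump -i eth0 -s 0 -X port 2202                     # or print hex/ASCII payloads directly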
[28 Mar 2006 20:46]
David Dawe
I have encountered the same problem with 5.0.19 (mysql-max-5.0.19-solaris10-sparc-64bit.tar.gz). I have a small test configuration: 2 machines with a storage node and SQL node each, and a third machine running a management node. I am testing recovery from a data node failure. I induce the data node to fail by sending it either a SIGKILL or SIGSEGV, typically while performing insertion operations at a rate of around 15 per second. As with the other postings, the data node sometimes fails to complete its startup, hanging in Phase 5. During the recovery time, there is no database activity. While it is hard to establish a firm correlation, it seems that this problem is more likely to occur if the data node was killed after it had been running for a while (say, an hour or more). The only way that I have been able to work around the problem is to restart the whole cluster.

The normal startup time for the data node in my test environment is typically between 1.5 and 3 minutes (the database contains about 10000 records). As a result, I normally wait only about 5 minutes before I consider the node startup to be hung, and restart the whole cluster. However, I have seen cases (when I was patient enough) where the node had remained hung overnight.

I agree with the original posting that this bug may be related to http://bugs.mysql.com/bug.php?id=14408, even though I have been performing only about 15 insertions per second (though each insertion is of 100 records) when the data node is induced to fail. As well, ndbd --initial does not help me.
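For reference, a rough sketch of what I do to induce the failure and watch the restart (the pid is looked up by hand; SIGKILL and SIGSEGV give me the same result):

> kill -KILL <ndbd-pid>          # or: kill -SEGV <ndbd-pid>
> ndbd                           # restart the data node on the same machine
> ndb_mgm -e "all status"        # should report which start phase each node has reached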
[8 Apr 2006 11:39]
Jonas Oreland
Since we have been able to reproduce this, could you try a newer 5.0 version? There have been a few bug fixes that could match your problem.
[8 Apr 2006 14:58]
David Dawe
More recent than the 5.0.19 that I am using? It appears to me that 5.0.19 is the most recent version. Or do you mean more recent than the 5.0.18 that Jon Anderson was using?
[8 Apr 2006 15:58]
Jonas Oreland
Sorry about that... I was referring to more recent than the 5.0.16 that is mentioned in the header of the bug...
[8 Apr 2006 16:04]
Jonas Oreland
Hi, can you check whether LocalCheckpoint seems to be running? The simplest way of doing this is:

> ndb_mgm -e "all dump 7099"

Then you should see "Local Checkpoint XXX" soon afterwards in the cluster log.
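For example, to confirm that the message actually appears (the cluster log file name depends on the management node's id and its DataDir; ndb_1_cluster.log is only an example):

> ndb_mgm -e "all dump 7099"
> grep -i "local checkpoint" ndb_1_cluster.log | tail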
[11 Apr 2006 13:56]
David Dawe
The last time that I had conclusively seen the problem was Mar 29. My test environment would have changed only slightly since then; however, I have not been able to reproduce the problem in several attempts over the past day. I am concerned that this problem continues to exist ... the worst problems are those that cannot be reproduced.

In the meantime, even though I could not reproduce the problem, I did run:

> ndb_mgm -e "all dump 7099"

Looking in the log, I did see "Node x: Local checkpoint xxx started. ..." for one of the nodes, but not the other. Ahh ... okay ... it seems only the node acting as the master performs the checkpoint.

I will let you know when I see the problem again.
[11 Apr 2006 14:00]
Jonas Oreland
>The last time that I had conclusively seen the problem was Mar 29. My test
>environment would have changed only slightly since then; however, I have not
>been able to reproduce the problem in several attempts over the past day.
>I am concerned that this problem continues to exist ... the worst problems are
>those that cannot be reproduced.

agree :-(

>In the meantime, even though I could not reproduce the problem, I did run:
>> ndb_mgm -e "all dump 7099"
>Looking in the log, I did see "Node x: Local checkpoint xxx started. ..." for
>one of the nodes, but not the other. Ahh ... okay ... it seems only the node
>acting as the master performs the checkpoint.

All nodes participate in the checkpoint, but only the master makes this printout.

>I will let you know when I see the problem again.

ok, thx

/Jonas
[11 Apr 2006 16:08]
Jon Anderson
I keep our cluster nodes up to date, so we're running 5.0.19 across the board, and I'm downloading 5.0.20 as I write this. (It's not on the mirror page, but there are binaries for it, which I will install on our test cluster now, and on the production cluster in a few days if all goes well.) We have not seen this bug since version 5.0.19 was installed (and maybe even earlier - I don't remember). I haven't commented again so far because I haven't seen the error recently. I don't consider this bug to be a problem anymore simply because it hasn't occurred in months, whereas it used to occur often. We had a critical failure last night involving data loss (which I'm about to file a bug about); one node had to be restarted, but it started back up normally.
[11 Apr 2006 18:08]
Jon Anderson
I spoke too soon. Currently on our development cluster:

ndb_mgm> show
Connected to Management Server at: 192.168.1.14:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.1.14  (Version: 5.0.19, Nodegroup: 0, Master)
id=3    @192.168.1.15  (Version: 5.0.19, starting, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.1.14  (Version: 5.0.19)

[mysqld(API)]   2 node(s)
id=4    @192.168.1.14  (Version: 5.0.19)
id=5    @192.168.1.15  (Version: 5.0.19)

This time, this node is stuck in phase 1. I'm going to install 5.0.20 now. I'll keep you posted as to what I find. If there's any other piece of information you'd like from me, please let me know.
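(For completeness, the way I check how far the starting node has got is the per-node status report; node id 3 here is specific to our cluster, and the output should include the start phase it is in:)

> ndb_mgm -e "3 status"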
[11 Apr 2006 19:04]
Jonas Oreland
This could possibly be bug #18612, which I very recently fixed (5.0.21).

Anyway, please upload the cluster log when you get hanging restarts. Also, if you have made any changes in your config.ini wrt Start*Timeout, please upload that too.

/Jonas
[11 Apr 2006 19:31]
Jon Anderson
I didn't attach the cluster log, because there isn't much to see in it. The node connects, goes through phase 0 and 1, then no further messages from that node. This doesn't appear to happen very often. I will be upgrading to 5.0.21 as soon as it's released. Thanks again. :-)

-- ndb_1_cluster.log, partial --
...
2006-04-07 11:08:20 [MgmSrvr] INFO -- Node 2: Start phase 9 completed (system restart)
2006-04-07 11:08:20 [MgmSrvr] INFO -- Node 2: Started (version 5.0.19)
2006-04-07 11:08:20 [MgmSrvr] INFO -- Node 2: Node 4 Connected
2006-04-07 11:08:20 [MgmSrvr] INFO -- Node 2: Node 5 Connected
2006-04-07 11:08:20 [MgmSrvr] INFO -- Node 2: Node 4: API version 5.0.19
2006-04-07 11:08:20 [MgmSrvr] INFO -- Node 2: Node 5: API version 5.0.19
2006-04-07 11:08:21 [MgmSrvr] INFO -- Node 2: Node 1: API version 5.0.19
2006-04-07 11:08:22 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=042c000174e4b37d]
2006-04-07 11:30:32 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 reserved for ip 192.168.1.15, m_reserved_nodes 000000000000003a.
2006-04-07 11:30:32 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2006-04-07 11:30:33 [MgmSrvr] INFO -- Mgmt server state: nodeid 3 freed, m_reserved_nodes 0000000000000032.
2006-04-07 11:31:59 [MgmSrvr] INFO -- Node 3: Start phase 0 completed
2006-04-07 11:31:59 [MgmSrvr] INFO -- Node 3: Communication to Node 2 opened
2006-04-07 11:32:29 [MgmSrvr] INFO -- Node 3: Start phase 1 completed
2006-04-07 12:05:47 [MgmSrvr] INFO -- Node 2: Local checkpoint 1828 started. Keep GCI = 2782960 oldest restorable GCI = 2782960
2006-04-07 13:23:27 [MgmSrvr] INFO -- Node 2: Local checkpoint 1829 started. Keep GCI = 2784489 oldest restorable GCI = 2784685
2006-04-07 14:41:04 [MgmSrvr] INFO -- Node 2: Local checkpoint 1830 started. Keep GCI = 2786503 oldest restorable GCI = 2784685
...
[11 Apr 2006 19:42]
Jonas Oreland
I actually found the cluster log _very_ informative. The lack of connection establishment before start phase 1 is complete makes me almost 100% sure that this is bug #18612. I made significant changes in 5.0.21 to prevent this from happening, so I suggest changing to this version.

I also suggest adding the following line to your config.ini, which will decrease the likelihood of this happening:

StartPartialTimeout: 60000

This will force the node to wait 60s before starting on its own (look in the documentation). (In 5.0.21 this value has changed to 90s, and we're considering changing to ~s, as we have also introduced some new options for forcing partial starts.)

I'm also closing this as a duplicate; please reopen it if you encounter this again.

/Jonas
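PS. For clarity, this is roughly where the line goes in config.ini, assuming you keep your data-node parameters in an [NDBD DEFAULT] section and use the usual name=value form (the value is in milliseconds):

[NDBD DEFAULT]
# Wait up to 60 s for the other data nodes before attempting to start with a partial cluster
StartPartialTimeout=60000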
[12 Apr 2006 13:22]
David Dawe
When I had the problem, the storage node would hang in start phase 5 (exclusively, I believe). Does the fix apply to this case?