Bug #62857 GCP_SAVEREQ node crash
Submitted: 21 Oct 2011 9:27
Modified: 21 Nov 2011 10:41
Reporter: Eugene Zheganin
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 7.1.15a
OS: Linux (Linux ip-10-44-115-59 2.6.35.14-95.38.amzn1.x86_64 #1 SMP Thu Aug 25 17:11:23 UTC 2011 x86_64 x86_64)
Assigned to: Assigned Account
CPU Architecture: Any

[21 Oct 2011 9:27] Eugene Zheganin
Description:
I am testing a cluster configuration on Amazon EC2 servers.
While restoring a 15 GB database, one node crashed.

Partial info (full logs and configs attached):

2011-10-21 08:56:34 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=100
2011-10-21 08:56:34 [ndbd] INFO     -- Watchdog: User time: 83966  System time: 27479
2011-10-21 08:56:34 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Job Handling elapsed=200
2011-10-21 08:56:34 [ndbd] INFO     -- Watchdog: User time: 83966  System time: 27479
2011-10-21 08:56:34 [ndbd] INFO     -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to failed handling of GCP_SAVEREQ
2011-10-21 08:56:34 [ndbd] INFO     -- DBLQH (Line: 22788) 0x00000002
2011-10-21 08:56:34 [ndbd] INFO     -- Error handler shutting down system
2011-10-21 08:56:34 [ndbd] INFO     -- Error handler shutdown completed - exiting
2011-10-21 08:56:36 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
I don't know yet whether it is repeatable.
[21 Oct 2011 9:36] Eugene Zheganin
configs, traces and stuff

Attachment: configs-and-stuff.tar.gz (application/x-gzip, text), 81.93 KiB.

[21 Oct 2011 10:41] Jonas Oreland
Hi Eugene

I examined the files you uploaded,
and I am fairly (read: very) sure that the problem
stems from the "RedoOverCommitCounter=0"
which I find in your config.ini.

I'm not sure why you put it there, but it effectively
disables the cluster's internal disk overload handling,
which ultimately makes the cluster fail, as
other parts of the system think that disk writing is too slow.

My guess as to why you added it is that you experienced aborted
transactions referring to slow disk I/O.
But I suggest that you remove this setting and instead add
[mysqld default]
DefaultOperationRedoProblemAction=queue

(see http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-api-definition.html#ndbparam-api-defa...)

This will make transactions be delayed (queued) instead of aborted
if disk writing is slow.
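
As a rough sketch of the suggested change, the relevant part of config.ini might end up looking like the fragment below. The section placement is an assumption: the uploaded config.ini is not reproduced in this report, and RedoOverCommitCounter is shown under [ndbd default] only because it is a data node parameter.

```ini
[ndbd default]
# Assumption: this is the section where RedoOverCommitCounter=0 was set.
# Remove the line (or comment it out, as here) so that the built-in
# disk overload handling is active again.
# RedoOverCommitCounter=0

[mysqld default]
# Queue operations instead of aborting them when redo log writing
# to disk falls behind.
DefaultOperationRedoProblemAction=queue
```

As with other config.ini changes, the management server has to re-read the file and the nodes need to be restarted before the new values take effect.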

/Jonas
[22 Nov 2011 7:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".