MySQL Bugs: #18841: Multiple MySQL Cluster NDB engines die in Disk based cluster

Bug #18841	Multiple MySQL Cluster NDB engines die in Disk based cluster
Submitted:	6 Apr 2006 9:55	Modified:	29 Jun 2006 18:48
Reporter:	Kris Buytaert (Candidate Quality Contributor)
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.1.6 alpha	OS:	Linux (Linux)
Assigned to:		CPU Architecture:	Any

Description:
I've set up a disk-based NDB cluster as documented at http://mikaelronstrom.blogspot.com/2006/02/how-to-define-table-that-uses-disk.html

Every couple of days, both ndb nodes in my cluster die with:

ACCESS1-DB-B:/var/lib/mysql/mysql-cluster # more ndb_3_error.log
Current byte-offset of file-pointer is: 568

Time: Wednesday 5 April 2006 - 20:13:46
Status: Temporary error, restart node
Message: Arbitrator shutdown, please investigate error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 3826) 0x0000000e
Program: ndbd
Pid: 2856
Trace: /var/lib/mysql/mysql-cluster//ndb_3_trace.log.1
Version: Version 5.1.6 (alpha)
***EOM***

In this case that was about 2 hours after I last touched the cluster.
Both nodes died within 2-3 minutes of each other.

I've already repeated this setup twice and the only thing that differs is the time it takes for the cluster to crash.

I'm attaching the trace file of one of the crashing ndbd processes.

How to repeat:
ndbd --initial
create tablespaces and tables as described in Mikael's blog entry
create some entries
wait
wait
crash

Trace of an ndbd crash

Attachment: ndb_3_trace.log.1.gz (application/x-gzip, text), 41.11 KiB.

Hi,

1) Can you upload *trace*error* files, cluster log & config.ini?
2) Many bugs have been fixed since 5.1.6.
3) Reading the trace file that you uploaded, I can find nothing actually related to disk-based storage.
    (but it's hard to tell with only one of the trace files)

Thx
/jonas

2nd log file

Attachment: ndb_2_trace.log.1.gz (application/x-gzip, text), 42.75 KiB.

config.ini

Attachment: config.ini (application/octet-stream, text), 1.07 KiB.

Cluster log

Attachment: ndb_1_cluster.log.gz (application/x-gzip, text), 42.60 KiB.

2nd trace file uploaded.
The full error log of one node was already pasted in the original submission; the 2nd one is essentially identical:
ACCESS1-DB-A:/var/lib/mysql/mysql-cluster # more ndb_2_error.log
Current byte-offset of file-pointer is: 568

Time: Wednesday 5 April 2006 - 20:12:36
Status: Temporary error, restart node
Message: Arbitrator shutdown, please investigate error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 3826) 0x0000000e
Program: ndbd
Pid: 2436
Trace: /var/lib/mysql/mysql-cluster//ndb_2_trace.log.1
Version: Version 5.1.6 (alpha)
***EOM***

Other files have been uploaded.
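
For anyone comparing the two error logs pasted in this report, here is a minimal Python sketch that reads the `Key: value` fields of each entry and computes how far apart the two shutdowns were. It assumes the field layout shown in the pasted logs and English day/month names; the helper names (`parse_error_log`, `crash_time`, `crash_gap_seconds`) are made up for this illustration and are not part of any MySQL tool.

```python
from datetime import datetime

# Timestamp layout seen in the pasted logs, e.g. "Wednesday 5 April 2006 - 20:13:46".
# Assumption: English day/month names (the default C locale).
TIME_FORMAT = "%A %d %B %Y - %H:%M:%S"


def parse_error_log(text):
    """Return the 'Key: value' fields of the first entry in an ndb error log."""
    fields = {}
    for raw in text.splitlines():
        line = raw.strip()
        if line == "***EOM***":  # end-of-message marker closes the entry
            break
        key, sep, value = line.partition(":")  # split on the first colon only
        if sep:
            fields.setdefault(key.strip(), value.strip())
    return fields


def crash_time(fields):
    """Convert the 'Time' field of a parsed entry into a datetime."""
    return datetime.strptime(fields["Time"], TIME_FORMAT)


def crash_gap_seconds(log_a, log_b):
    """Absolute difference, in seconds, between the two shutdown times."""
    a = crash_time(parse_error_log(log_a))
    b = crash_time(parse_error_log(log_b))
    return abs((a - b).total_seconds())


# Excerpts copied from the two entries in this report.
NODE3_LOG = """Time: Wednesday 5 April 2006 - 20:13:46
Status: Temporary error, restart node
Error: 2305
Error object: QMGR (Line: 3826) 0x0000000e
***EOM***"""

NODE2_LOG = """Time: Wednesday 5 April 2006 - 20:12:36
Status: Temporary error, restart node
Error: 2305
Error object: QMGR (Line: 3826) 0x0000000e
***EOM***"""

print(crash_gap_seconds(NODE3_LOG, NODE2_LOG))  # 70.0
```

For the two entries above, the nodes shut down 70 seconds apart with the same error code (2305), which matches the "both nodes die within a few minutes" observation.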

I'll try upgrading to a more recent version. The reason I suspect it is related to the disk-based setup is that the tablespace creation etc. is the only change I made before the crashes started happening.

Hi,

I cannot find any explanation other than a network failure of some kind.
Both nodes conclude that the other has died, seemingly without any obvious reason.

Do you still get it?
Have you tried a newer version?
Can you try to capture an SQL sequence that triggers the problem?

/Jonas

Upgrades are on my to-do list, but I haven't had time yet.

The cluster is completely idle, so there are no SQL statements running that could be captured.

I'll get back as soon as I have my test platforms upgraded to a more recent distribution.

Please try a newer version, 5.1.9 (or 5.1.10, to be released soon), in your further tests.

I've updated to 5.1.9 in the meantime. My cluster has been up for 5 days now, so it seems the problem has been fixed in the more recent versions.

I'll continue running tests to see if it stays stable :)

Please reopen this bug report if you see a similar problem with 5.1.9.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".