Bug #21172 Cluster will not restart, configuration error
Submitted: 20 Jul 2006 3:47
Modified: 1 Sep 2006 13:29
Reporter: Jason Downing
Status: Closed
Category: MySQL Cluster: Disk Data
Severity: S2 (Serious)
Version: 5.1.11
OS: Linux (Debian Sarge 3.1)
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: cluster will not restart, configuration error, Forced node shutdown completed

[20 Jul 2006 3:47] Jason Downing
Description:
Cluster will start with --initial, but will not restart normally. If there is no tablespace added (and therefore no disk data tables) the problem goes away.

Here is my configuration:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=250M
IndexMemory=50M
MaxNoOfAttributes=3000
MaxNoOfConcurrentOperations=1000000
StartFailureTimeout=1000000
StartPartialTimeout=200000
LogLevelStartup=15

[NDB_MGMD]
hostname=192.168.0.17
datadir=/var/lib/mysql-cluster
Id=1

[NDBD]
hostname=192.168.0.10
datadir=/usr/local/mysql/data
LogLevelStartup=15
Id=2

[NDBD]
hostname=192.168.0.11
datadir=/usr/local/mysql/data
LogLevelStartup=15
Id=3

[MYSQLD]
hostname=192.168.0.13
Id=4

[MYSQLD]
hostname=192.168.0.14
Id=5

[MYSQLD]
hostname=192.168.0.15
Id=6

Then on each data node:

[MYSQL_CLUSTER]
ndb-connectstring=192.168.0.17

And each SQL node:

[MYSQLD]
ndbcluster
ndb-connectstring=192.168.0.17
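(For completeness, the start sequence for a setup like this would look roughly as follows; the binary names are standard MySQL Cluster tools, but the config file path and invocation details are assumptions, not taken from this report:)

```shell
# On the management node (192.168.0.17):
ndb_mgmd -f /var/lib/mysql-cluster/config.ini

# On each data node (192.168.0.10 and 192.168.0.11);
# --initial only for the very first start:
ndbd --initial

# On each SQL node (192.168.0.13 through 192.168.0.15):
mysqld_safe &

# Check cluster status from any host that can reach the management node:
ndb_mgm -e show
```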

I can start the cluster with --initial, and here is the ndb_mgm output:

ndb_mgm> Node 2: Started (version 5.1.11)
Node 3: Started (version 5.1.11)
show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.0.10  (Version: 5.1.11, Nodegroup: 0, Master)
id=3    @192.168.0.11  (Version: 5.1.11, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @192.168.0.17  (Version: 5.1.11)

[mysqld(API)]   3 node(s)
id=4    @192.168.0.13  (Version: 5.1.11)
id=5    @192.168.0.14  (Version: 5.1.11)
id=6    @192.168.0.15  (Version: 5.1.11)

Next I can add a logfile group, undo file, tablespace and data file with the following two commands (this takes about 4 minutes):

CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.dat'
    INITIAL_SIZE 100M
    UNDO_BUFFER_SIZE 2M
    ENGINE NDB;

CREATE TABLESPACE ts_1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 10000M
    ENGINE NDB;
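(For reference, a table would place its non-indexed columns in this tablespace roughly as follows. This is a hypothetical illustration of the 5.1 disk data syntax, not part of the report — note below that no tables were actually created before the restart:)

```sql
-- Hypothetical example; indexed columns still live in DataMemory,
-- only the remaining columns are stored on disk in ts_1.
CREATE TABLE t1 (
    id INT NOT NULL PRIMARY KEY,
    payload VARCHAR(255)
)
TABLESPACE ts_1 STORAGE DISK
ENGINE NDB;
```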

After this, if I shut the cluster down with the shutdown command in ndb_mgm and then restart without --initial, this is what happens in ndb_mgm:

ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]     2 node(s)
id=2    @192.168.0.10  (Version: 5.1.11, starting, Nodegroup: 0, Master)
id=3    @192.168.0.11  (Version: 5.1.11, starting, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1   (Version: 5.1.11)

[mysqld(API)]   3 node(s)
id=4 (not connected, accepting connect from 192.168.0.13)
id=5 (not connected, accepting connect from 192.168.0.14)
id=6 (not connected, accepting connect from 192.168.0.15)

ndb_mgm> Node 2: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2812: 'Invalid parameter for file(Configuration error). Permanent error, external action needed'.
Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

How to repeat:
Do what I just described. I imagine it could be done without three SQL nodes and the effect would be the same. Please note that I have not loaded a data set into the database or created any databases or tables.

I am happy to provide whatever is required to resolve this problem. I really want to use the disk-based cluster, but at this stage this problem prevents it.
[20 Jul 2006 23:28] Jason Downing
Complete mysql-cluster directory

Attachment: mysql-cluster.zip (application/zip, text), 122.32 KiB.

[20 Jul 2006 23:29] Jason Downing
TraceLogs node 2

Attachment: data directory node 2.zip (application/zip, text), 180.49 KiB.

[20 Jul 2006 23:30] Jason Downing
TraceLogs node 3

Attachment: data directory node 3.zip (application/zip, text), 179.13 KiB.

[2 Aug 2006 14:56] Jonas Oreland
Hi,

I just retested this without hitting the problem...

I have an AMD64 running linux-2.6.12

What hw/os do you have ?

(I ask because you have a 10G datafile, and I'm not sure if
 that is supported on a 32-bit OS... actually I doubt it very much...)

/Jonas
[2 Aug 2006 23:29] Jason Downing
The hardware on all nodes is the same, AMD Sempron 2400+ 32 bit. OS is Debian Linux 2.4.27-2-386, the standard kernel that ships with Debian Sarge.

With a bit of notice (2 days) I can set up my machines with 5.1 again (we have changed back to 5.0 because of constant crashing, which has been reported by someone else) and give you access. You will be able to reproduce the problem using my machines.

If 10 GB is not supported for my o/s or hardware, what can I have with my hardware? Do I need a 2.6 kernel?
[3 Aug 2006 7:11] Jason Downing
Could you tell me which linux distribution you are using? I would like to run a test with the same distro as you are using. Thanks
[3 Aug 2006 7:31] Jonas Oreland
Hi,

I'm using gentoo
But as I said, I have a 64-bit machine with a 64-bit kernel...
I will retest your testcase on a 32-bit machine
  (both 2.4 and 2.6 kernel) and see how it goes...

/Jonas
[3 Aug 2006 23:07] Jason Downing
I am currently setting up a new cluster in 5.1 to test the same thing using a smaller data file. I suspect I have already tried this, but I can't really remember. I will use the same o/s setup that I had before. I will advise the results.
[4 Aug 2006 11:33] Jonas Oreland
Ok
looking forward to results...
I have a hard time finding a machine with 2.4.X with that much diskspace :-(

Another maybe interesting question is what kind of filesystem you use
  ext2/ext3/reiserfs etc...

/Jonas
[7 Aug 2006 7:11] Jason Downing
The result of the test is that it fails again. I used a 200 MB data file this time. I have this test on a dedicated setup, so you can connect with ssh and see it for yourself. The file system is ext3. I'm having some problems with our router and I haven't got access up and running for you yet; I will get back to you when this is resolved. These were the two SQL commands used to set up the data files:

CREATE LOGFILE GROUP lg_1
    ADD UNDOFILE 'undo_1.dat'
    INITIAL_SIZE 10M
    UNDO_BUFFER_SIZE 2M
    ENGINE NDB;

CREATE TABLESPACE ts_1
    ADD DATAFILE 'data_1.dat'
    USE LOGFILE GROUP lg_1
    INITIAL_SIZE 200M
    ENGINE NDB;

Then I stopped the cluster (I had not added any databases or tables or data), and tried to restart without --initial. Same problem restarting.
[7 Aug 2006 7:15] Jonas Oreland
Hi,

thx very much for the help.

waiting for ssh stuff

/Jonas
[8 Aug 2006 8:39] Jonas Oreland
Hi,

thx for ssh login and such.

Now I know what the problem is...
  (which I also would have found by looking closer in ndb_x_out.log)

The problem will most certainly go away if you upgrade to a 2.6 kernel.
But I will of course fix it...

Do you download the binary release, or do you compile from source?
If you compile from source, I can send you a patch...

/Jonas
[8 Aug 2006 23:08] Jason Downing
I use the binaries. I think I will wait for the next release for it to be fixed. Also, I will look into upgrading the kernel. Thanks for your help with the problem, Jason
[10 Aug 2006 14:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10270

ChangeSet@1.2265, 2006-08-10 16:12:54+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#21172
    Handle also open && !OM_INIT wrt non function O_DIRECT
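
(The commit message points at O_DIRECT handling during file open. As a rough illustration — my own sketch, not the actual NDB file-system code — the usual defensive pattern on platforms where O_DIRECT does not work, such as 2.4-era kernels or certain filesystems, is to retry the open without the flag instead of treating the failure as a fatal configuration error:)

```c
/* Sketch only: tolerate platforms where O_DIRECT is unsupported by
   retrying the open without the flag, rather than failing outright. */
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int open_maybe_direct(const char *path, int flags, mode_t mode)
{
#ifdef O_DIRECT
    int fd = open(path, flags | O_DIRECT, mode);
    if (fd >= 0 || errno != EINVAL)
        return fd;
    /* O_DIRECT rejected (e.g. old kernel or filesystem): fall back. */
#endif
    return open(path, flags, mode);
}
```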
[15 Aug 2006 5:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10408

ChangeSet@1.2270, 2006-08-15 07:52:27+02:00, jonas@perch.ndb.mysql.com +2 -0
  ndb - bug#21172
    Fix build failure if O_DIRECT is not defined
    Fix stack overflow by making odirect_readbuf global
    Remove soem old debug variables
[1 Sep 2006 8:11] Jonas Oreland
pushed to 5.1.12
[1 Sep 2006 13:29] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 5.1.12 changelog. Changed category to Cluster Disk Data.