Bug #8700: Dump problem with NDB Cluster
Submitted: 22 Feb 2005 17:20    Modified: 25 Mar 2005 9:40
Reporter: Noor Ali Jindani
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 4.1.10    OS: Solaris (Solaris 10)
Assigned to: Assigned Account    CPU Architecture: Any

[22 Feb 2005 17:20] Noor Ali Jindani
Description:
I have been trying to import a dump into the NDB cluster for the last two weeks, but every time something strange happens, after which my cluster fails to start. The only remedy is to restart the storage nodes with the --initial parameter, which drops all of my tables.
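
A minimal sketch of that recovery sequence (assuming a stock 4.1 installation; the addresses match the config below):

# on the management node (192.168.10.35)
ndb_mgmd -f /var/lib/mysql-cluster/config.ini

# on each storage node (192.168.10.1 and 192.168.10.2);
# --initial wipes the node's on-disk file system, which is why
# all NDB tables are gone after this kind of restart
ndbd --initial --ndb-connectstring=192.168.10.35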

ERROR 1297 (HY000): Got Temporary error 4010 'Node failure caused abort of transaction' from ndb cluster
 
After the above error everything just goes down, and the following error keeps coming:
ERROR 1015 (HY000): Can't lock file (errorno: 4009)
which I think is related to the storage nodes going down.
 
The architecture of the cluster is two servers, each running the mysqld and ndbd processes, and one server acting as the management node.
Before that we also tried setting NoOfReplicas=1 and started importing the dump, but the same errors followed, after which we tried the whole thing again with NoOfReplicas=2, with no progress.
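
For context, the import itself is an ordinary client-side restore of the dump; a hypothetical invocation (the actual database and file names are not given in the report):

# "mydb" and "dump.sql" are placeholders for the real names
mysql -h 192.168.10.1 -u root -p mydb < dump.sql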
 
The configuration is as follows:
config.ini
 
[NDBD DEFAULT]
NoOfReplicas=2
IndexMemory=756M
DataMemory=3072M
MaxNoOfAttributes=10000
MaxNoOfOrderedIndexes=8000
MaxNoOfUniqueHashIndexes=5000
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
[TCP DEFAULT]
# Management Server
[NDB_MGMD]
HostName=192.168.10.35    # the IP of THIS SERVER
# Storage Nodes
[NDBD]
HostName=192.168.10.1 # the IP of the FIRST SERVER
DataDir=/var/lib/mysql-cluster
[NDBD]
HostName=192.168.10.2           # the IP of the SECOND SERVER
DataDir=/var/lib/mysql-cluster
# 2 MySQL servers (SQL nodes)
[MYSQLD]
[MYSQLD]
 
The error log entries are as follows:
[for storage node 2 -- 192.168.10.2]
 
Date/Time: Tuesday 22 February 2005 - 18:23:48
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbtupGen.cpp
Object of reference: DBTUP (Line: 484) 0x0000000e
ProgramName: ndbd
ProcessID: 13743
TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.10
***EOM***
 
[for storage node 1 -- 192.168.10.1]
Date/Time: Tuesday 22 February 2005 - 12:47:53
Type of error: error
Message: File has already been opened
Fault ID: 2807
Problem data:
Object of reference: OpenFiles::insert()
ProgramName: ndbd
ProcessID: 13639
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.7
***EOM***
 

Date/Time: Tuesday 22 February 2005 - 12:51:31
Type of error: error
Message: No message slogan found
Fault ID: 0
Problem data:
Object of reference: DBDICT (Line: 0) 0x0000000a
ProgramName: ndbd
ProcessID: 13709
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.8
***EOM***
 
The specs of the servers in the cluster are as follows:
192.168.10.1
RAM: 8GB
HD: SCSI 80GB
Sun SPARC Enterprise 3500
OS: Solaris 10
running ndbd, mysqld
 

192.168.10.2
RAM: 8GB
HD: SCSI 80GB
Sun SPARC Enterprise 3500
OS: Solaris 9
running ndbd, mysqld
 
192.168.10.35
RAM: 2GB
HD: SCSI 40GB
Sun SPARC Enterprise 420R
OS: Solaris 10
running ndb_mgmd

How to repeat:
n/a

Suggested fix:
n/a
[23 Feb 2005 7:22] Jonas Oreland
Hi,

The error log indicates disk problems.
To store 1 GB in the database, you need ~3 GB on disk as a result of the checkpointing algorithm.
Handling of a full disk is currently not very user-friendly...
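
Applied to the configuration above, as a rough estimate (not an exact figure):

DataMemory per node  = 3072 MB (~3 GB)
disk needed per node ≈ 3 x 3 GB ≈ 9 GB for checkpoints alone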

To test if this is the problem, you can add "Diskless=1" to your config file.
This will make the cluster not use the disk at all,
so disk full will never be a problem.
(But you'll lose your data if you shut down the cluster.)
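
For example, in the [NDBD DEFAULT] section of the config.ini above (a test-only setting, since nothing is persisted):

[NDBD DEFAULT]
NoOfReplicas=2
IndexMemory=756M
DataMemory=3072M
Diskless=1    # no checkpoints or redo log written to disk; data is lost on shutdown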

Otherwise, is it possible to:
1) upload test data so we can try to reproduce it at our site?
2) upload the trace files mentioned in the error logs?

/Jonas
[26 Mar 2005 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".