Bug #69959  Unable to start data node
Submitted: 8 Aug 2013 3:56     Modified: 8 Feb 2014 4:59
Reporter: Firzen Le            Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)    Version: 7.3.2
OS: Linux (CentOS 5.x)         CPU Architecture: Any
Assigned to:

[8 Aug 2013 3:56] Firzen Le
Description:
I have a cluster with 4 data nodes (2 replicas), 2 management nodes, and 1 SQL node.
I stopped one data node and left the rest of the cluster running for a few days.
When I then tried to start the node again, I got this error:

 "Node 13: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'."

If I start it with ndbd --initial it does come up, but as far as I know --initial can cause data loss, so it is not safe except in special cases (e.g. a version upgrade).

Here's my config.ini:

[ndbd default]
# Options affecting ndbd processes on all data nodes:
datadir=/var/lib/mysql_ndbd/data   # Directory for this data node's data files
NoOfReplicas=2    # Number of replicas
DataMemory=2000M    # How much memory to allocate for data storage
IndexMemory=512M   # How much memory to allocate for index storage
                  # For DataMemory and IndexMemory, we have used the
                  # default values. Since the "world" database takes up
                  # only about 500KB, this should be more than enough for
                  # this example Cluster setup.
MaxNoOfTables=3000
MaxNoOfAttributes=30000
DiskPageBufferMemory=128M
MaxNoOfOrderedIndexes= 512
MaxNoOfConcurrentOperations=250000 # 800000
MaxNoOfLocalOperations=270000

[tcp default]
# TCP/IP options:
portnumber=2202   # This is the default; however, you can use any
                  # port that is free for all the hosts in the cluster
                  # Note: It is recommended that you do not specify the port
                  # number at all and simply allow the default value to be used
                  # instead

[ndb_mgmd]
# Management process options:
NodeId=1
hostname=192.168.1.51           # Hostname or IP address of MGM node
datadir=/var/lib/mysql-cluster  # Directory for MGM node log files
LogDestination=FILE:filename=cluster.log,maxsize=1000000,maxfiles=6

[ndb_mgmd]
# Management process options:
NodeId=2
hostname=192.168.1.55           # Hostname or IP address of MGM node
datadir=/var/lib/mysql-cluster  # Directory for MGM node log files
LogDestination=FILE:filename=cluster.log,maxsize=1000000,maxfiles=6

[ndbd]
NodeId=11
NodeGroup=0
hostname=192.168.1.51           # Hostname or IP address

[ndbd]
NodeId=12
NodeGroup=0
hostname=192.168.1.52           # Hostname or IP address

[ndbd]
NodeId=13
NodeGroup=1
hostname=192.168.1.54           # Hostname or IP address

[ndbd]
NodeId=14
NodeGroup=1
hostname=192.168.1.59           # Hostname or IP address

[mysqld]
# SQL node options:
NodeId=41
hostname=192.168.1.51           # Hostname or IP address

[mysqld]
NodeId=42
hostname=192.168.1.56

[mysqld]

Does anyone have any idea about this?
Thanks in advance.

How to repeat:
Stop a data node for a few days.
Keep the rest of the cluster running (including importing new data).
Start the node again.
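
For reference, the steps above can be driven with the standard cluster tools; node ID 13 and the hosts below are taken from the config.ini in this report:

```shell
# From the management client (on 192.168.1.51 or 192.168.1.55):
ndb_mgm -e "13 STOP"    # stop data node 13; the cluster keeps running on its node-group peer (node 14)

# ...some days later, on the data node host 192.168.1.54:
ndbd --ndb-connectstring=192.168.1.51,192.168.1.55

# Watch the node go through the start phases:
ndb_mgm -e "13 STATUS"
```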
[8 Aug 2013 8:07] Hartmut Holzgraefe
Would be interesting to see a core dump for this ...

But if your node was already down for several days, an --initial restart does no harm: its local checkpoints are so far out of date by now that it would have to do a full resync from its node group anyway ...
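
Concretely, that is an initial restart of just this one node (IDs from the config.ini in this report): it wipes only node 13's local files, which are then rebuilt over the network from its node-group peer, node 14.

```shell
# On 192.168.1.54 only; never run --initial on both nodes of a node group at once:
ndbd --initial --ndb-connectstring=192.168.1.51,192.168.1.55
```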
[10 Aug 2013 3:59] Firzen Le
Hi Hartmut,
Thanks for replying.
I'll try to enable core dumps in the OS and paste the result here.
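
Enabling OS core dumps for ndbd usually involves something like the following (a generic Linux sketch, not taken from this report; limits and paths vary by distribution):

```shell
# Lift the core-file size limit in the shell that starts ndbd:
ulimit -c unlimited
ulimit -c    # should now report the new limit

# Optionally set a predictable core file location (requires root);
# /tmp/core.<executable>.<pid> is an illustrative pattern, not a cluster default:
# echo '/tmp/core.%e.%p' > /proc/sys/kernel/core_pattern
```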
By the way, if one day I have to use ndbd --initial on all data nodes, is there any chance of getting my data back?
[10 Aug 2013 6:27] Hartmut Holzgraefe
If you do a full initial restart ... well, you had better have a recent backup you can restore from, plus binlogs enabled on the mysqld nodes so that you can do point-in-time recovery covering the span between the time of the backup and the time the cluster stopped working ...
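
A sketch of that recovery path, assuming a backup was taken with the management client's START BACKUP and binary logging was enabled on the mysqld nodes (the backup ID, paths, and binlog position below are placeholders):

```shell
# 1. Restore the schema from one node's backup files (-m), then the data from every node's:
ndb_restore -c 192.168.1.51 -n 11 -b 1 -m -r /var/lib/mysql_ndbd/data/BACKUP/BACKUP-1
ndb_restore -c 192.168.1.51 -n 12 -b 1 -r /var/lib/mysql_ndbd/data/BACKUP/BACKUP-1
# ...repeat the -r step for node IDs 13 and 14...

# 2. Replay the binlog from the backup's stop position up to the failure:
mysqlbinlog --start-position=<pos> binlog.000001 | mysql -u root -p
```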
[8 Jan 2014 4:59] MySQL Verification Team
Hello Firzen,

Thank you for the report.
I couldn't reproduce the reported issue. Is this issue still repeatable (with the latest GA, 7.3.3)?

Also, could you please attach the cluster logs? Preferably using the ndb_error_reporter utility:

  http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-programs-ndb-error-reporter.html
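
For reference, the utility is pointed at the cluster configuration file and collects each node's logs over ssh into a single archive (the config path below is assumed, not stated in this report):

```shell
ndb_error_reporter /var/lib/mysql-cluster/config.ini
# add --fs to also include the data nodes' file systems (the archive can get large)
```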

Thanks,
Umesh
[9 Feb 2014 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".