MySQL Bugs: #41031: All ndbd nodes crash on backup failure when giving <backup id> manually

Bug #41031	All ndbd nodes crash on backup failure when giving <backup id> manually
Submitted:	25 Nov 2008 19:58	Modified:	19 Feb 2009 8:04
Reporter:	Daniel Salinas
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	ndb-6.3.17 ndb-6.4.0	OS:	Linux (rhel5.2)
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	ndb cluster segfault

Description:
I have created a cluster of 8 nodes in total, which break down as follows:

4 x ndbd
2 x ndb_mgmd
2 x mysqld

This cluster uses the following config:

####################
# Management nodes #
####################
[NDB_MGMD DEFAULT]
LogDestination=FILE:filename=/var/log/cluster-log,maxsize=536870912,maxfiles=4
DataDir=/var/lib/mysql-cluster

[NDB_MGMD]
HostName=<ipaddress>
Id=1
ArbitrationRank=1

[NDB_MGMD]
HostName=<ipaddress>
Id=2
ArbitrationRank=2

#############
# SQL Nodes #
#############
# Id's 5 through 43 are reserved for SQL Nodes
# Should be way more than we will ever need.
[MYSQLD]
HostName=<ipaddress>
Id=3

[MYSQLD]
HostName=<ipaddress>
Id=4

[MYSQLD]
HostName=<ipaddress>
Id=5

[MYSQLD]
HostName=<ipaddress>
Id=6

[MYSQLD]
HostName=<ipaddress>
Id=7

[MYSQLD]
HostName=<ipaddress>
Id=8

[MYSQLD]
HostName=<ipaddress>
Id=9

[MYSQLD]
HostName=<ipaddress>
Id=10

##############################################
# These are reserved for backup/restore jobs #
# from the 2 management nodes                #
##############################################

[MYSQLD]
HostName=<ipaddress>
Id=42

[MYSQLD]
HostName=<ipaddress>
Id=43

#########################
# TCP Defaults for NDBD #
#########################
[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=1M

###################################
# NDBD (Data Nodes) Configuration #
###################################
# Each node has 16GB of RAM
[NDBD DEFAULT]
# Allocate 70% to Data
DataMemory=11468MB

# Allocate 15% to Primary Keys
IndexMemory=2457MB

# Keep 1 redundant copy of the data
NoOfReplicas=2

# Set DataDir, this dir is default but I like to set it anyway.
DataDir=/var/lib/mysql-cluster

# Other Tunings.  These are where the real magic happens.
SchedulerExecutionTimer=50
LockMaintThreadsToCPU=0
LockPagesInMainMemory=0
StopOnError=0
ODirect=1
MaxNoOfConcurrentOperations=16384
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=256
MaxNoOfTables=128
MaxNoOfAttributes=2048
CompressedBackup=1

[NDBD]
HostName=<ipaddress>
Id=44

[NDBD]
HostName=<ipaddress>
Id=45

[NDBD]
HostName=<ipaddress>
Id=46

[NDBD]
HostName=<ipaddress>
Id=47
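
As a sanity check on the memory settings in the config above, the percentages in the comments can be recomputed from the stated 16GB per node. This is only a sketch: `percent_of_ram` is a hypothetical helper name, and it assumes 1GB = 1024MB and rounding down to whole megabytes.

```python
TOTAL_RAM_MB = 16 * 1024  # 16GB per data node, as stated in the config comments


def percent_of_ram(percent, total_mb=TOTAL_RAM_MB):
    """Return the given percentage of RAM in whole megabytes, rounded down."""
    return total_mb * percent // 100


if __name__ == "__main__":
    print("DataMemory  =", percent_of_ram(70), "MB")   # 70% for data
    print("IndexMemory =", percent_of_ram(15), "MB")   # 15% for primary keys
```

Under those assumptions the results match the DataMemory=11468MB and IndexMemory=2457MB values in the config.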

The ndbd processes are reporting this error at random on the management console:

Node 45: Forced node shutdown completed, restarting. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

This is happening for all 4 data nodes.  I have run memory tests on all servers and everything checks out.

Here is the error log from one of the nodes:

Current byte-offset of file-pointer is: 1067                      

Time: Tuesday 25 November 2008 - 13:30:02
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ndbd
Pid: 6157
Trace: /var/lib/mysql-cluster/ndb_44_trace.log.1
Version: mysql-5.1.27 ndb-6.3.17-RC
***EOM***
                                                                                                 
Time: Tuesday 25 November 2008 - 13:50:02
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ndbd
Pid: 6203
Trace: /var/lib/mysql-cluster/ndb_44_trace.log.2
Version: mysql-5.1.27 ndb-6.3.17-RC
***EOM***
                                                                       

Attached are the traces.
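
For anyone comparing error logs across several data nodes, the entries can be split on the ***EOM*** marker and read as "Key: value" lines. The following is a minimal sketch based only on the excerpt above; `parse_error_log`, `FIELDS`, and `SAMPLE` are hypothetical names, and the sample is an abbreviated copy of the two entries shown.

```python
# Field names taken from the error log excerpt above.
FIELDS = {"Time", "Status", "Message", "Error", "Error data",
          "Error object", "Program", "Pid", "Trace", "Version"}

# Abbreviated copy of the two entries shown above.
SAMPLE = """\
Current byte-offset of file-pointer is: 1067

Time: Tuesday 25 November 2008 - 13:30:02
Error: 6000
Error data: Signal 11 received; Segmentation fault
Pid: 6157
***EOM***

Time: Tuesday 25 November 2008 - 13:50:02
Error: 6000
Error data: Signal 11 received; Segmentation fault
Pid: 6203
***EOM***
"""


def parse_error_log(text):
    """Return one dict per ***EOM***-terminated entry in an ndbd error log."""
    entries = []
    for chunk in text.split("***EOM***"):
        entry = {}
        for line in chunk.splitlines():
            # Split at the first colon; the time-of-day colons stay in the value.
            key, sep, value = line.partition(":")
            # Keep only known fields; this skips blank lines and the
            # "Current byte-offset" header line.
            if sep and key.strip() in FIELDS:
                entry[key.strip()] = value.strip()
        if entry:
            entries.append(entry)
    return entries


if __name__ == "__main__":
    for entry in parse_error_log(SAMPLE):
        print(entry["Time"], "pid", entry["Pid"], "error", entry["Error"])
```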

How to repeat:
Start up a cluster with 4 ndbd nodes using the config above, running the latest ndbd packages from your site.

Suggested fix:
none

To clarify, I am running these packages across my cluster:

MySQL-Cluster-gpl-client-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-devel-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-extra-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-management-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-server-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-shared-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-storage-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-tools-6.3.17-0.rhel5.x86_64.rpm

Thanks to the masterful work of Matthew Montgomery, we traced this back to backups.  I had zeroed my cluster and was preparing the import when I saw the crashes.  Everything linked back to a 5-minute hot backup script that was running from cron.  The node that was master when the backup job was kicked off would die and restart.  This appears to be a problem not with ndbd itself but with online backup.  I am moving the severity to S3, as backup appears to work fine when there is table data in NDB.  If there is no table data in the cluster, online backup blows up both the management node that kicks off the backup and the master (oldest) data node.

In the spirit of retesting, I created a test table and verified that backups run.  I then dropped the NDB table, and backups still run.  The only other difference was that these ndbd nodes had all been started with --initial and no data had been imported.  I am also using a custom backup ID in the format MMDDHHmm; I am not sure whether that has anything to do with it.
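
For reference, an ID in the MMDDHHmm format mentioned above could be generated as follows. This is only a sketch: the actual cron script was not posted, and `make_backup_id` is a hypothetical helper name.

```python
from datetime import datetime


def make_backup_id(now):
    """Return a backup ID in MMDDHHmm form (month, day, hour, minute)."""
    # strftime zero-pads each field; int() then drops a leading zero,
    # so IDs generated from January to September have seven digits.
    return int(now.strftime("%m%d%H%M"))


if __name__ == "__main__":
    # Timestamp of the first crash in the error log (seconds dropped).
    print(make_backup_id(datetime(2008, 11, 25, 13, 30)))
```

Because the format has no year component, the same ID comes around again one year later.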

It appears this only happens when backing up an empty cluster and using a custom backup ID with the START BACKUP command.

So the particular case that causes this is: on a freshly initialized cluster, running START BACKUP <ID> on the management node, where ID is a custom backup ID, kills the cluster master node.

To see this error you must execute a backup on a completely clean cluster (--initial) with the $datadir/BACKUPS completely empty.  The START BACKUP also has to include an explicitly defined <backup id>.

ndb_mgm> START BACKUP 1

No backup issued by a plain "START BACKUP" (without an ID) may have been run beforehand.

Workaround: Simple. Run a regular "START BACKUP" first, before any START BACKUP <backup id>, or ensure that at least one user-defined table exists in the cluster.
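
A scripted backup job could apply the first form of this workaround roughly as follows. This is a sketch under stated assumptions, not the reporter's script: `backup_commands` and `run_backup` are hypothetical names, the caller is assumed to know whether a plain START BACKUP has already been run since the cluster was initialized, and ndb_mgm is assumed to be on the PATH and able to reach the management server with its default connect string.

```python
import subprocess


def backup_commands(backup_id, plain_backup_done):
    """Build the ndb_mgm invocations for one backup run."""
    commands = []
    if not plain_backup_done:
        # Workaround from this report: on a fresh cluster, issue a plain
        # START BACKUP before the first START BACKUP <backup id>.
        commands.append(["ndb_mgm", "-e", "START BACKUP"])
    commands.append(["ndb_mgm", "-e", "START BACKUP {}".format(backup_id)])
    return commands


def run_backup(backup_id, plain_backup_done):
    """Run the commands in order, stopping at the first failure."""
    for command in backup_commands(backup_id, plain_backup_done):
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Print the commands instead of running them against a live cluster.
    for command in backup_commands(11251330, plain_backup_done=False):
        print(" ".join(command))
```

How `plain_backup_done` is tracked is left open here; a marker file written after the first successful plain backup would be one option.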

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/66803

2862 Jonas Oreland	2009-02-18
      ndb - bug#41031 - incorrect handling of start backup <id>

Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090218220511-bgnaexvwjjfq2g6w) (version source revid:jonas@mysql.com-20090218205319-9bapz34b4uam3uno) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)

Pushed into 5.1.32-ndb-6.3.23 (revid:jonas@mysql.com-20090218220353-ih9lxz0jg5od9k2c) (version source revid:jonas@mysql.com-20090218205235-emzevgpji2jb2gwf) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)

Pushed into 5.1.32-ndb-6.3.23 (revid:tomas.ulin@sun.com-20090219064350-7jj9hsvvbgsp88g5) (version source revid:tomas.ulin@sun.com-20090219064350-7jj9hsvvbgsp88g5) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)

Pushed into 5.1.32-ndb-6.3.23 (revid:tomas.ulin@sun.com-20090219070811-p36a79y85qfv5vsz) (version source revid:tomas.ulin@sun.com-20090219070811-p36a79y85qfv5vsz) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)

Documented bugfix in the NDB-6.2.17, 6.3.23, and 6.4.3 changelogs as follows:

        Given a MySQL Cluster containing no data (that is, whose data
        nodes had all been started using --initial, and into which no
        data had yet been imported) and having an empty backup
        directory, executing START BACKUP with a user-specified backup
        ID caused the data nodes to crash.

Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090219103836-vz65tl5a9n7rji1h) (version source revid:jonas@mysql.com-20090219103836-vz65tl5a9n7rji1h) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/66877

2866 Tomas Ulin	2009-02-19
      remove sleep and add comment after bug#41031 was fixed
      modified:
        mysql-test/suite/ndb_team/t/ndb_autodiscover3.test

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/66904

2865 Tomas Ulin	2009-02-19
      remove sleep and add comment after bug#41031 was fixed
      modified:
        mysql-test/suite/ndb_team/t/ndb_autodiscover3.test