| Bug #41031 | All ndbd nodes crash on backup failure when giving <backup id> manually | ||
|---|---|---|---|
| Submitted: | 25 Nov 2008 19:58 | Modified: | 19 Feb 2009 8:04 |
| Reporter: | Daniel Salinas | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
| Version: | ndb-6.3.17 ndb-6.4.0 | OS: | Linux (rhel5.2) |
| Assigned to: | Jonas Oreland | CPU Architecture: | Any |
| Tags: | ndb cluster segfault | ||
[25 Nov 2008 20:18]
Daniel Salinas
I wanted to qualify: I am running these packages across my cluster:
MySQL-Cluster-gpl-client-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-devel-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-extra-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-management-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-server-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-shared-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-storage-6.3.17-0.rhel5.x86_64.rpm
MySQL-Cluster-gpl-tools-6.3.17-0.rhel5.x86_64.rpm
[25 Nov 2008 21:13]
Daniel Salinas
Thanks to the masterful work of Matthew Montgomery, we backtraced this to backups. I had zeroed my cluster and was preparing the import when I saw the crashes. Everything linked back to a 5-minute hot backup script running in cron: whenever the backup job was kicked off, the node that was master at the time would die and restart. This appears to be a problem not with ndbd itself but with the online backup. I am moving the severity to S3, as it appears to work fine when you have table data in NDB. If you don't have any table data in the cluster, the online backup blows up the management node kicking off the backup and the master (oldest) node.
[25 Nov 2008 21:38]
Daniel Salinas
In the spirit of retesting, I had created a test table and verified that backups run. I then dropped the NDB table, and backups still run. The only other thing that happened is that these ndbd nodes were all started with --initial and no data was imported. Also, I am using a custom backup id in the format MMDDHHmm; not sure if that has anything to do with it.
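For reference, an id in that MMDDHHmm form is what date(1) produces with a matching format string. This is a hypothetical illustration of how such an id might be generated, not the reporter's actual script:

```sh
# Hypothetical: produce a backup id in MMDDHHmm form,
# e.g. 11251958 for 25 Nov 19:58.
BACKUP_ID=$(date '+%m%d%H%M')
echo "$BACKUP_ID"
```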
[25 Nov 2008 21:54]
Daniel Salinas
It appears this only happens when backing up an empty cluster while using a custom backup id with the START BACKUP command.
[25 Nov 2008 22:00]
Daniel Salinas
So the particular case that causes this: if you have a freshly initialized cluster and run START BACKUP <ID> on your management node, where <ID> is a custom backup id, the cluster master node dies.
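A minimal way to trigger the crash, as distilled above. This is a sketch assuming a freshly initialized, empty cluster; the connect string is illustrative:

```sh
# Assumes all data nodes were started with --initial, no tables were
# created, and the backup directory is empty; host/port are illustrative.
ndb_mgm --ndb-connectstring=mgmhost:1186 -e "START BACKUP 1"
# On the affected versions, the master (oldest) data node then dies
# with error 6000 (signal 11).
```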
[25 Nov 2008 22:07]
MySQL Verification Team
To see this error you must execute a backup on a completely clean cluster (started with --initial) with $datadir/BACKUPS completely empty. The START BACKUP also has to include an explicitly defined <backup id>:

ndb_mgm> START BACKUP 1

No other backups issued by "START BACKUP" alone should be done beforehand.

Workaround: simple. Run a regular "START BACKUP" first, before any START BACKUP <backup id>, or ensure that at least one user-defined table exists in the cluster.
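For the cron job the reporter described, the workaround could be wired in roughly like this. This is a hypothetical sketch, not the reporter's script; the connect string, stamp-file path, and MMDDHHmm id format are assumptions drawn from this thread:

```sh
#!/bin/sh
# Hypothetical cron backup script applying the workaround: issue one
# plain START BACKUP (auto-assigned id) before ever passing an
# explicit <backup id> on a freshly initialized cluster.
CONNECT="mgmhost:1186"                            # illustrative
STAMP=/var/lib/mysql-cluster/.first-backup-done   # illustrative

if [ ! -f "$STAMP" ]; then
    # First run after (re)initialization: no explicit id.
    ndb_mgm --ndb-connectstring="$CONNECT" -e "START BACKUP" \
        && touch "$STAMP"
fi

# Regular runs: explicit id in the reporter's MMDDHHmm format.
ndb_mgm --ndb-connectstring="$CONNECT" -e "START BACKUP $(date '+%m%d%H%M')"
```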
[18 Feb 2009 20:53]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/66803

2862 Jonas Oreland 2009-02-18
ndb - bug#41031 - incorrect handling of start backup <id>
[18 Feb 2009 22:06]
Bugs System
Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090218220511-bgnaexvwjjfq2g6w) (version source revid:jonas@mysql.com-20090218205319-9bapz34b4uam3uno) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)
[18 Feb 2009 22:08]
Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:jonas@mysql.com-20090218220353-ih9lxz0jg5od9k2c) (version source revid:jonas@mysql.com-20090218205235-emzevgpji2jb2gwf) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)
[19 Feb 2009 6:44]
Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:tomas.ulin@sun.com-20090219064350-7jj9hsvvbgsp88g5) (version source revid:tomas.ulin@sun.com-20090219064350-7jj9hsvvbgsp88g5) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)
[19 Feb 2009 7:08]
Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:tomas.ulin@sun.com-20090219070811-p36a79y85qfv5vsz) (version source revid:tomas.ulin@sun.com-20090219070811-p36a79y85qfv5vsz) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)
[19 Feb 2009 8:04]
Jon Stephens
Documented bugfix in the NDB-6.2.17, 6.3.23, and 6.4.3 changelogs as follows:
Given a MySQL Cluster containing no data (that is, whose data nodes had all been started using --initial, and into which no data had yet been imported) and having an empty backup directory, executing START BACKUP with a user-specified backup ID caused the data nodes to crash.
[19 Feb 2009 10:40]
Bugs System
Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090219103836-vz65tl5a9n7rji1h) (version source revid:jonas@mysql.com-20090219103836-vz65tl5a9n7rji1h) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)
[19 Feb 2009 10:44]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/66877

2866 Tomas Ulin 2009-02-19
remove sleep and add comment after bug#41031 was fixed
modified:
  mysql-test/suite/ndb_team/t/ndb_autodiscover3.test
[19 Feb 2009 13:03]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/66904

2865 Tomas Ulin 2009-02-19
remove sleep and add comment after bug#41031 was fixed
modified:
  mysql-test/suite/ndb_team/t/ndb_autodiscover3.test

Description:
I have created a cluster consisting of 8 total nodes; they break down as follows:

4 x ndbd
2 x ndb_mgmd
2 x mysqld

This cluster uses the following config:

```ini
####################
# Management nodes #
####################
[NDB_MGMD DEFAULT]
LogDestination=FILE:filename=/var/log/cluster-log,maxsize=536870912,maxfiles=4
DataDir=/var/lib/mysql-cluster

[NDB_MGMD]
HostName=<ipaddress>
Id=1
ArbitrationRank=1

[NDB_MGMD]
HostName=<ipaddress>
Id=2
ArbitrationRank=2

#############
# SQL Nodes #
#############
# Id's 5 through 43 are reserved for SQL Nodes
# Should be way more than we will ever need.
[MYSQLD]
HostName=<ipaddress>
Id=3

[MYSQLD]
HostName=<ipaddress>
Id=4

[MYSQLD]
HostName=<ipaddress>
Id=5

[MYSQLD]
HostName=<ipaddress>
Id=6

[MYSQLD]
HostName=<ipaddress>
Id=7

[MYSQLD]
HostName=<ipaddress>
Id=8

[MYSQLD]
HostName=<ipaddress>
Id=9

[MYSQLD]
HostName=<ipaddress>
Id=10

##############################################
# These are reserved for backup/restore jobs #
# from the 2 management nodes                #
##############################################
[MYSQLD]
HostName=<ipaddress>
Id=42

[MYSQLD]
HostName=<ipaddress>
Id=43

#########################
# TCP Defaults for NDBD #
#########################
[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=1M

###################################
# NDBD (Data Nodes) Configuration #
###################################
# Each node has 16GB of RAM
[NDBD DEFAULT]
# Allocate 70% to Data
DataMemory=11468MB
# Allocate 15% to Primary Keys
IndexMemory=2457MB
# Keep 1 redundant copy of the data
NoOfReplicas=2
# Set DataDir, this dir is default but I like to set it anyway.
DataDir=/var/lib/mysql-cluster
# Other tunings. These are where the real magic happens.
SchedulerExecutionTimer=50
LockMaintThreadsToCPU=0
LockPagesInMainMemory=0
StopOnError=0
ODirect=1
MaxNoOfConcurrentOperations=16384
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=256
MaxNoOfTables=128
MaxNoOfAttributes=2048
CompressedBackup=1

[NDBD]
HostName=<ipaddress>
Id=44

[NDBD]
HostName=<ipaddress>
Id=45

[NDBD]
HostName=<ipaddress>
Id=46

[NDBD]
HostName=<ipaddress>
Id=47
```

The ndbd process is randomly giving this error on the management console:

Node 45: Forced node shutdown completed, restarting. Initiated by signal 11. Caused by error 6000: 'Error OS signal received (Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

This is happening for all 4 data nodes. I have run memory tests on all servers and everything checks out. Here is the error log from one of the nodes:

```
Current byte-offset of file-pointer is: 1067

Time: Tuesday 25 November 2008 - 13:30:02
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ndbd
Pid: 6157
Trace: /var/lib/mysql-cluster/ndb_44_trace.log.1
Version: mysql-5.1.27 ndb-6.3.17-RC
***EOM***

Time: Tuesday 25 November 2008 - 13:50:02
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ndbd
Pid: 6203
Trace: /var/lib/mysql-cluster/ndb_44_trace.log.2
Version: mysql-5.1.27 ndb-6.3.17-RC
***EOM***
```

Attached are the traces.

How to repeat:
Start up a 4-ndbd cluster with this config running the latest ndbd packages from your site (a hedged command sketch follows below).

Suggested fix:
None.
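The repeat steps, as distilled from this thread, might look like the following sketch. Hostnames, the config path, and the connect string are illustrative assumptions, not taken from the report:

```sh
# Illustrative reconstruction of the repeat steps; paths/hosts assumed.
# 1. Start a management node with the config above.
ndb_mgmd -f /var/lib/mysql-cluster/config.ini

# 2. On each of the four data-node hosts, start ndbd fresh.
ndbd --initial --ndb-connectstring=mgmhost:1186

# 3. With no tables created and the backup directory still empty,
#    request a backup with an explicit id.
ndb_mgm --ndb-connectstring=mgmhost:1186 -e "START BACKUP 1"
# On the affected versions the master data node crashes (error 6000).
```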