Bug #21271 2 Node cluster can't recover from bringing one node down
Submitted: 25 Jul 2006 11:07
Modified: 1 Sep 2006 15:11
Reporter: Jan-willem van Eys
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.11
OS: Linux (RHEL-ES4 (Nahant))
Assigned to: Jonas Oreland
CPU Architecture: Any

[25 Jul 2006 11:07] Jan-willem van Eys
Description:
After loading our test database (268 tables, 180MB text dump) into a cluster with 2 data nodes, I brought the second node down from within ndb_mgm (by issuing '3 stop').
This works, except that ndb_mgm loses its connection and needs to be run again.

Starting the ndbd process on that node again causes the following error:

2006-07-25 12:38:19 [ndbd] INFO     -- Error handler startup shutting down system
2006-07-25 12:38:19 [ndbd] INFO     -- Error handler shutdown completed - exiting
2006-07-25 12:38:19 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2006-07-25 12:38:19 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

The error log reports:

Current byte-offset of file-pointer is: 568

Time: Tuesday 25 July 2006 - 12:38:19
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: restore.cpp
Error object: RESTORE (Line: 1150) 0x0000000a
Program: ndbd
Pid: 29331
Trace: /mnt/lvol1/mysql-cluster/ndb_3_trace.log.1
Version: Version 5.1.11 (beta)
***EOM***

Restarting doesn't help; it just causes the same error.

The commands used to set up the logfile group and tablespace:
CREATE LOGFILE GROUP lg_1
ADD UNDOFILE 'undo_1.dat'
INITIAL_SIZE 128M
ENGINE NDB;

ALTER LOGFILE GROUP lg_1
ADD UNDOFILE 'undo_2.dat'
INITIAL_SIZE 128M
ENGINE NDB;

CREATE TABLESPACE ts_1
ADD DATAFILE 'data_1.dat'
USE LOGFILE GROUP lg_1
INITIAL_SIZE 1024M
ENGINE NDB;

ALTER TABLESPACE ts_1
ADD DATAFILE 'data_2.dat'
INITIAL_SIZE 1024M
ENGINE NDB;

The config.ini:

[NDBD DEFAULT]    
NoOfReplicas=2
DataMemory=1024M
IndexMemory=768M
MaxNoOfOrderedIndexes=800
MaxNoOfAttributes=8000
MaxNoOfTables=600

[TCP DEFAULT]     
portnumber=2202

[NDB_MGMD]                      
hostname=192.168.100.4          # Grijs is management node
datadir=/var/lib/mysql-cluster

# Test2 - Data Node
[NDBD]                          
hostname=192.168.100.2          # Test2
datadir=/mysql

# Test3 - Data Node
[NDBD]                          
hostname=192.168.100.3          # Test3
datadir=/mnt/lvol1/mysql-cluster

# SQL node options:
[MYSQLD]                        
hostname=192.168.100.3          # Test3 heeft een mysqld
[MYSQLD]
hostname=192.168.100.1          # Test1 heeft een mysqld
[MYSQLD]
[MYSQLD]
[MYSQLD]

How to repeat:
Build a 4 machine cluster with MySQL v5.1.11, 2 data nodes, 1 mgm node, 1 mysqld node.

- Set the database up with disk-based storage.
- Create all tables by running the .sql files from a mysqldump (modified for NDB with disk storage).
- Fill all tables by running mysqlimport.
- Issue '3 stop' from the ndb_mgm console.
- Run 'ndbd' on the data node you brought down.
- Wait.
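
The last stop/start steps above can be scripted. The following is only a minimal sketch, assuming that ndb_mgm and ndbd are on the PATH and can find the management server through their usual configuration (for example my.cnf). The helper names (stop_command, restart_command, run) are hypothetical and not part of MySQL Cluster. By default the script only prints the commands; the stop command can be issued from any host with ndb_mgm, while ndbd has to be started on the data node host itself.

```python
import shlex
import subprocess


def stop_command(node_id):
    """Build the argv that stops one data node via the management client."""
    # Equivalent to typing '<node_id> stop' at the ndb_mgm prompt.
    return ["ndb_mgm", "-e", f"{node_id} stop"]


def restart_command():
    """Build the argv that starts the data node process again."""
    # Must be executed on the host of the node that was stopped.
    return ["ndbd"]


def run(argv, dry_run=True):
    """Print the command; execute it only when dry_run is False."""
    print(shlex.join(argv))
    if not dry_run:
        subprocess.run(argv, check=True)


if __name__ == "__main__":
    run(stop_command(3))    # node ID 3, as in the report
    run(restart_command())
```

Calling run(..., dry_run=False) would actually execute the commands, which should only be done against a test cluster.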
[25 Jul 2006 11:25] Miguel Solorzano
Changing to the appropriate category: Cluster.
[1 Aug 2006 14:24] Jonas Oreland
Can you please upload the trace file as well?
[2 Aug 2006 8:00] Jan-willem van Eys
tarball with error and trace logs

Attachment: cluster_crash_logs.tgz (application/x-compressed-tar, text), 38.80 KiB.

[2 Aug 2006 8:01] Jan-willem van Eys
trace logs from the crashing node added
[2 Aug 2006 8:21] Jonas Oreland
Thx...

I _think_ I know what the problem is...

Note for myself: Two fragments should be restored using different LCP-no
                 But I incorrectly put them in the same file, making LCP restore...
[2 Aug 2006 8:25] Jonas Oreland
As a work-around you should be able to start the node with "--initial".
[3 Aug 2006 13:33] Jonas Oreland
Hi

1) Would it be possible to get access to the filesystem causing this?
2) What hardware do you use?
   What Linux kernel is used in RHEL-ES4 (Nahant)? (try "uname -a")

/Jonas
[7 Aug 2006 12:29] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10115

ChangeSet@1.2263, 2006-08-07 14:28:58+02:00, jonas@perch.ndb.mysql.com +10 -0
  ndb - bug#21271
    make each fragment use own LCP file, so that (s/n)r can use different LCP-no for different fragments
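
As a rough illustration of the idea in this changeset (not the actual NDB code), the toy model below gives every fragment its own LCP file, so a system or node restart can pick a different LCP number for each fragment. The path layout, the function names lcp_file and restore_plan, and the example table/fragment IDs are all assumptions made for this sketch; they are not taken from the patch.

```python
from pathlib import PurePosixPath


def lcp_file(lcp_no, table_id, fragment_id):
    """Return a per-fragment LCP file path (hypothetical naming scheme)."""
    return PurePosixPath("LCP") / str(lcp_no) / f"T{table_id}F{fragment_id}.Data"


def restore_plan(fragment_lcp):
    """Map each (table_id, fragment_id) to the file it would be restored from.

    fragment_lcp maps (table_id, fragment_id) -> LCP number to restore.
    Because every fragment has its own file, two fragments of the same
    table may use different LCP numbers without colliding.
    """
    return {frag: lcp_file(no, *frag) for frag, no in fragment_lcp.items()}


if __name__ == "__main__":
    # Two fragments of one table, restored from different LCP numbers.
    plan = restore_plan({(2, 0): 0, (2, 1): 1})
    for frag, path in plan.items():
        print(frag, "->", path)
```

With both fragments in a single shared file, the restore would be tied to one LCP number for both of them, which is the situation described in the developer's note earlier in this thread.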
[7 Aug 2006 12:31] Jonas Oreland
NOTE TO JON: This patch makes the filesystem incompatible with earlier versions
  (i.e. a node restart with --initial is needed during upgrade)
[1 Sep 2006 8:07] Jonas Oreland
Pushed to 5.1.12.
[1 Sep 2006 15:11] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented incompatible change in 5.1.12 changelog and Cluster Upgrades/Downgrades section of Manual.