MySQL Bugs: #21271: 2 Node cluster can't recover from bringing one node down

Bug #21271: 2 Node cluster can't recover from bringing one node down
Submitted: 25 Jul 2006 11:07
Modified: 1 Sep 2006 15:11
Reporter: Jan-willem van Eys
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.11
OS: Linux (RHEL-ES4 (Nahant))
Assigned to: Jonas Oreland
CPU Architecture: Any

Description:
After loading a cluster setup with 2 data nodes with our test database (268 tables, 180MB text dump), I brought the second node down from within ndb_mgm (by issuing '3 stop').
This works, except that ndb_mgm loses its connection and needs to be run again.

Starting the ndbd process on the server causes the following error:

2006-07-25 12:38:19 [ndbd] INFO     -- Error handler startup shutting down system
2006-07-25 12:38:19 [ndbd] INFO     -- Error handler shutdown completed - exiting
2006-07-25 12:38:19 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2006-07-25 12:38:19 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

The error log reports:

Current byte-offset of file-pointer is: 568

Time: Tuesday 25 July 2006 - 12:38:19
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: restore.cpp
Error object: RESTORE (Line: 1150) 0x0000000a
Program: ndbd
Pid: 29331
Trace: /mnt/lvol1/mysql-cluster/ndb_3_trace.log.1
Version: Version 5.1.11 (beta)
***EOM***
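
For triage, the fields of an error log entry like the one above can be pulled out with a short script. This is a minimal sketch that assumes only the `Key: value` layout and the `***EOM***` terminator visible in the entry above; the function name `parse_ndb_error_entry` is hypothetical and not part of any MySQL tool.

```python
import re

# The error log entry quoted above, used here as sample input.
SAMPLE = """\
Time: Tuesday 25 July 2006 - 12:38:19
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: restore.cpp
Error object: RESTORE (Line: 1150) 0x0000000a
Program: ndbd
Pid: 29331
Trace: /mnt/lvol1/mysql-cluster/ndb_3_trace.log.1
Version: Version 5.1.11 (beta)
***EOM***
"""


def parse_ndb_error_entry(text):
    """Split one error log entry into a dict of its 'Key: value' lines.

    Hypothetical helper: it stops at the '***EOM***' marker and skips
    any line that does not contain ': '.
    """
    fields = {}
    for line in text.splitlines():
        if line.strip() == "***EOM***":
            break
        key, sep, value = line.partition(": ")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


entry = parse_ndb_error_entry(SAMPLE)

# "Error object" names the kernel block and the source line of the failed check.
match = re.search(r"^(\w+) \(Line: (\d+)\)", entry["Error object"])
block, line_no = match.group(1), int(match.group(2))

print(entry["Error"], entry["Error data"], block, line_no)
```

Here the failed ndbrequire is in the RESTORE block, at line 1150 of restore.cpp, which is the part of the entry a developer needs first.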

Restarting doesn't work, it just causes the same error.

The commands used to set up a logfile group:
CREATE LOGFILE GROUP lg_1
ADD UNDOFILE 'undo_1.dat'
INITIAL_SIZE 128M
ENGINE NDB;

ALTER LOGFILE GROUP lg_1
ADD UNDOFILE 'undo_2.dat'
INITIAL_SIZE 128M
ENGINE NDB;

CREATE TABLESPACE ts_1
ADD DATAFILE 'data_1.dat'
USE LOGFILE GROUP lg_1
INITIAL_SIZE 1024M
ENGINE NDB;

ALTER TABLESPACE ts_1
ADD DATAFILE 'data_2.dat'
INITIAL_SIZE 1024M
ENGINE NDB;

The config.ini:

[NDBD DEFAULT]    
NoOfReplicas=2
DataMemory=1024M
IndexMemory=768M
MaxNoOfOrderedIndexes=800
MaxNoOfAttributes=8000
MaxNoOfTables=600

[TCP DEFAULT]     
portnumber=2202

[NDB_MGMD]                      
hostname=192.168.100.4          # Grijs is management node
datadir=/var/lib/mysql-cluster

# Test2 - Data Node
[NDBD]                          
hostname=192.168.100.2          # Test2
datadir=/mysql

# Test3 - Data Node
[NDBD]                          
hostname=192.168.100.3          # Test3
datadir=/mnt/lvol1/mysql-cluster

# SQL node options:
[MYSQLD]                        
hostname=192.168.100.3          # Test3 runs a mysqld
[MYSQLD]
hostname=192.168.100.1          # Test1 runs a mysqld
[MYSQLD]
[MYSQLD]
[MYSQLD]

How to repeat:
Build a 4 machine cluster with MySQL v5.1.11, 2 data nodes, 1 mgm node, 1 mysqld node.

- Set the database up with disk-based storage.
- Create all tables (by running the .sql files from a mysqldump, modified for NDB with disk storage)
- Fill all tables (by running mysqlimport)
- issue '3 stop' from the ndb_mgm console
- run 'ndbd' on the datanode you brought down
- wait.

Changing to the appropriate category: Cluster.

Can you please upload the trace file as well?

tarball with error and trace logs

Attachment: cluster_crash_logs.tgz (application/x-compressed-tar, text), 38.80 KiB.

trace logs from the crashing node added

Thx...

I _think_ I know what the problem is...

Note for myself: Two fragments should be restored using different LCP-no
                 But I incorrectly put them in the same file, making LCP restore...

As a work-around you should be able to start the node with "--initial"
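
The work-around can be spelled out as command lines. The sketch below only builds and prints them; it does not run anything. It assumes the management node address from the config.ini in the report (192.168.100.4) and uses the `--ndb-connectstring` and `--initial` options of `ndbd` and the `-e` option of `ndb_mgm`. The helper name `workaround_commands` is hypothetical.

```python
import shlex

# Assumption: management node address taken from the reporter's config.ini.
MGM_HOST = "192.168.100.4"


def workaround_commands(node_id, mgm_host=MGM_HOST):
    """Return (where_to_run, argv) pairs for the --initial work-around.

    Hypothetical helper: nothing is executed, the commands are only built.
    """
    connect = f"--ndb-connectstring={mgm_host}"
    return [
        # Start the failed data node with an initial node restart.
        (f"data node {node_id}", ["ndbd", connect, "--initial"]),
        # Check from the management client that the node has rejoined.
        ("management client", ["ndb_mgm", connect, "-e", "show"]),
    ]


for where, argv in workaround_commands(3):
    print(f"[{where}] {shlex.join(argv)}")
```

An initial restart makes the node rebuild its data from the other data node instead of restoring from its own local checkpoint files, which is the step that fails in this report.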

Hi

1) Would it be possible to get access to the filesystem causing this?
2) What hardware do you use?
   What Linux kernel is used in RHEL-ES4 (Nahant)? (try "uname -a")

/Jonas

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10115

ChangeSet@1.2263, 2006-08-07 14:28:58+02:00, jonas@perch.ndb.mysql.com +10 -0
  ndb - bug#21271
    make each fragment use own LCP file, so that (s/n)r can use different LCP-no for different fragments

NOTE TO JON: This patch makes filesystem incompatible with earlier version
  (i.e a noderestart --initial is needed during upgrade)

pushed to 5.1.12

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented incompatible change in 5.1.12 changelog and Cluster Upgrades/Downgrades section of Manual.