MySQL Bugs: #37985: Forced node shutdown completed. Initiated by signal 11 (Caused by error 2308)

Bug #37985	Forced node shutdown completed. Initiated by signal 11 (Caused by error 2308)
Submitted:	9 Jul 2008 9:52	Modified:	17 Nov 2008 16:35
Reporter:	Hans-Christian Andersen	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	Mysql-cluster GPL 6.2.15	OS:	Linux (Debian Etch 4.0 r3 AMD64)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	cluster, shutdown, Signal 11

Description:
I have the following setup (load balanced system).

1xCluster manager
- Mysql-cluster GPL 6.2.15 Linux i686 glibc23 (MySQL's binaries)
- Debian Sarge
- Ultramonkey

2xNDB Nodes+API
- Mysql-cluster GPL 6.2.15 Linux x86_64 glibc23 (MySQL's binaries)
- Debian Etch 4.0r3 AMD64

I set up the whole rig and it runs fine. I configured the cluster management node with the following ( /var/lib/mysql-cluster/config.ini ):

===================================
[NDBD DEFAULT]
NoOfReplicas=2
# added for test
DataMemory=7200MB
IndexMemory=300MB
RedoBuffer=12MB
NoOfFragmentLogFiles=300
MaxNoOfOrderedIndexes = 512
MaxNoOfAttributes = 5000
MaxNoOfTables = 512

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]

[TCP DEFAULT]
# Section for the cluster management node
[NDB_MGMD]
# IP address of the management node (this system)
HostName=192.168.3.5

# Section for the storage nodes
[NDBD]
# IP address of the first storage node
HostName=192.168.3.40
DataDir= /var/lib/mysql-cluster
BackupDataDir=/home/backups

[NDBD]
# IP address of the second storage node
HostName=192.168.3.41
DataDir=/var/lib/mysql-cluster
BackupDataDir=/home/backups

# one [MYSQLD] per storage node
[MYSQLD]
[MYSQLD]
===================================

We have a rather large database and we kitted out cluster nodes with 12 Gb of RAM. Hence, I upped the data and index memory to use lots of memory.

I imported our old database, roughly 5 gigs in size, and I successfully altered their tables to use the ndbcluster engine. Everything was working fine, until we needed to restart our cluster manager. I entered "ndb_mgm" and ran "shutdown;". Once the nodes shutdown, I rebooted the server and when it came up again, I found it would not start.

Running "ndbd" on both nodes starts the process, as you can see on the output of the cluster manager below:

================================
ndb_mgm> show;
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 @192.168.3.40 (mysql-5.1.23 ndb-6.2.15, starting, Nodegroup: 0, Master)
id=3 @192.168.3.41 (mysql-5.1.23 ndb-6.2.15, starting, Nodegroup: 0)

[ndb_mgmd(MGM)] 1 node(s)
id=1 (mysql-5.1.23 ndb-6.2.15)

[mysqld(API)] 2 node(s)
id=4 (not connected, accepting connect from any host)
id=5 (not connected, accepting connect from any host)
================================

And I can see in top that ndbd starts bringing up the database as the system memory usage is increasing. But after a little bit, suddenly memory drops back down to normal and the cluster manager outputs the following (vague) error:

================================
ndb_mgm> all status;
Node 2: starting (Phase 3) (mysql-5.1.23 ndb-6.2.15)
Node 3: starting (Phase 3) (mysql-5.1.23 ndb-6.2.15)

ndb_mgm> Node 3: Forced node shutdown completed. Initiated by signal 11.
Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.

ndb_mgm> all status;
Node 2: not connected
Node 3: not connected
================================

The strange thing is, it would appear that node 3 is failing, but "ndb_3_error.log" doesn't report anything, while "ndb_2_error.log" reports:

================================
Time: Tuesday 8 July 2008 - 12:11:25
Status: Temporary error, restart node
Message: Another node failed during system restart, please investigate error(s) on other node(s) (Restart error)
Error: 2308
Error data: Node 3 disconnected
Error object: QMGR (Line: 2681) 0x0000000a
Program: ndbd
Pid: 952
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.12
Version: mysql-5.1.23 ndb-6.2.15
***EOM***
================================

"ndb_3_out.log" shows that it was busy restoring tables (successfully) before it simply says "2008-07-08 10:17:25 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 11."

"ndb_2_out.log" shows the same restoring list and then reports:

================================
2008-07-08 12:11:25 [ndbd] INFO -- Node 3 disconnected
2008-07-08 12:11:25 [ndbd] INFO -- QMGR (Line: 2681) 0x0000000a
2008-07-08 12:11:25 [ndbd] INFO -- Error handler startup shutting down system
2008-07-08 12:11:25 [ndbd] INFO -- Error handler shutdown completed - exiting
2008-07-08 12:11:25 [ndbd] INFO -- Angel received ndbd startup failure count 1.
2008-07-08 12:11:25 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
================================

What could possibly be going on? Node 3 is failing, yet it fails to report any errors.

I will include error and traces in a moment.

How to repeat:
Install Mysql-cluster GPL 6.2.15 Linux x86_64 glibc23 (MySQL's binaries) on 2 cluster nodes. Import 5 Gb database and convert all tables to NDBCLUSTER engine.

Shutdown nodes from ndb_mgm using "shutdown;" command and reboot NDB management server. Restart "ndb_mgmd" and execute "ndbd" on both cluster nodes.

ndb_2_error.log

Attachment: ndb_2_error.log (application/octet-stream, text), 1.67 KiB.

ndb_2_out.log

Attachment: ndb_2_out.log (application/octet-stream, text), 157.98 KiB.

ndb_2_trace.log.12

Attachment: ndb_2_trace.log.12.gz (application/x-gzip, text), 29.87 KiB.

ndb_3_error.log

Attachment: ndb_3_error.log (text/x-log), 1.18 KiB.

ndb_3_out.log

Attachment: ndb_3_out.log (text/x-log), 111.27 KiB.

ndb_3_trace.log.9

Attachment: ndb_3_trace.log.9.gz (application/x-gzip, text), 29.98 KiB.

Hi,
  Thanks for the comprehensive bug report, but I think you need to send more infomration.  The trace files sent are for node shutdowns due to 'some other node failed during startup'.  I would expect that when some node suffers from signal 11 (SEGV), we would get an entry in its error log, and an associated trace file.  If this is not happening for some reason then perhaps the ndbd process can be run with the --core-file option, and a core file can be obtained?  Getting a stack traceback from the core file would be useful for understanding what is going on.

Looking at the log files you sent, there seem to be a lot of shutdown and restarts and file corruption is reported by both nodes at various points.  Can you give a fuller description of the actions which have been recorded in the log files?

Have you managed to get past this bug?  Have you been able to reproduce it?  Any further feedback appreciated.
Thanks,
Frazer

Interesting entries in  ndb_3_out.log: 

2008-07-07 11:14:48 [ndbd] INFO     -- Error 0 during restore of  2/T6F0
2008-07-07 11:14:48 [ndbd] INFO     -- RESTORE (Line: 1173) 0x0000000a
2008-07-07 11:14:48 [ndbd] INFO     -- Error handler startup shutting down system
2008-07-07 11:14:48 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-07 11:14:48 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-07-07 11:14:48 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2352: 'Invalid LCP(Ndbd file system
inconsistency error, please report a bug). Ndbd file system error, restart node
initial'.

...

2008-07-07 15:47:15 [ndbd] INFO     -- DBLQH: File system open failed. OS errno: 2
2008-07-07 15:47:15 [ndbd] INFO     -- DBLQH (Line: 1861) 0x0000000a
2008-07-07 15:47:15 [ndbd] INFO     -- Error handler startup shutting down system
2008-07-07 15:47:15 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-07 15:47:15 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-07-07 15:47:15 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

Then multiple startup failures like this : 
RESTORE table: 250 21195 rows applied
2008-07-08 11:37:09 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 11.

In ndb_2_out.log I see : 

2008-07-04 00:42:57 [ndbd] INFO     -- Error opening DIH schema files for table: 10
2008-07-04 00:42:57 [ndbd] INFO     -- DBDIH (Line: 9509) 0x0000000a
2008-07-04 00:42:57 [ndbd] INFO     -- Error handler startup shutting down system
2008-07-04 00:42:58 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-04 00:42:58 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-07-04 00:42:58 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

...

2008-07-04 00:44:24 [ndbd] INFO     -- Error opening DIH schema files for table: 10
2008-07-04 00:44:24 [ndbd] INFO     -- DBDIH (Line: 9509) 0x0000000a
2008-07-04 00:44:24 [ndbd] INFO     -- Error handler startup shutting down system
2008-07-04 00:44:25 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-04 00:44:25 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-07-04 00:44:25 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

...

2008-07-04 10:37:32 [ndbd] INFO     -- /home/backups
2008-07-04 10:37:32 [ndbd] INFO     -- BackupDataDir
2008-07-04 10:37:32 [ndbd] INFO     -- Error handler shutting down system
2008-07-04 10:37:32 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-04 10:37:32 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 0. Caused by error 2805: 'Illegal file system path(Configuration error). Permanent error, external action needed'.

...

2008-07-07 17:41:15 [ndbd] INFO     -- DBLQH: File system open failed. OS errno: 2
2008-07-07 17:41:15 [ndbd] INFO     -- DBLQH (Line: 1861) 0x0000000a
2008-07-07 17:41:15 [ndbd] INFO     -- Error handler startup shutting down system
2008-07-07 17:41:15 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-07-07 17:41:15 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2008-07-07 17:41:15 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2815: 'File not found(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Can reproduce this at will on ndbsup-1 with 7.0.11 on a single node cluster.

Occured during startphase 4. Caused by error 2352: 'Invalid LCP(Ndbd file system
inconsistency error, please report a bug). Ndbd file system error, restart node
initial'.

Could be related to index memory full.