MySQL Bugs: #48643: NDBD node crashes and refuses integrate on restart

Bug #48643	NDBD node crashes and refuses integrate on restart
Submitted:	9 Nov 2009 16:49	Modified:	10 Nov 2009 12:22
Reporter:	theblue Sage	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Disk Data	Severity:	S1 (Critical)
Version:	7.0.7	OS:	Linux (CentOS 5.3_64)
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	ndbd errror. NDBD non start

Description:
Cluster with 2 data, 2 sql and 1 management node. Ran fine for about 2 days, then a data node crashed. From that point on the data node refuses to start again.
The error message in ndb_4_out.log showed this :
2009-11-08 14:01:04 [ndbd] INFO     -- dblqh/DblqhMain.cpp
2009-11-08 14:01:04 [ndbd] INFO     -- DBLQH (Line: 7661) 0x0000000e
2009-11-08 14:01:04 [ndbd] INFO     -- Error handler shutting down system
2009-11-08 14:01:04 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-11-08 14:01:04 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequir
e)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

however there was _nothing_ in the ndb_4_error.log file. 

When I try to restart the data node, it refuses to start. After quite a long time (longer that I would expect it to take to sync up), I get this error message in ndb_4_out.log :

2009-11-09 10:36:05 [ndbd] INFO     -- dblqh/DblqhMain.cpp
2009-11-09 10:36:05 [ndbd] INFO     -- DBLQH (Line: 7661) 0x0000000a
2009-11-09 10:36:05 [ndbd] INFO     -- Error handler startup shutting down system
2009-11-09 10:36:05 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-11-09 10:36:05 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2009-11-09 10:36:05 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal pr
ogram error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart nod
e'.

preceding this in the logs are a lot of lines saying this :
2009-11-09 10:36:04 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7605
2009-11-09 10:36:04 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:04 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7705
2009-11-09 10:36:04 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:05 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7805
2009-11-09 10:36:05 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:05 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7906
2009-11-09 10:36:05 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:05 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=8006

How to repeat:
try to start the data node with 'ndbd -v' after it has crashed. It refuses to start.

Suggested fix:
data node should start and sync correctly when added to the cluster.

Error logs from ndb_error_reporter

Attachment: ndb_error_report_20091109113559.tar.bz2 (application/x-bzip, text), 300.99 KiB.

Eror files now attached , taken from ndb_Error_reporter.

attempting to start the node without using '--initial' but first I wiped out all the disk data stored in '/var/lib/mysql/ndb_4_fs'. This failed after a two hour sync up.
Something I noticed: I was expecting the data usage of the newer syncing node (nodeid-5) to come up to and then match the data usage of the good node. However as you can see from the log snippet below, the data usage went above the original before the node shut itself down and gave up.

Node 5: Index usage is 3%(5708 8K pages of total 173856)
Node 4: Data usage is 6%(21899 32K pages of total 347520)
Node 5: Data usage is 6%(21761 32K pages of total 347520)
Node 5: Index usage is 3%(5709 8K pages of total 173856)
Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

new error logs after failure to restart with a wipe of all disk data ...

Attachment: ndb_error_report_20091109162716.tar.bz2 (application/x-bzip, text), 253.11 KiB.

okay, so I blew away all the log files, and the data directory, then started the node with :

ndbd --initial -v

and the node failed to start with the same error. I have attached the new error logs ...

ndb_error_reporter for third attempt (--initial used)

Attachment: ndb_error_report_20091109200234.tar.bz2 (application/x-bzip, text), 283.73 KiB.

This is a duplicate of http://bugs.mysql.com/bug.php?id=47935
Which is fixed in 7.0.9