Bug #48643 NDBD node crashes and refuses integrate on restart
Submitted: 9 Nov 2009 16:49 Modified: 10 Nov 2009 12:22
Reporter: theblue Sage Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Cluster: Disk Data Severity:S1 (Critical)
Version:7.0.7 OS:Linux (CentOS 5.3_64)
Assigned to: Assigned Account CPU Architecture:Any
Tags: ndbd errror. NDBD non start

[9 Nov 2009 16:49] theblue Sage
Description:
Cluster with 2 data, 2 sql and 1 management node. Ran fine for about 2 days, then a data node crashed. From that point on the data node refuses to start again.
The error message in ndb_4_out.log showed this :
2009-11-08 14:01:04 [ndbd] INFO     -- dblqh/DblqhMain.cpp
2009-11-08 14:01:04 [ndbd] INFO     -- DBLQH (Line: 7661) 0x0000000e
2009-11-08 14:01:04 [ndbd] INFO     -- Error handler shutting down system
2009-11-08 14:01:04 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-11-08 14:01:04 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequir
e)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

however there was _nothing_ in the ndb_4_error.log file. 

When I try to restart the data node, it refuses to start. After quite a long time (longer that I would expect it to take to sync up), I get this error message in ndb_4_out.log :

2009-11-09 10:36:05 [ndbd] INFO     -- dblqh/DblqhMain.cpp
2009-11-09 10:36:05 [ndbd] INFO     -- DBLQH (Line: 7661) 0x0000000a
2009-11-09 10:36:05 [ndbd] INFO     -- Error handler startup shutting down system
2009-11-09 10:36:05 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-11-09 10:36:05 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2009-11-09 10:36:05 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal pr
ogram error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart nod
e'.

preceding this in the logs are a lot of lines saying this :
2009-11-09 10:36:04 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7605
2009-11-09 10:36:04 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:04 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7705
2009-11-09 10:36:04 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:05 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7805
2009-11-09 10:36:05 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:05 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=7906
2009-11-09 10:36:05 [ndbd] INFO     -- Watchdog: User time: 33545  System time: 4099
2009-11-09 10:36:05 [ndbd] WARNING  -- Ndb kernel thread 0 is stuck in: Print Job Buffers at crash elapsed=8006

How to repeat:
try to start the data node with 'ndbd -v' after it has crashed. It refuses to start.

Suggested fix:
data node should start and sync correctly when added to the cluster.
[9 Nov 2009 16:50] theblue Sage
Error logs from ndb_error_reporter

Attachment: ndb_error_report_20091109113559.tar.bz2 (application/x-bzip, text), 300.99 KiB.

[9 Nov 2009 16:50] theblue Sage
Eror files now attached , taken from ndb_Error_reporter.
[9 Nov 2009 21:25] theblue Sage
attempting to start the node without using '--initial' but first I wiped out all the disk data stored in '/var/lib/mysql/ndb_4_fs'. This failed after a two hour sync up.
Something I noticed: I was expecting the data usage of the newer syncing node (nodeid-5) to come up to and then match the data usage of the good node. However as you can see from the log snippet below, the data usage went above the original before the node shut itself down and gave up.

Node 5: Index usage is 3%(5708 8K pages of total 173856)
Node 4: Data usage is 6%(21899 32K pages of total 347520)
Node 5: Data usage is 6%(21761 32K pages of total 347520)
Node 5: Index usage is 3%(5709 8K pages of total 173856)
Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
[9 Nov 2009 21:32] theblue Sage
new error logs after failure to restart with a wipe of all disk data ...

Attachment: ndb_error_report_20091109162716.tar.bz2 (application/x-bzip, text), 253.11 KiB.

[10 Nov 2009 1:06] theblue Sage
okay, so I blew away all the log files, and the data directory, then started the node with :

ndbd --initial -v

and the node failed to start with the same error. I have attached the new error logs ...
[10 Nov 2009 1:06] theblue Sage
ndb_error_reporter for third attempt (--initial used)

Attachment: ndb_error_report_20091109200234.tar.bz2 (application/x-bzip, text), 283.73 KiB.

[10 Nov 2009 12:22] Jonas Oreland
This is a duplicate of http://bugs.mysql.com/bug.php?id=47935
Which is fixed in 7.0.9