Bug #48232 Crash in DBDICT (Line: 4115)
Submitted: 22 Oct 2009 14:13 Modified: 27 Oct 2009 6:43
Reporter: Andy Lintner
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-5.1-telco-7.0 OS: Linux (RHEL 5.4)
Assigned to: Jonas Oreland CPU Architecture: Any
Tags: 7.0.8a

[22 Oct 2009 14:13] Andy Lintner
Description:
When restarting a data node after a GCP stop, I experienced the following error, and am now unable to start the node. The same error occurs when doing an initial start.

Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 4115) 0x0000000e
Program: /usr/local/mysql//mysql/bin//ndbmtd
Pid: 2578 thr: 0
Version: mysql-5.1.37 ndb-7.0.8
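
For triage, the fields of an error report like the one above can be pulled into a structure. The following is only a minimal sketch, assuming the "Key: value" layout shown in this report; `parse_error_report` and the `block`/`line` keys are hypothetical names, not part of any NDB tool.

```python
import re


def parse_error_report(text):
    """Parse 'Key: value' lines from an NDB error report into a dict.

    Hypothetical helper: it assumes the layout seen in this bug report.
    """
    report = {}
    for line in text.splitlines():
        # Split on the first ': ' only, so values may contain colons.
        key, sep, value = line.partition(": ")
        if sep:
            report[key.strip()] = value.strip()
    # 'Error object' looks like 'DBDICT (Line: 4115) 0x0000000e' here.
    match = re.match(r"(\w+) \(Line: (\d+)\)", report.get("Error object", ""))
    if match:
        report["block"] = match.group(1)
        report["line"] = int(match.group(2))
    return report
```

Applied to the report above, this would give the kernel block name and source line as separate fields, which makes it easier to search for other reports that hit the same ndbrequire.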

How to repeat:
unknown
[22 Oct 2009 14:13] Andy Lintner
Trace files from the crash

Attachment: ndb_4_logs.tar.gz (application/x-gzip, text), 129.53 KiB.

[22 Oct 2009 14:14] Andy Lintner
config.ini

Attachment: config.ini (text/plain), 4.59 KiB.

[22 Oct 2009 14:26] Jonas Oreland
The cluster log would also be good
(note: I haven't actually checked the traces yet, but the cluster log is always
 good to have around)
[22 Oct 2009 14:38] Andy Lintner
Cluster Log

Attachment: ndb_1_cluster.log (application/octet-stream, text), 110.74 KiB.

[22 Oct 2009 16:14] Jonas Oreland
The problem seems to be that the alive node has bigger SharedGlobalMemory than
the starting node.

My guess is that you
1) Started cluster with a value for SharedGlobalMemory
2) changed the value
3) restarted this node with a lower value

I am not entirely sure, but fairly sure that setting that value,
restarting the management server with "ndb_mgmd --reload", and then starting
the problematic node will make the problem go away.
[22 Oct 2009 17:42] Andy Lintner
I restarted both management nodes, followed by the active node, and then the inactive node, which experienced the same fault. However, your comment on memory made me dig deeper, and I discovered an unrelated runaway process consuming memory on that server. Only 2G was available to the node, instead of the normal 8G. Killing that process allowed the node to start up.

However, the error message was obviously less than helpful. Since your diagnosis indicated a mismatched SharedGlobalMemory, is there anything that would have dynamically resized SharedGlobalMemory down in response to insufficient available memory? Either way, my issue is resolved, so I moved this down to Non-critical since it seems to be just an issue of error reporting.
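
Since the root cause here was a lack of free memory on the host, a check before starting the data node could have caught it earlier. The following is a rough sketch only, assuming a Linux host with /proc/meminfo; the function names are hypothetical, and the fallback estimate for kernels without a MemAvailable field (such as the one in RHEL 5) is an approximation, not an exact figure.

```python
def available_kib(meminfo_text):
    """Estimate available memory in KiB from the text of /proc/meminfo."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue
        parts = rest.split()
        if parts and parts[0].isdigit():
            fields[name.strip()] = int(parts[0])  # values are in kB
    if "MemAvailable" in fields:
        return fields["MemAvailable"]
    # Assumption: on older kernels, approximate with free memory plus
    # buffers and page cache, which can usually be reclaimed.
    return (fields.get("MemFree", 0)
            + fields.get("Buffers", 0)
            + fields.get("Cached", 0))


def enough_memory(meminfo_text, required_gib):
    """Return True if the estimate covers required_gib GiB."""
    return available_kib(meminfo_text) >= required_gib * 1024 * 1024
```

A wrapper script could read /proc/meminfo, call `enough_memory(text, 8)` with the amount the node normally needs (8G in this case), and refuse to start ndbmtd if the check fails.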
[26 Oct 2009 14:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/88181

3163 Jonas Oreland	2009-10-26
      ndb - bug#48232 - improve error reporting when failure to recreate/drop object during restore of schema
[26 Oct 2009 14:42] Jonas Oreland
Added informative error message
Pushed to 7.0.9
[27 Oct 2009 6:43] Jon Stephens
Bugfix documented in the NDB-7.0.9 changelog as follows:

        When a data node failed to start due to inability to recreate or
        drop objects during schema restoration (for example:
        insufficient memory was available to the data node process on
        account of issues not directly related to MySQL Cluster on the
        host machine), the reason for the failure was not provided. Now
        in such cases, a more informative error message is logged.

Closed.