Bug #18475 Out of memory in COPY_FRAG
Submitted: 23 Mar 2006 23:50 Modified: 6 Sep 2006 10:09
Reporter: Jim Conner Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.0.19-max-log with openssl OS:Linux (Fedora Core 4)
Assigned to: Jonas Oreland CPU Architecture:Any

[23 Mar 2006 23:50] Jim Conner
Description:
While restarting the cluster storage nodes (after bouncing mgmd on mgm host) I received an error indicating I should report this incident as a bug.

-- snip --

ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

2006-03-23 10:27:29 [ndbd] INFO     -- Angel pid: 13333 ndb pid: 13334
2006-03-23 10:27:29 [ndbd] INFO     -- NDB Cluster -- DB node 3
2006-03-23 10:27:29 [ndbd] INFO     -- Version 5.0.19 --
2006-03-23 10:27:29 [ndbd] INFO     -- Configuration fetched at tlbfc04mgm01 port 1186
2006-03-23 10:27:29 [ndbd] INFO     -- Start initiated (version 5.0.19)
Management server closed connection early. It is probably being shut down (or has crashed). We will retry the connection.
2006-03-23 14:23:59 [ndbd] INFO     -- Restarting system
2006-03-23 14:24:00 [ndbd] INFO     -- Node 3: Node shutdown completed, restarting.
2006-03-23 14:24:00 [ndbd] INFO     -- Ndb has terminated (pid 13334) restarting
2006-03-23 14:24:03 [ndbd] INFO     -- Angel pid: 13333 ndb pid: 13915
2006-03-23 14:24:03 [ndbd] INFO     -- NDB Cluster -- DB node 3
2006-03-23 14:24:03 [ndbd] INFO     -- Version 5.0.19 --
2006-03-23 14:24:03 [ndbd] INFO     -- Configuration fetched at tlbfc04mgm01 port 1186
2006-03-23 14:24:04 [ndbd] INFO     -- Start initiated (version 5.0.19)
2006-03-23 14:30:11 [ndbd] INFO     -- Error handler startup shutting down system
2006-03-23 14:30:12 [ndbd] INFO     -- Error handler shutdown completed - exiting
2006-03-23 14:30:12 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2006-03-23 14:30:12 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 4. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

--- snip ---

Current config.ini:

-- snip --

[NDBD DEFAULT]
NoOfReplicas                  = 3
DataMemory                    = 1700M
IndexMemory                   = 100M
datadir                       = /var/lib/mysql-cluster
MaxNoOfAttributes             = 15000
MaxNoOfOrderedIndexes         = 6000
MaxNoOfUniqueHashIndexes      = 6000
MaxNoOfConcurrentOperations   = 49152
MaxNoOfConcurrentTransactions = 8192
# This next line should be used in normal use of the
# cluster.
#TimeBetweenLocalCheckPoints   = 20
# This next line should be used if doing heavy loading of
# data into the cluster.
TimeBetweenLocalCheckPoints   = 6

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]
DataDir   = /var/lib/mysql-cluster

[TCP DEFAULT]

[NDB_MGMD]
HostName  = tlbfc04mgm01

[NDBD]
HostName  = tlbfc04ndb01

[NDBD]
HostName  = tlbfc04ndb02

[NDBD]
HostName  = tlbfc04ndb03

[MYSQLD]
HostName  = tlbfc04api01

[MYSQLD]
HostName  = tlbfc04api02

[MYSQLD]
[MYSQLD]

--- snip ---

How to repeat:
Unknown.  Steps taken when this occurred:

I am in the process of building a new cluster from an existing database.  So, I built the cluster which is thus:

host 1: mgm
host 2: ndbd
host 3: ndbd + api
host 4: ndbd + api

I then imported the old data into the database using the old engines for each db.  Then I wrote a script that changes all the schema tables to ndbcluster engine.  While this script was working I received the following error:

Using sql: 'alter table Breakdown engine = ndbcluster' on gmacsmartauctiondollars.Breakdown
DBD::mysql::st execute failed: Got temporary error 410 'REDO log files overloaded, consult online manual (decrease TimeBetweenLocalCheckpoints, and|or incre' from ndbcluster at /usr/local/sbin/engine2ndbcluster line 56.
Got temporary error 410 'REDO log files overloaded, consult online manual (decrease TimeBetweenLocalCheckpoints, and|or incre' from ndbcluster at /usr/local/sbin/engine2ndbcluster line 56.
        -1 were affected...

So, in my research of how to resolve this problem I increased the config.ini directive MaxNoOfAttributes to 15000 (I admit that this is somewhat of a shot in the dark value) from 5000.  I then bounced the mgmd.  Then I told nbd01 to restart.  After it came back up (finished all 5 stages of startup) I restarted ndb02 (via mgm).  This is when I received the error shown above.  Now, anytime I try and start the node I get the following error:

2006-03-23 14:50:05 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

Now, I just stopped the mgmd and restarted it.  My mgmd startup is this:
/usr/sbin/ndb_mgmd -f /var/lib/mysql-cluster/config.ini --ndb-optimized-node-selection

Suggested fix:
Bounce mgmd (I use killall ndb_mgmd) and restart it.  Then restart the affected ndbd on the appropriate storage node.

After a while of waiting I received the following line in my logfile (on ndb02)
NR: setLcpActiveStatusEnd - !m_participatingLQH

I'm researching that now but so far nothing has blown up...
[23 Mar 2006 23:55] Jim Conner
Here is the output from 'show' in mgm when the error(s) occurred:

ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
show

Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=2    @10.10.20.101  (Version: 5.0.19, Nodegroup: 0)
id=3 (not connected, accepting connect from tlbfc04ndb02)
id=4    @10.10.20.103  (Version: 5.0.19, Nodegroup: 0, Master)

[ndb_mgmd(MGM)] 1 node(s)
id=1    @10.10.20.100  (Version: 5.0.19)

[mysqld(API)]   4 node(s)
id=5    @10.10.20.102  (Version: 5.0.19)
id=6    @10.10.20.103  (Version: 5.0.19)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)

ndb_mgm>
ndb_mgm> showNode 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
[24 Mar 2006 5:27] Jonas Oreland
Hi,

Can you upload the ndb*error.log and ndb*trace.log files?
[3 Apr 2006 18:16] Jim Conner
Unfortuantely, I can't do that because I removed those files a loooong time ago :(  Is there anything else I can get you?  I've pretty much given up on ndbcluster for now until it becomes a little more mature and less seemingly fragile (aka 5.1.x)
[15 May 2006 19:01] heh heh
I'm having a very similar issue, it smells the same though the error message code is slightly different:

ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 5. In
itiated by signal 0. Caused by error 2303: 'System error, node killed during nod
e restart by other node(Internal error, programming error or missing error messa
ge, please report a bug)

I'm uploading my trace, log and ini files.
[15 May 2006 19:03] heh heh
Ooops, it won't let me add files, should I open a new bug or email these files to someone?

Thanks
[4 Aug 2006 9:17] Jonas Oreland
Latest uploaded traces indicates that copy_frag_ref failed due to 
  out of datamemory.

Makeing it duplicate of http://bugs.mysql.com/bug.php?id=20261

Solution is to
1) change memory allocation strategy (likly in 5.1/5.2)
2) Improve error message on COPY_FRAGREF
[7 Aug 2006 7:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10101

ChangeSet@1.2540, 2006-08-07 09:41:39+02:00, jonas@perch.ndb.mysql.com +1 -0
  ndb - bug#18475
    Fix error message on copyfragref
[1 Sep 2006 8:08] Jonas Oreland
pushed to 5.1.12
[1 Sep 2006 19:31] Jonas Oreland
pushed to 5.0.25
[6 Sep 2006 7:14] Jonas Oreland
pushed into 4.1.22
[6 Sep 2006 10:09] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix in 4.1.22, 5.0.25, and 5.1.12 changelogs.