Bug #26641 Temporary error 1501 'Out of undo space' kills datanode
Submitted: 26 Feb 2007 16:57
Modified: 4 Jul 2007 12:25
Reporter: Erik Hoekstra
Status: Duplicate
Category: MySQL Cluster: Disk Data
Severity: S2 (Serious)
Version: 5.1.14 (beta)
OS: Linux (Red Hat ES 4)
Assigned to:
CPU Architecture: Any
Tags: 1501, lgman, ndbrequire, undo space, undo_buffer, undo_buffer_size

[26 Feb 2007 16:57] Erik Hoekstra
Description:
ERROR 1297 (HY000) at line 1: Got temporary error 1501 'Out of undo space' from NDBCLUSTER. This error kills a data node.

I'm working with about 230 MyISAM tables that currently hold archive data, so no changes are made to them. I'm trying to move those tables to NDBCLUSTER, using tablespaces for disk storage so that their data won't eat up all my gigabytes of RAM.

After running ALTER TABLE statements for a while, ERROR 1297 appears, followed by the error messages below from ndb_mgm and from the data node itself:

Time: Monday 26 February 2007 - 15:27:25
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 1778) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 16462
Trace: /usr/local/mysql/data/ndb_14_trace.log.3
Version: Version 5.1.14 (beta)
***EOM***

lgman.cpp (log manager?) at line 1778:

01775       undo[2] |= File_formats::Undofile::UNDO_NEXT_LSN << 16;
01776       Uint32 *dst= get_log_buffer(ptr, sizeof(undo) >> 2);
01777       memcpy(dst, undo, sizeof(undo));
01778       ndbrequire(ptr.p->m_free_file_words >= (sizeof(undo) >> 2));
01779       ptr.p->m_free_file_words -= (sizeof(undo) >> 2);
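
To make the failing check easier to read, here is a minimal sketch of the word accounting that the excerpt appears to perform. It is not the actual LGMAN code: the struct, the helper names and the three-word size of `undo` are assumptions made for illustration, and only `m_free_file_words` and the `sizeof(undo) >> 2` conversion come from the excerpt. The sketch also checks before subtracting and reports failure, whereas the real code aborts the node through ndbrequire.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the record that ptr.p points to in the excerpt.
// Only the m_free_file_words field seen in lgman.cpp is modeled.
struct LogfileGroupSketch {
    std::uint64_t m_free_file_words;  // free undo file space, in 32-bit words
};

// Converts a byte count to 32-bit words, as `sizeof(undo) >> 2` does.
constexpr std::uint32_t bytes_to_words(std::size_t bytes) {
    return static_cast<std::uint32_t>(bytes >> 2);
}

// Tries to reserve `words` words of undo file space.
// Returns false in the situation where the ndbrequire on line 1778
// would fail; the counter is left unchanged in that case.
bool try_reserve(LogfileGroupSketch& group, std::uint32_t words) {
    if (group.m_free_file_words < words) {
        return false;  // fewer free words than the record needs
    }
    group.m_free_file_words -= words;
    return true;
}
```

With a three-word record (assumed here), a group with 4 free words accepts one record and is left with 1 free word; a second attempt is rejected, which corresponds to the condition under which the data node stopped.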

2007-02-26 16:30:28 [MgmSrvr] INFO     -- Node 12: Local checkpoint 1553 started. Keep GCI = 2171832 oldest restorable GCI = 2171947
2007-02-26 16:36:18 [MgmSrvr] INFO     -- Node 12: Local checkpoint 1554 started. Keep GCI = 2173457 oldest restorable GCI = 2173556
2007-02-26 16:42:10 [MgmSrvr] INFO     -- Node 12: Local checkpoint 1555 started. Keep GCI = 2173622 oldest restorable GCI = 2173749
2007-02-26 16:55:05 [MgmSrvr] INFO     -- Node 12: Local checkpoint 1556 started. Keep GCI = 2173804 oldest restorable GCI = 2173911
2007-02-26 16:57:15 [MgmSrvr] ALERT    -- Node 11: Node 14 Disconnected
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 11: Communication to Node 14 closed
2007-02-26 16:57:15 [MgmSrvr] ALERT    -- Node 12: Node 14 Disconnected
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 12: Communication to Node 14 closed
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 12: Communication to Node 14 closed
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 1: Node 14 Connected
2007-02-26 16:57:15 [MgmSrvr] ALERT    -- Node 12: Arbitration check won - node group majority
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 12: President restarts arbitration thread [state=6]
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 12: DICT: lock bs: 3 ops: 1 poll: 0 cnt: 0 queue:
2007-02-26 16:57:15 [MgmSrvr] ALERT    -- Node 13: Node 14 Disconnected
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 13: Communication to Node 14 closed
2007-02-26 16:57:15 [MgmSrvr] INFO     -- Node 13: Communication to Node 14 closed
2007-02-26 16:57:16 [MgmSrvr] ALERT    -- Node 14: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2007-02-26 16:58:14 [MgmSrvr] WARNING  -- Node 11: Failure handling of node 14 has not completed in 1 min. - state = 3
2007-02-26 16:58:14 [MgmSrvr] WARNING  -- Node 12: Failure handling of node 14 has not completed in 1 min. - state = 3
2007-02-26 16:58:14 [MgmSrvr] WARNING  -- Node 13: Failure handling of node 14 has not completed in 1 min. - state = 3

How to repeat:

CREATE LOGFILE GROUP a_loggroup
ADD UNDOFILE './loggroups/a_undo.dat'
INITIAL_SIZE 10M
ENGINE NDBCLUSTER;

CREATE TABLESPACE a_archive_01
ADD DATAFILE './tablespaces/a_archive_01.dat'
USE LOGFILE GROUP a_loggroup
INITIAL_SIZE 12M
ENGINE NDBCLUSTER;

Then run the following a couple of times, once per table:
ALTER TABLE x TABLESPACE a_archive_01 STORAGE DISK, ENGINE NDBCLUSTER;

etc.

...

Suggested fix:
I tried using a bigger undo_buffer_size for the LOGFILE GROUP, but that didn't help.