Bug #34102 Creating LOGFILE group crashes cluster
Submitted: 28 Jan 2008 12:53
Modified: 19 Feb 2009 14:27
Reporter: Johan Andersson
Status: Closed
Category: MySQL Cluster: Disk Data
Severity: S3 (Non-critical)
Version: 5.1.23 ndb 6.3.7; ndb 6.3.13
OS: Linux
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: cluster disk data, disk data, diskdata

[28 Jan 2008 12:53] Johan Andersson
Description:
2 data node vanilla setup:

[ndbd default]
NoOfReplicas=2
LockPagesInMainMemory=1
DataMemory=2000M
IndexMemory=200M
ODirect=1
NoOfFragmentLogFiles=50
FragmentLogFileSize=64M
datadir=/data1/johan/mysqlcluster
MaxNoOfConcurrentOperations=500000
MaxNoOfConcurrentTransactions=32768
SchedulerSpinTimer=400
SchedulerExecutionTimer=80
RealTimeScheduler=1
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
Diskcheckpointspeed=10M
Diskcheckpointspeedinrestart=100M
RedoBuffer=32M
SharedGlobalMemory=256M

I create a logfile group (with undo buffer size = 192M):

mysql> CREATE LOGFILE GROUP lg_1 ADD UNDOFILE '/data0/johan/mysqlcluster/undo_1.dat' INITIAL_SIZE=4096M UNDO_BUFFER_SIZE=192M ENGINE=ndb;
ERROR 1528 (HY000): Failed to create LOGFILE GROUP
mysql> show errors;
+-------+------+-------------------------------------------+
| Level | Code | Message                                   |
+-------+------+-------------------------------------------+
| Error | 1296 | Got error 4009 'Cluster Failure' from NDB | 
| Error | 1528 | Failed to create LOGFILE GROUP            | 
+-------+------+-------------------------------------------+
2 rows in set (0.00 sec)

Both data nodes are down...

Time: Monday 28 January 2008 - 13:37:30
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: lgman.cpp
Error object: LGMAN (Line: 912) 0x0000000e
Program: ndbd
Pid: 24033
Trace: /data1/johan/mysqlcluster/ndb_2_trace.log.5
Version: mysql-5.1.23 ndb-6.3.7-beta
***EOM***

lgman.cpp (line 912) has this code:

    Page_map map(m_data_buffer_pool, ptr.p->m_buffer_pages);
    while(pages)
    {
      Uint32 ptrI;
      Uint32 cnt = pages > 64 ? 64 : pages;
      m_ctx.m_mm.alloc_pages(RG_DISK_OPERATIONS, &ptrI, &cnt, 1);
      if (cnt)
      {
        Buffer_idx range;
        range.m_ptr_i= ptrI;
        range.m_idx = cnt;

        ndbrequire(map.append((Uint32*)&range, 2));  // <-- line 912
        pages -= range.m_idx;
      }

So it seems to fail while allocating pages for disk operations. However, I think it should return an error message instead of crashing the data nodes.
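
To illustrate the behaviour asked for here, below is a minimal, self-contained sketch of the same chunked loop that reports a failed append to the caller instead of aborting. It is not the NDB code or the committed patch: PageMap, AllocResult and reserve_buffer_pages are hypothetical names, the map capacity is an arbitrary parameter, and page allocation is faked with a running counter.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a fixed-capacity map of page ranges.
class PageMap {
public:
  explicit PageMap(std::size_t capacity) : capacity_(capacity) {}

  // Returns false when the map is full, leaving the decision to the caller.
  bool append(std::uint32_t ptr_i, std::uint32_t cnt) {
    if (ranges_.size() >= capacity_) return false;
    ranges_.push_back({ptr_i, cnt});
    return true;
  }

  std::size_t size() const { return ranges_.size(); }

private:
  struct Range {
    std::uint32_t ptr_i;  // index of the first page in the range
    std::uint32_t cnt;    // number of pages in the range
  };
  std::size_t capacity_;
  std::vector<Range> ranges_;
};

enum class AllocResult { Ok, OutOfMapSpace };

// Reserve `pages` pages in chunks of at most 64, as in the excerpt above,
// but return an error code where the excerpt calls ndbrequire().
// A real fix would also have to release the pages reserved so far.
AllocResult reserve_buffer_pages(PageMap& map, std::uint32_t pages) {
  std::uint32_t next_ptr_i = 0;  // fake page index; no real allocation here
  while (pages > 0) {
    const std::uint32_t cnt = pages > 64 ? 64 : pages;
    if (!map.append(next_ptr_i, cnt)) return AllocResult::OutOfMapSpace;
    next_ptr_i += cnt;
    pages -= cnt;
  }
  return AllocResult::Ok;
}
```

With a map that holds two ranges, reserving 128 pages succeeds (two chunks of 64), while reserving 129 pages needs a third range and returns AllocResult::OutOfMapSpace, which the caller could turn into an SQL-level error.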

Moreover, a subsequent system restart fails:

Time: Monday 28 January 2008 - 13:45:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 3560) 0x0000000a
Program: ndbd
Pid: 24121
Trace: /data1/johan/mysqlcluster/ndb_2_trace.log.7
Version: mysql-5.1.23 ndb-6.3.7-beta
***EOM***

How to repeat:
Have:
SharedGlobalMemory=256M

Create a logfile group with a fairly large UNDO_BUFFER_SIZE:

CREATE LOGFILE GROUP lg_1 ADD UNDOFILE '/data0/johan/mysqlcluster/undo_1.dat' INITIAL_SIZE=4096M UNDO_BUFFER_SIZE=192M ENGINE=ndb;

Suggested fix:
-
[13 May 2008 22:00] Hartmut Holzgraefe
Still reproducible with ndb-6.3.13.
[13 May 2008 22:06] Hartmut Holzgraefe
Restart problem reported as new bug #36702
[19 Feb 2009 10:01] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/66857

2846 Jonas Oreland	2009-02-19
      ndb - bug#34102 - lgman crashed if using more than 150M undo-buffer-memory, increase limit to 600M and don't crash
[19 Feb 2009 10:32] Bugs System
Pushed into 5.1.32-ndb-6.2.17 (revid:jonas@mysql.com-20090219100101-thq39n075vk91jj2) (version source revid:jonas@mysql.com-20090219100101-thq39n075vk91jj2) (merge vers: 5.1.32-ndb-6.2.17) (pib:6)
[19 Feb 2009 10:33] Bugs System
Pushed into 5.1.32-ndb-6.4.3 (revid:jonas@mysql.com-20090219101945-mi9ni9z66ctoswbi) (version source revid:jonas@mysql.com-20090219101945-mi9ni9z66ctoswbi) (merge vers: 5.1.32-ndb-6.4.3) (pib:6)
[19 Feb 2009 10:36] Bugs System
Pushed into 5.1.32-ndb-6.3.23 (revid:jonas@mysql.com-20090219103357-fcemygrfinsopjmp) (version source revid:jonas@mysql.com-20090219100413-a1hp7s0agpgl9nxk) (merge vers: 5.1.32-ndb-6.3.23) (pib:6)
[19 Feb 2009 14:27] Jon Stephens
Documented in the NDB-6.2.17, 6.3.23, and 6.4.3 changelogs as follows:

        Trying to execute a CREATE LOGFILE GROUP statement using a value
        greater than 150M for UNDO_BUFFER_SIZE caused data nodes to
        crash.

        As a result of this fix, the upper limit for UNDO_BUFFER_SIZE is
        now 600M.

Also noted the before-and-after limits under "CREATE LOGFILE GROUP Syntax".
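
As a rough illustration of the before-and-after limits, the sketch below checks a requested UNDO_BUFFER_SIZE against each one. The function and constant names are hypothetical and this is not how the server performs the check; it also assumes that "M" means 1024 * 1024 bytes and that a value exactly at the limit is accepted.

```cpp
#include <cassert>
#include <cstdint>

// Assumption: "M" in the changelog means mebibytes.
constexpr std::uint64_t kMiB = 1024ULL * 1024ULL;

// Limits taken from the changelog entry; names are hypothetical.
constexpr std::uint64_t kOldUndoBufferLimit = 150 * kMiB;  // crash above this before the fix
constexpr std::uint64_t kNewUndoBufferLimit = 600 * kMiB;  // upper limit after the fix

// True when the requested size does not exceed the given limit
// (assumption: the limit itself is an accepted value).
constexpr bool undo_buffer_size_ok(std::uint64_t bytes, std::uint64_t limit) {
  return bytes <= limit;
}
```

Under these assumptions, the 192M value from this report exceeds the old 150M threshold but is within the new 600M limit.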