MySQL Bugs: #19669: Sometimes StartFailureTimeout isn't applied properly

Bug #19669	Sometimes StartFailureTimeout isn't applied properly
Submitted:	10 May 2006 10:45	Modified:	12 Sep 2006 2:18
Reporter:	Serge Kozlov	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.10 (beta)	OS:	Linux (FC4)
Assigned to:	david li	CPU Architecture:	Any

Description:
When a cluster is starting it will ignores StartFailureTimeout option if some variables in config.ini have very large values. If remove StartFailureTimeout  from config.ini and start the cluster again then one will stop because it can't allocate a memory.

How to repeat:
1. Add to config.ini following lines: 

StartFailureTimeout : 1000 # 1 sec
MaxNoOfTables: 1M

2. Start cluster.

3. Data nodes should stop in 1 sec but nothing occurs and the cluster continues starting.

I did the following tests:

Simple Discription:
1. set StartFailureTimeout & MaxNoOfTables in config.ini
   (according to the bug report)
2. set StartFailureTimeout, not set MaxNoOfTables in config.ini
   (to check whether StartFailureTimeout parameter is effective)
3. set StartFailureTimeout, not set MaxNoOfTables in config.ini,
   let it sleep 2 seconds in Dbtup::initRecords().
   (to check whether MaxNoOfTables parameter is processed before
    StartFailureTimeout)
   
Detailed Description: 

********************
* 1. set StartFailureTimeout & MaxNoOfTables in config.ini
********************

1.1 config.ini

[ndbd default]
NoOfReplicas= 2
DataMemory= 10M
IndexMemory= 10M
StartFailureTimeout=1000 # 1sec
MaxNoOfTables=1000000000 # more than the total memory of test box

[ndb_mgmd]
Id=1
HostName= 127.0.0.1
DataDir=/usr/local/mysql/data

[ndbd]
Id= 2
HostName= 127.0.0.1
DataDir= /usr/local/mysql/data/node2
... (4 ndb nodes)

1.2 start cluster 

$libexec/ndb_mgmd
$libexec/ndbd --initial

1.3 test result

ndbd starting
after n seconds ( n > StartFailureTimeout(1 second) ) 
start failed: cannot allocate memory.

ndb_2_out.log
2006-09-11 16:41:48 [ndbd] ALERT    -- Node 2: Forced node shutdown
completed. Occured during startphase 0. Initiated by signal 6. Caused by
error 2327: 'Memory allocation failure, please decrease some
configuration parameters(Configuration error). Permanent error, external
action needed'.

1.4 the call stack
(gdb) bt
#0  0x00bf8c38 in raise () from /lib/tls/i686/libc.so.6
#1  0x00bfa0b8 in abort () from /lib/tls/i686/libc.so.6
#2  0x080e0c8a in childAbort (code=-1, currentStartPhase=0) at
main.cpp:105
#3  0x082ae545 in NdbShutdown (type=NST_ErrorHandlerStartup,
restartType=NRT_Default)
    at Emulator.cpp:255
#4  0x082b799e in ErrorReporter::handleError (messageID=2327,
    problemData=0xbffff630 "DBTUP could not allocate memory for
TableDescriptor",
    objRef=0xbffff530 "Requested: 4x2820146816 = 2690652672 bytes",
nst=NST_ErrorHandlerStartup)
    at ErrorReporter.cpp:206
#5  0x082a6182 in SimulatedBlock::allocRecord (this=0xb7be4008,
type=0x8331647 "TableDescriptor",
    s=4, n=2820146816, clear=true) at SimulatedBlock.cpp:683
#6  0x0821cd90 in Dbtup::initRecords (this=0xb7be4008) at
dbtup/DbtupGen.cpp:366
#7  0x0821ca14 in Dbtup::execREAD_CONFIG_REQ (this=0xb7be4008,
signal=0x847a2a4)
    at dbtup/DbtupGen.cpp:304
#8  0x08103b5d in SimulatedBlock::executeFunction (this=0xb7be4008,
gsn=334, signal=0x847a2a4)
    at SimulatedBlock.hpp:575
#9  0x082aa72d in FastScheduler::doJob (this=0x8483bc0) at
FastScheduler.cpp:137
#10 0x082ac1a1 in ThreadConfig::ipControlLoop (this=0x848cdc0) at
ThreadConfig.cpp:175
#11 0x080e1b23 in main (argc=2, argv=0xbffffa84) at main.cpp:470

********************
* 2. set StartFailureTimeout, not set MaxNoOfTables in config.ini
********************

2.1 config.ini

[ndbd default]
NoOfReplicas= 2
DataMemory= 10M
IndexMemory= 10M
StartFailureTimeout=1000 # 1sec

[ndb_mgmd]
Id=1
HostName= 127.0.0.1
DataDir=/usr/local/mysql/data

[ndbd]
Id= 2
HostName= 127.0.0.1
DataDir= /usr/local/mysql/data/node2
... (4 ndb nodes)

2.2 start cluster 

$libexec/ndb_mgmd
$libexec/ndbd --initial

2.3 test result

ndbd starting
after 1 second ( StartFailureTimeout(1 second) ) 
start failed: timeout.

ndb_2_out.log:
2006-09-11 16:08:43 [ndbd] ALERT    -- Node 2: Forced node shutdown
completed. Occured during startphase 1. Initiated by signal 6. Caused by
error 2303: 'System error, node killed during node restart by other
node(Internal error, programming error or missing error message, please
report a bug)

2.4 the call stack

(gdb) bt
#0  0x00bf8c38 in raise () from /lib/tls/i686/libc.so.6
#1  0x00bfa0b8 in abort () from /lib/tls/i686/libc.so.6
#2  0x080e0c8a in childAbort (code=-1, currentStartPhase=1) at
main.cpp:105
#3  0x082ae545 in NdbShutdown (type=NST_ErrorHandlerStartup,
restartType=NRT_Default)
    at Emulator.cpp:255
#4  0x082b799e in ErrorReporter::handleError (messageID=2303,
    problemData=0x878bf70 "Shutting down node as total restart time
exceeds  StartFailureTimeout as set in config file  1000",
objRef=0xbffff6e0 "QMGR (Line: 172) 0x0000000e",
    nst=NST_ErrorHandlerStartup) at ErrorReporter.cpp:206
#5  0x082a6338 in SimulatedBlock::progError (this=0x8549178, line=172,
err_code=2303,
    extra=0x878bf70 "Shutting down node as total restart time exceeds
StartFailureTimeout as set in config file  1000") at
SimulatedBlock.cpp:738
#6  0x08240eca in Qmgr::execCONTINUEB (this=0x8549178, signal=0x847a2a4)
at qmgr/QmgrMain.cpp:172
#7  0x08103b5d in SimulatedBlock::executeFunction (this=0x8549178,
gsn=164, signal=0x847a2a4)
    at SimulatedBlock.hpp:575
#8  0x082aa72d in FastScheduler::doJob (this=0x8483bc0) at
FastScheduler.cpp:137
#9  0x082ac1a1 in ThreadConfig::ipControlLoop (this=0x848cdc0) at
ThreadConfig.cpp:175
#10 0x080e1b23 in main (argc=2, argv=0xbffffa74) at main.cpp:470

********************
* 3. set StartFailureTimeout, not set MaxNoOfTables in config.ini,
*    let it sleep 2 seconds in Dbtup::initRecords().
********************

3.1 config.ini

the same as test 2.

3.2 start cluster 

$libexec/ndb_mgmd
$libexec/ndbd --initial

3.3 test result

ndbd starting
after 3 second ( sleep time(2 seconds) + StartFailureTimeout(1
second) ) 
start failed: timeout.

ndb_2_out.log:

the same as test 2.

3.4 the call stack

the same as test 2.

***************
* Conclusion
***************

The StartFailureTimeout parameter is processed in CONTINUEB signal in 
QMGR block, the MaxNoOfTables in READ_CONFIG_REQ signal in DBTUP
block.

The READ_CONFIG_REQ signal is processed before CONTINUEB signal.

If set MaxNoOfTables to a very large value in config.ini, because
MaxNoOfTables is processed earlier, it will fail in error 'cannot 
allocate memory' not in 'start timeout'.

not a bug.