MySQL Bugs: #11620: Bug #5497 test case in func_compress.test causes cluster to core

Bug #11620	Bug #5497 test case in func_compress.test causes cluster to core
Submitted:	28 Jun 2005 20:36	Modified:	22 Aug 2005 7:36
Reporter:	Jonathan Miller	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.1.0-wl2325-wl1354-new	OS:	Linux (Linux)
Assigned to:	Pekka Nousiainen	CPU Architecture:	Any

Description:
Since I could not get to the NDB systems today, I decided to test mysql-test locally on my VM system. Investigating a failure in func_concat.test I found that the test actually past, so I started looking at the one before it, func_compress.test. When running the last test out of this file the cluster cores.

set @@max_allowed_packet=1048576*100;
select compress(repeat('aaaaaaaaaa', IF('$LOW_MEMORY', 10, 10000000))) is null;

The mysld error log showed:
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent

I will attach the error logs and trace files. I will move the core files to ndb under /home/jonathan/func_core

BT from each file is as follows:
core.9624
 0x00000000 in ?? ()
#1  0x00467cfd in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x08297b5b in my_sleep (m_seconds=10000) at my_sleep.c:34
#3  0x0827d0b7 in NdbSleep_MilliSleep (milliseconds=10) at NdbSleep.c:24
#4  0x080bf465 in handler_error (signum=11) at main.cpp:370
#5  <signal handler called>
#6  0x00000000 in ?? ()
#7  0x00000000 in ?? ()

core.9590
#0  0x00412fd7 in memset () from /lib/tls/libc.so.6
#1  0xa207d000 in ?? ()
#2  0x0824e73a in Signal (this=0xa2073a5c) at VMSignal.cpp:22
#3  0x0824d64d in FastScheduler::dumpSignalMemory (this=0x8315920, output=0x9b17b30)
    at FastScheduler.cpp:398
#4  0x08257116 in WriteMessage (thrdType=ecError, thrdMessageID=0,
    thrdProblemData=0xa207bd0c "Signal 11 received", thrdObjRef=0x82aba0d "main.cpp",
    thrdTheEmulatedJamIndex=1736, thrdTheEmulatedJam=0x830ab20 "0\b")
    at ErrorReporter.cpp:319
#5  0x08256d46 in ErrorReporter::handleError (type=ecError, messageID=0,
    problemData=0xa207bd0c "Signal 11 received", objRef=0x82aba0d "main.cpp",
    nst=NST_ErrorHandlerSignal) at ErrorReporter.cpp:199
#6  0x080bf4b9 in handler_error (signum=11) at main.cpp:376
#7  <signal handler called>
#8  0x00412fd7 in memset () from /lib/tls/libc.so.6
#9  0xa2084bb0 in ?? ()
#10 0x0824e73a in Signal (this=0xa207c090) at VMSignal.cpp:22
#11 0x0824d64d in FastScheduler::dumpSignalMemory (this=0x8315920, output=0x9b17748)
    at FastScheduler.cpp:398
#12 0x08257116 in WriteMessage (thrdType=ecError, thrdMessageID=2307,
    thrdProblemData=0x82c8fbd "WatchDog terminate",
    thrdObjRef=0x82c8fb0 "WatchDog.cpp", thrdTheEmulatedJamIndex=3756,
    thrdTheEmulatedJam=0x830ab20 "0\b") at ErrorReporter.cpp:319
#13 0x08256d46 in ErrorReporter::handleError (type=ecError, messageID=2307,
    problemData=0x82c8fbd "WatchDog terminate", objRef=0x82c8fb0 "WatchDog.cpp",
---Type <return> to continue, or q <return> to quit---
    nst=NST_Watchdog) at ErrorReporter.cpp:199
#14 0x082532d7 in WatchDog::shutdownSystem (this=0x984c7a0) at WatchDog.cpp:140
#15 0x082532ac in WatchDog::run (this=0x984c7a0) at WatchDog.cpp:130
#16 0x08252edd in runWatchDog (w=0x984c7a0) at WatchDog.cpp:30
#17 0x0827d380 in ndb_thread_wrapper (_ss=0x99a3ea0) at NdbThread.c:58
#18 0x006d21d5 in start_thread () from /lib/tls/libpthread.so.0
#19 0x0046f2da in clone () from /lib/tls/libc.so.6

How to repeat:
start a two data node cluster. Start mysqld with ndb as the default engine. 
Inside the mysql client execute:
set @@max_allowed_packet=1048576*100;
select compress(repeat('aaaaaaaaaa', IF('$LOW_MEMORY', 10, 10000000))) is null;

Suggested fix:
cluster should not shutdown. Should be able to handle the query.

ndb_2_error

Attachment: ndb_2_error.log (text/x-log), 867 bytes.

ndb 2 trace 1

Attachment: ndb_2_trace.log.1 (application/octet-stream, text), 8.24 KiB.

ndb 2 trace 2

Attachment: ndb_2_trace.log.2 (application/octet-stream, text), 8.24 KiB.

ndb_3_error

Attachment: ndb_3_error.log (text/x-log), 867 bytes.

ndb 3 trace 1

Attachment: ndb_3_trace.log.1 (application/octet-stream, text), 8.24 KiB.

ndb 3 trace 2

Attachment: ndb_3_trace.log.2 (application/octet-stream, text), 8.24 KiB.

The query (a test of mysqld max memory?) does not contact NDB.

It seems that mysqld used up all resources (memory, cpu). 
The db nodes worked too slowly.  This causes "watchdog" thread
to believe ndbd is stuck in a loop and terminates ndbd.

shouldn't the cluster know that we have a resources and react correctly to it. Maybe putting an error in the log that we are out of resources and then shutting down gracefully?

should have stated:
shouldn't the cluster know that we have a resources issue and react correctly to it. Maybe putting an error in the log that we are out of resources and then shutting down gracefully?

> shouldn't the cluster know that we have a resources issue

It just detects that ndbd is "stuck" (does not process new signals, etc).
It could be for any reason.  The action is to shut down the node so
that it can be restarted.

My point is that the cluster should be aware of system resources and know when we are running out and be proactive rather then reactive.