Bug #11620 Bug #5497 test case in func_compress.test causes cluster to core
Submitted: 28 Jun 2005 20:36 Modified: 22 Aug 2005 7:36
Reporter: Jonathan Miller Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.1.0-wl2325-wl1354-new OS:Linux (Linux)
Assigned to: Pekka Nousiainen CPU Architecture:Any

[28 Jun 2005 20:36] Jonathan Miller
Description:
Since I could not get to the NDB systems today, I decided to test mysql-test locally on my VM system. Investigating a failure in func_concat.test I found that the test actually past, so I started looking at the one before it, func_compress.test. When running the last test out of this file the cluster cores.

set @@max_allowed_packet=1048576*100;
select compress(repeat('aaaaaaaaaa', IF('$LOW_MEMORY', 10, 10000000))) is null;

The mysld error log showed:
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent

I will attach the error logs and trace files. I will move the core files to ndb under /home/jonathan/func_core

BT from each file is as follows:
core.9624
 0x00000000 in ?? ()
#1  0x00467cfd in ___newselect_nocancel () from /lib/tls/libc.so.6
#2  0x08297b5b in my_sleep (m_seconds=10000) at my_sleep.c:34
#3  0x0827d0b7 in NdbSleep_MilliSleep (milliseconds=10) at NdbSleep.c:24
#4  0x080bf465 in handler_error (signum=11) at main.cpp:370
#5  <signal handler called>
#6  0x00000000 in ?? ()
#7  0x00000000 in ?? ()

core.9590
#0  0x00412fd7 in memset () from /lib/tls/libc.so.6
#1  0xa207d000 in ?? ()
#2  0x0824e73a in Signal (this=0xa2073a5c) at VMSignal.cpp:22
#3  0x0824d64d in FastScheduler::dumpSignalMemory (this=0x8315920, output=0x9b17b30)
    at FastScheduler.cpp:398
#4  0x08257116 in WriteMessage (thrdType=ecError, thrdMessageID=0,
    thrdProblemData=0xa207bd0c "Signal 11 received", thrdObjRef=0x82aba0d "main.cpp",
    thrdTheEmulatedJamIndex=1736, thrdTheEmulatedJam=0x830ab20 "0\b")
    at ErrorReporter.cpp:319
#5  0x08256d46 in ErrorReporter::handleError (type=ecError, messageID=0,
    problemData=0xa207bd0c "Signal 11 received", objRef=0x82aba0d "main.cpp",
    nst=NST_ErrorHandlerSignal) at ErrorReporter.cpp:199
#6  0x080bf4b9 in handler_error (signum=11) at main.cpp:376
#7  <signal handler called>
#8  0x00412fd7 in memset () from /lib/tls/libc.so.6
#9  0xa2084bb0 in ?? ()
#10 0x0824e73a in Signal (this=0xa207c090) at VMSignal.cpp:22
#11 0x0824d64d in FastScheduler::dumpSignalMemory (this=0x8315920, output=0x9b17748)
    at FastScheduler.cpp:398
#12 0x08257116 in WriteMessage (thrdType=ecError, thrdMessageID=2307,
    thrdProblemData=0x82c8fbd "WatchDog terminate",
    thrdObjRef=0x82c8fb0 "WatchDog.cpp", thrdTheEmulatedJamIndex=3756,
    thrdTheEmulatedJam=0x830ab20 "0\b") at ErrorReporter.cpp:319
#13 0x08256d46 in ErrorReporter::handleError (type=ecError, messageID=2307,
    problemData=0x82c8fbd "WatchDog terminate", objRef=0x82c8fb0 "WatchDog.cpp",
---Type <return> to continue, or q <return> to quit---
    nst=NST_Watchdog) at ErrorReporter.cpp:199
#14 0x082532d7 in WatchDog::shutdownSystem (this=0x984c7a0) at WatchDog.cpp:140
#15 0x082532ac in WatchDog::run (this=0x984c7a0) at WatchDog.cpp:130
#16 0x08252edd in runWatchDog (w=0x984c7a0) at WatchDog.cpp:30
#17 0x0827d380 in ndb_thread_wrapper (_ss=0x99a3ea0) at NdbThread.c:58
#18 0x006d21d5 in start_thread () from /lib/tls/libpthread.so.0
#19 0x0046f2da in clone () from /lib/tls/libc.so.6

How to repeat:
start a two data node cluster. Start mysqld with ndb as the default engine. 
Inside the mysql client execute:
set @@max_allowed_packet=1048576*100;
select compress(repeat('aaaaaaaaaa', IF('$LOW_MEMORY', 10, 10000000))) is null;

Suggested fix:
cluster should not shutdown. Should be able to handle the query.
[28 Jun 2005 20:37] Jonathan Miller
ndb_2_error

Attachment: ndb_2_error.log (text/x-log), 867 bytes.

[28 Jun 2005 20:38] Jonathan Miller
ndb 2 trace 1

Attachment: ndb_2_trace.log.1 (application/octet-stream, text), 8.24 KiB.

[28 Jun 2005 20:38] Jonathan Miller
ndb 2 trace 2

Attachment: ndb_2_trace.log.2 (application/octet-stream, text), 8.24 KiB.

[28 Jun 2005 20:38] Jonathan Miller
ndb_3_error

Attachment: ndb_3_error.log (text/x-log), 867 bytes.

[28 Jun 2005 20:39] Jonathan Miller
ndb 3 trace 1

Attachment: ndb_3_trace.log.1 (application/octet-stream, text), 8.24 KiB.

[28 Jun 2005 20:39] Jonathan Miller
ndb 3 trace 2

Attachment: ndb_3_trace.log.2 (application/octet-stream, text), 8.24 KiB.

[6 Jul 2005 9:17] Pekka Nousiainen
The query (a test of mysqld max memory?) does not contact NDB.

It seems that mysqld used up all resources (memory, cpu). 
The db nodes worked too slowly.  This causes "watchdog" thread
to believe ndbd is stuck in a loop and terminates ndbd.
[6 Jul 2005 13:44] Jonathan Miller
shouldn't the cluster know that we have a resources and react correctly to it. Maybe putting an error in the log that we are out of resources and then shutting down gracefully?
[6 Jul 2005 13:58] Jonathan Miller
should have stated:
shouldn't the cluster know that we have a resources issue and react correctly to it. Maybe putting an error in the log that we are out of resources and then shutting down gracefully?
[6 Jul 2005 14:18] Pekka Nousiainen
> shouldn't the cluster know that we have a resources issue

It just detects that ndbd is "stuck" (does not process new signals, etc).
It could be for any reason.  The action is to shut down the node so
that it can be restarted.
[6 Jul 2005 14:26] Jonathan Miller
My point is that the cluster should be aware of system resources and know when we are running out and be proactive rather then reactive.