Bug #29390 too complex interpreted program crashes data nodes
Submitted: 27 Jun 2007 14:22 Modified: 5 Nov 2007 21:12
Reporter: Hartmut Holzgraefe
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: 5.1 OS: Linux (x86 32bit)
Assigned to: Pekka Nousiainen CPU Architecture: Any

[27 Jun 2007 14:22] Hartmut Holzgraefe
Description:
This came up as part of Bug #29185, which was about the mysqld
side of this problem. It turns out that it is not handled well on
the ndbd side either.

A large interpreted program can crash all ndbd nodes simultaneously
due to a buffer overrun in copyAttrinfo() in DbtupExecQuery.cpp

The problem here is that the program is copied into the inBuffer
without any check against that buffer's size.

- copyAttrinfo() only gets the buffer's address as a parameter
  but no size information, so it has no chance to check for
  overflows right now anyway

- copyAttrinfo() doesn't have any way to report errors back
  to its caller yet either
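
The two points above could be addressed together by a copy routine that is told the destination capacity and returns a status. The following is only a minimal sketch of that idea: copyWordsChecked() and CopyResult are hypothetical names, not the actual Dbtup interface, and the real copyAttrinfo() works on operation records rather than plain arrays.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical status type; these are not the NDB kernel's real error codes.
enum class CopyResult { Ok, TooLarge };

// Copy srcWords 32-bit words into dst, which can hold dstCapacityWords.
// Returns TooLarge and leaves dst untouched if the data would not fit.
CopyResult copyWordsChecked(std::uint32_t* dst, std::size_t dstCapacityWords,
                            const std::uint32_t* src, std::size_t srcWords)
{
  assert(dst != nullptr && src != nullptr);
  if (srcWords > dstCapacityWords)
    return CopyResult::TooLarge;  // caller can map this to a permanent error
  std::memcpy(dst, src, srcWords * sizeof(std::uint32_t));
  return CopyResult::Ok;
}
```

With a signature like this the caller decides how to fail, instead of memcpy() writing past the end of the buffer.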

A large program can be created by a very complex NdbScanFilter
for example. NdbScanFilter has no means to handle the situation
either, as it doesn't know about the inBuffer size at all.

Even if execTUPKEYREQ(), which is the only caller of copyAttrinfo(),
did error handling, this would only happen in the execution
phase, so the best it could do would be to return a permanent
error to make the transaction's execute() fail.

During the preparation phase where the NdbScanFilter is set
up there still wouldn't be any error indicating that the filter
became too large/complex.

And even if NdbScanFilter handled this, there would still
be a problem with engine_condition_pushdown handling, as the
scan filter is only constructed in the various handler scan
methods. Within the handler's cond_push() method it would not
be clear whether a limit was hit or not, for the simple reason
that the NdbScanFilter has not been created yet at that point.

How to repeat:
With condition pushdown enabled, run a query with a *large* IN(...,...,...) list.
(Remember that in MySQL the size of the IN() argument list is only limited by max_allowed_packet, and so by the maximum possible length of an SQL query string. There is no artificial limit like in Oracle, where you may only have up to 1000 IN() argument entries.)
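
A query of that shape is tedious to type by hand, so here is a small sketch of a generator. The helper name buildLargeInQuery() and the table and column names passed to it are placeholders, and the number of values needed to trigger the crash is not stated in this report, so n has to be found by experiment.

```cpp
#include <cstddef>
#include <string>

// Build "SELECT * FROM <table> WHERE <column> IN (1,2,...,n)".
// Table and column names are placeholders chosen by the caller;
// n should be at least 1, since an empty IN () list is not valid SQL.
std::string buildLargeInQuery(const std::string& table,
                              const std::string& column,
                              std::size_t n)
{
  std::string q = "SELECT * FROM " + table + " WHERE " + column + " IN (";
  for (std::size_t i = 1; i <= n; ++i) {
    if (i > 1) q += ',';        // separator between values
    q += std::to_string(i);
  }
  q += ')';
  return q;
}
```

The resulting string can be written to a file and fed to the mysql client with engine_condition_pushdown switched on.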

On our 64bit test systems this somehow worked fine.
On 32bit x86, though, it easily runs into problems that
cause all nodes to die with error code 6000, as they
all receive a segfault at the same time:

#0  0xffffe410 in __kernel_vsyscall ()
#1  0x400db541 in raise () from /lib/tls/libc.so.6
#2  0x400dcdbb in abort () from /lib/tls/libc.so.6
#3  0x080e4e16 in childAbort (code=-1, currentStartPhase=255) at main.cpp:104
#4  0x08302958 in NdbShutdown (type=NST_ErrorHandlerSignal, restartType=NRT_Default) at Emulator.cpp:254
#5  0x0830d1dc in ErrorReporter::handleError (messageID=6000, problemData=0xbfa8aea8 "Signal 11 received; Segmentation fault", 
    objRef=0x838333b "main.cpp", nst=NST_ErrorHandlerSignal) at ErrorReporter.cpp:210
#6  0x080e6143 in handler_error (signum=11) at main.cpp:639
#7  <signal handler called>
#8  0x4011f58c in memcpy () from /lib/tls/libc.so.6
#9  0x082347c6 in Dbtup::copyAttrinfo (this=0x407c8008, regOperPtr=0x52d31154, inBuffer=0x407f1c88) at DbtupExecQuery.cpp:82
#10 0x0823c2b6 in Dbtup::execTUPKEYREQ (this=0x407c8008, signal=0x84f2e5c) at DbtupExecQuery.cpp:681
#11 0x0810bbe7 in SimulatedBlock::executeFunction (this=0x407c8008, gsn=436, signal=0x84f2e5c) at SimulatedBlock.hpp:577
#12 0x0810bd7d in SimulatedBlock::EXECUTE_DIRECT (this=0x85b69c0, block=249, gsn=436, signal=0x84f2e5c, len=18) at SimulatedBlock.hpp:752
#13 0x081cb9d6 in Dblqh::next_scanconf_tupkeyreq (this=0x85b69c0, signal=0x84f2e5c, scanPtr={p = 0x86a1630, i = 0}, regTcPtr=0x5c0bb254, 
    fragPtrP=0x5afd4f20, disk_page=4294967040) at DblqhMain.cpp:9175
#14 0x081ecf9d in Dblqh::nextScanConfLoopLab (this=0x85b69c0, signal=0x84f2e5c) at DblqhMain.cpp:9074
#15 0x081ed49f in Dblqh::nextScanConfScanLab (this=0x85b69c0, signal=0x84f2e5c) at DblqhMain.cpp:9039
#16 0x081ed7cc in Dblqh::execNEXT_SCANCONF (this=0x85b69c0, signal=0x84f2e5c) at DblqhMain.cpp:7772
#17 0x0810bbe7 in SimulatedBlock::executeFunction (this=0x85b69c0, gsn=330, signal=0x84f2e5c) at SimulatedBlock.hpp:577
#18 0x0810bd7d in SimulatedBlock::EXECUTE_DIRECT (this=0x85ca858, block=247, gsn=330, signal=0x84f2e5c, len=6) at SimulatedBlock.hpp:752
#19 0x082d707f in Dbtux::execACC_CHECK_SCAN (this=0x85ca858, signal=0x84f2e5c) at DbtuxScan.cpp:557
#20 0x0810bbe7 in SimulatedBlock::executeFunction (this=0x85ca858, gsn=72, signal=0x84f2e5c) at SimulatedBlock.hpp:577
#21 0x0810bd7d in SimulatedBlock::EXECUTE_DIRECT (this=0x85ca858, block=258, gsn=72, signal=0x84f2e5c, len=2) at SimulatedBlock.hpp:752
#22 0x082d7d1f in Dbtux::execNEXT_SCANREQ (this=0x85ca858, signal=0x84f2e5c) at DbtuxScan.cpp:380
#23 0x0810bbe7 in SimulatedBlock::executeFunction (this=0x85ca858, gsn=332, signal=0x84f2e5c) at SimulatedBlock.hpp:577
#24 0x082ff0c9 in FastScheduler::doJob (this=0x84effc0) at FastScheduler.cpp:136
#25 0x083002cd in ThreadConfig::ipControlLoop (this=0x8503c60) at ThreadConfig.cpp:153
#26 0x080e5fa2 in main (argc=1, argv=0xbfa8bfa4) at main.cpp:473
(gdb) Quit

Suggested fix:
Possible solutions:

- make the program buffer in DbtupExecQuery.cpp dynamically
  sized/allocated instead of the static buffer we have now
  => would resolve all problems at once as the limitation
  is lifted altogether, but violates the "all memory is allocated
  statically" scheme?

- just fail in execTUPKEYREQ() - would prevent the cluster
  system crash but would lead to unexecutable queries

- handle it in NdbScanFilter, too - still too late in the
  game to disable condition pushdown for this query, 
  NdbScanFilter is only constructed *after* cond_push()
  is called

- somehow handle this in cond_push() already, e.g. by
  introducing a hard coded artificial limit in there?
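
To make the last option concrete, here is a rough sketch of what such an artificial limit could look like. Everything in it is an assumption for illustration: canPushInList() is a hypothetical helper, and both constants are made-up numbers, since neither the real buffer size nor the per-value program cost is given in this report.

```cpp
#include <cstddef>

// Assumed cap on the interpreted program size, in 32-bit words.
// Made-up value; the real Dbtup buffer size is not given in this report.
constexpr std::size_t kAssumedMaxProgramWords = 1024;

// Assumed program cost of one IN() value, in words. Also made up.
constexpr std::size_t kAssumedWordsPerInValue = 2;

// Decide at cond_push() time whether an IN list with valueCount
// entries is small enough to push down to the data nodes.
bool canPushInList(std::size_t valueCount)
{
  // Divide instead of multiplying so a huge valueCount cannot overflow.
  return valueCount <= kAssumedMaxProgramWords / kAssumedWordsPerInValue;
}
```

If the check fails, the condition would simply not be pushed and mysqld would evaluate it itself, so the query still runs.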
[3 Aug 2007 16:41] Magnus Blåudd
Small test file

Attachment: ndb_ms.test (application/octet-stream, text), 77.21 KiB.

[3 Aug 2007 16:49] Magnus Blåudd
Had a look at this problem and created a small test case (with a large query). The query will create a scan on a table with an IN on a non-indexed column - thus creating many ATTRINFO signals.

With a number of values in the IN that is just above the limit, Dbtc will actually return error 207 (ZLENGTH_ERROR), but when increasing the number of values further the crash will occur in DbtupExecQuery.cpp

The max number of words in ATTRINFO should be limited to a 16 bit value and thus the static buffer should be enough. But it wraps around somehow.
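
For illustration only, this is what a wrap of a 16-bit word count looks like in general. It is not taken from the NDB source and I have not confirmed that this is the actual mechanism here; accumulate16() is a hypothetical helper and the 20 words per signal used in the example below is an arbitrary number.

```cpp
#include <cstdint>

// Add `signals` chunks of `wordsPerSignal` words to a 16-bit running
// total, the way a length field of that width would accumulate them.
std::uint16_t accumulate16(std::uint32_t signals, std::uint16_t wordsPerSignal)
{
  std::uint16_t total = 0;
  for (std::uint32_t i = 0; i < signals; ++i) {
    // The cast makes the truncation explicit: the sum wraps modulo 65536.
    total = static_cast<std::uint16_t>(total + wordsPerSignal);
  }
  return total;
}
```

With 20 words per signal, 3276 signals give a total of 65520, while 3277 signals give 4 instead of 65540. A length check against such a counter would pass even though far more data has arrived than the counter says.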

The NdbApi, Dbtc, and all the other blocks receiving the ATTRINFOs should check this limit.

Also noted that the ATTRINFO are copied into the buffer for each record in the scan - oops! Although I think that is actually by design.
[4 Oct 2007 9:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34882

ChangeSet@1.2485, 2007-10-04 11:32:49+02:00, pekka@sama.ndb.mysql.com +10 -0
  ndb - bug#29390: if ScanFilter is too large, abort or optionally discard it
[14 Oct 2007 14:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/35527

ChangeSet@1.2486, 2007-10-14 16:17:39+02:00, pekka@sama.ndb.mysql.com +1 -0
  ndb - bug#29390: fix mem leak introduced in previous cset
[15 Oct 2007 18:02] Jon Stephens
Documented in mysql-5.1-ndb-6.3.4 changelog as:

            Interpreted programs of sufficient size and complexity could
            cause all cluster data nodes to shut down due to buffer
            overruns.

Left status as Patch Pending.
[5 Nov 2007 13:53] Bugs System
Pushed into 6.0.4-alpha
[5 Nov 2007 13:56] Bugs System
Pushed into 5.1.23-rc
[5 Nov 2007 13:58] Bugs System
Pushed into 5.0.52
[5 Nov 2007 21:12] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented fix in 5.0.52, 5.1.23, and 6.0.4 changelogs. Closed.