MySQL Bugs: #42052: ndbd - Received signal 6. Running error handler

Bug #42052	ndbd - Received signal 6. Running error handler
Submitted:	12 Jan 2009 14:33	Modified:	12 Oct 2009 9:47
Reporter:	Gerhard Fürnkranz
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-6.4	OS:	Solaris (Solaris 10 / Sparc)
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	6.4

Description:
During execution of a query like

UPDATE table1 AS s,
       table1 AS r
   SET s.field1 = r.field1
 WHERE r.field2 = s.field2;

(which is a rather large transaction updating
about 85000 records), ndbd crashed.

Cut-out from log file:

[...]
2009-01-07 14:54:40 [ndbd] INFO     -- Received signal 6. Running error handler.
2009-01-07 14:54:41 [ndbd] INFO     -- Signal 6 received; Abort
2009-01-07 14:54:41 [ndbd] INFO     -- main.cpp
2009-01-07 14:54:41 [ndbd] INFO     -- Error handler signal shutting down system
2009-01-07 14:54:41 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-01-07 14:54:41 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
The problem seems to be reproducible in our environment at any time. (I have no idea, however, how to reproduce it in isolation, outside our environment.)

Log files

Attachment: ndb_server_crash.tar.gz (application/x-gzip, text), 174.25 KiB.

Gerhard,

We have a guess as to what this problem is, but to verify it we would like to see a backtrace from the core file that you get.

Please let us know if you need help on how to get the backtrace.

BR,

Tomas

Sorry, I did not find any core. On the machine all core files are directed to the /TspCore directory, but unfortunately I did not find any core from ndbd there.

# coreadm
     global core file pattern:
     global core file content: default
       init core file pattern: /TspCore/core.%f.%p.%t
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: enabled
     global core dump logging: disabled

Gerhard,

Is it possible for you to configure your system so that you get core files?

BR,

Tomas

On the system we certainly do get core dumps from other processes, but we don't get a core dump from ndbd - so far I was not able to figure out why. We'll try to run ndbd under control of dbx in order to get a stack backtrace from the crash.

-Gerhard

Stack backtrace

Attachment: stack_backtrace1.txt (text/plain), 20.53 KiB.

Gerhard,

Thank you very much, it verifies that it is the error we were thinking about.  We know what it is and how to fix it.  It is triggered by large transactions such as yours. Hopefully we will be able to get it into the next release of 6.4.

BR,

Tomas

To clarify, this is a problem with all large transactions in 6.4, whether it be updates, deletes, or inserts.  All will be addressed with the same bug fix.

BR,

Tomas

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/63670

3223 Jonas Oreland	2009-01-21
      ndbmtd - 
        1) OJ optimizations developed for cmt/bw
        2) pessimistic scheduling (update_sched_config), bug#42052
           only execute signals if space exists to send to 
           other-threads in system
        3) new NdbCondition_ComputeAbsTime/NdbCondition_WaitTimeoutAbs

Pushed into 5.1.31-ndb-6.4.1 (revid:jonas@mysql.com-20090121095501-mxb7w5hi56lzp1jr) (version source revid:jonas@mysql.com-20090121095501-mxb7w5hi56lzp1jr) (merge vers: 5.1.31-ndb-6.4.1) (pib:6)

Description:
In ndbmtd, one thread could flood another thread, which would cause the
system to stop with a job-buffer-full condition (currently implemented as an abort).

This is now prevented: before starting to execute signals, ndbmtd computes how many signals the threads in the system can accept, and executes signals only if space is found.

The flood could be provoked by committing or aborting a large transaction (>50k rows) on a *single data node* running ndbmtd.
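
The idea described above can be illustrated with a small Python sketch. This is not the NDB source code and not the implementation of update_sched_config; every name in it (WorkerThread, JobBufferFull, signals_allowed, run_sender, the buffer capacity of 4) is hypothetical and chosen only for illustration. It assumes each executed signal sends a fixed number of signals to every receiving thread, and that there is at least one receiver.

```python
from collections import deque

JOB_BUFFER_CAPACITY = 4  # hypothetical; kept tiny so the effect is easy to see


class JobBufferFull(Exception):
    """Stands in for the abort triggered when a job buffer overflows."""


class WorkerThread:
    """A receiving thread with a bounded job buffer (hypothetical model)."""

    def __init__(self, name, capacity=JOB_BUFFER_CAPACITY):
        self.name = name
        self.capacity = capacity
        self.inbox = deque()

    def free_slots(self):
        return self.capacity - len(self.inbox)

    def receive(self, signal):
        if self.free_slots() <= 0:
            raise JobBufferFull(self.name)
        self.inbox.append(signal)


def signals_allowed(receivers, fanout):
    """How many signals may be executed right now without overflowing
    any receiver, if each execution sends `fanout` signals to each one.
    Assumes `receivers` is non-empty."""
    return min(r.free_slots() for r in receivers) // fanout


def run_sender(pending, receivers, fanout=1, pessimistic=True):
    """Execute pending signals and return how many were executed.

    With pessimistic=True the budget is computed up front and execution
    stops when it is used up; the remaining signals stay in `pending`.
    With pessimistic=False everything is executed, which can overflow
    a receiver and raise JobBufferFull.
    """
    budget = signals_allowed(receivers, fanout) if pessimistic else len(pending)
    executed = 0
    while pending and executed < budget:
        signal = pending.popleft()
        for receiver in receivers:
            for _ in range(fanout):
                receiver.receive(signal)
        executed += 1
    return executed
```

In this model, a sender with many pending signals and pessimistic=False overflows the receiver and raises JobBufferFull, which mirrors the crash reported here. With pessimistic=True the sender executes only as many signals as the receiver has room for and leaves the rest queued until the receiver has drained its buffer.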

Bugfix documented in the NDB-6.4.1 changelog as follows:

        When using ndbmtd, one thread could flood another thread, which
        would cause the system to stop with a job buffer full condition
        (currently implemented as an abort). This could be caused by
        committing or aborting a large transaction (50000 rows or more)
        on a single data node running ndbmtd. To prevent this from
        happening, the number of signals that can be accepted by the
        system threads is calculated before executing them, and they are
        executed only if sufficient space is found.