Bug #42052 ndbd - Received signal 6. Running error handler
Submitted: 12 Jan 2009 14:33
Modified: 12 Oct 2009 9:47
Reporter: Gerhard Fürnkranz
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: mysql-5.1-telco-6.4
OS: Solaris (Solaris 10 / Sparc)
Assigned to: Jonas Oreland
CPU Architecture: Any
Tags: 6.4

[12 Jan 2009 14:33] Gerhard Fürnkranz
Description:
During execution of a query like

UPDATE table1 AS s,
       table1 AS r
   SET s.field1 = r.field1
 WHERE r.field2 = s.field2;

(which is a rather large transaction updating
about 85000 records) ndbd crashed.

Cut-out from log file:

[...]
2009-01-07 14:54:40 [ndbd] INFO     -- Received signal 6. Running error handler.
2009-01-07 14:54:41 [ndbd] INFO     -- Signal 6 received; Abort
2009-01-07 14:54:41 [ndbd] INFO     -- main.cpp
2009-01-07 14:54:41 [ndbd] INFO     -- Error handler signal shutting down system
2009-01-07 14:54:41 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-01-07 14:54:41 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 6. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
The problem seems to be reproducible in our environment at any time (I have no idea, however, how to reproduce it in isolation, outside of our environment).
[12 Jan 2009 14:37] Gerhard Fürnkranz
Log files

Attachment: ndb_server_crash.tar.gz (application/x-gzip, text), 174.25 KiB.

[12 Jan 2009 15:41] Tomas Ulin
Gerhard,

We have a guess as to what this problem is, but to verify it we would like to see a backtrace from the core file that you get.

Please let us know if you need help on how to get the backtrace.

BR,

Tomas
[12 Jan 2009 16:02] Gerhard Fürnkranz
Sorry, I did not find any core. On the machine all core files are directed to the /TspCore directory, but unfortunately I did not find any core from ndbd there.

# coreadm
     global core file pattern:
     global core file content: default
       init core file pattern: /TspCore/core.%f.%p.%t
       init core file content: default
            global core dumps: disabled
       per-process core dumps: enabled
      global setid core dumps: disabled
 per-process setid core dumps: enabled
     global core dump logging: disabled
[13 Jan 2009 10:13] Tomas Ulin
Gerhard,

Is it possible for you to configure your system so that you get core files?

BR,

Tomas
[13 Jan 2009 11:03] Gerhard Fürnkranz
On the system we certainly do get core dumps from other processes, but we don't get a core dump from ndbd - so far I was not able to figure out why. We'll try to run ndbd under control of dbx in order to get a stack backtrace from the crash.

-Gerhard
[13 Jan 2009 11:15] Gerhard Fürnkranz
Stack backtrace

Attachment: stack_backtrace1.txt (text/plain), 20.53 KiB.

[13 Jan 2009 12:40] Tomas Ulin
Gerhard,

Thank you very much, it verifies that it is the error we were thinking about.  We know what it is and how to fix it.  It is triggered by large transactions such as yours. Hopefully we will be able to get it into the next release of 6.4.

BR,

Tomas
[14 Jan 2009 19:44] Tomas Ulin
To clarify, this is a problem with all large transactions in 6.4, whether it be updates, deletes, or inserts.  All will be addressed with the same bug fix.

BR,

Tomas
[21 Jan 2009 9:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/63670

3223 Jonas Oreland	2009-01-21
      ndbmtd - 
        1) OJ optimizations developed for cmt/bw
        2) pessimistic scheduling (update_sched_config), bug#42052
           only execute signals if space exists to send to 
           other-threads in system
        3) new NdbCondition_ComputeAbsTime/NdbCondition_WaitTimeoutAbs
[21 Jan 2009 9:55] Bugs System
Pushed into 5.1.31-ndb-6.4.1 (revid:jonas@mysql.com-20090121095501-mxb7w5hi56lzp1jr) (version source revid:jonas@mysql.com-20090121095501-mxb7w5hi56lzp1jr) (merge vers: 5.1.31-ndb-6.4.1) (pib:6)
[21 Jan 2009 10:03] Jonas Oreland
Description:
In ndbmtd, one thread could flood another thread, which would cause the
system to stop with job-buffer-full (currently implemented as an abort).

This is now prevented: before starting to execute signals, the number of signals that the threads in the system can accept is computed, and signals are executed only if space is found.

The flood could be provoked by committing or aborting a large transaction (>50k rows) on a *single data node* ndbmtd.
[21 Jan 2009 13:09] Jon Stephens
Bugfix documented in the NDB-6.4.1 changelog as follows:

        When using ndbmtd, one thread could flood another thread, which
        would cause the system to stop with a job buffer full condition
        (currently implemented as an abort). This could be caused by
        committing or aborting a large transaction (50000 rows or more)
        on a single data node running ndbmtd. To prevent this from
        happening, the number of signals that can be accepted by the
        system threads is calculated before executing them, and they
        are executed only if sufficient space is found.
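
To illustrate the idea behind the fix described above, here is a minimal sketch in Python. It is not the ndbmtd code: the `Worker` class, the per-thread `capacity`, and the `max_fanout` parameter are all hypothetical names chosen for this example. It only shows the general technique of computing, before executing any signals, how many can run without overflowing another thread's job buffer.

```python
from collections import deque


class Worker:
    """Toy stand-in for one thread with a bounded job buffer (hypothetical)."""

    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity  # maximum number of queued signals
        self.inbox = deque()      # the job buffer
        self.peers = []           # other threads this one may send to

    def free_slots(self):
        return self.capacity - len(self.inbox)

    def signal_budget(self, max_fanout):
        # Pessimistic assumption: every executed signal may send up to
        # max_fanout signals to the peer that has the least free space.
        if not self.peers:
            return len(self.inbox)
        min_free = min(peer.free_slots() for peer in self.peers)
        return min_free // max_fanout

    def run_once(self, handler, max_fanout=1):
        # Compute the budget first, then execute at most that many signals.
        budget = min(self.signal_budget(max_fanout), len(self.inbox))
        for _ in range(budget):
            signal = self.inbox.popleft()
            for target, payload in handler(signal):
                if target.free_slots() <= 0:
                    # Corresponds to the "job buffer full" condition.
                    raise RuntimeError("job buffer full on " + target.name)
                target.inbox.append(payload)
        return budget


# Usage: thread "a" holds 10 signals, each forwarding one signal to "b",
# whose buffer only has room for 4.
a = Worker("a", capacity=16)
b = Worker("b", capacity=4)
a.peers, b.peers = [b], [a]
a.inbox.extend(range(10))

executed = a.run_once(lambda s: [(b, s)])
```

In this sketch, a sender that would otherwise overflow the receiver simply executes fewer signals in that round and retries once the receiver has drained its buffer, instead of hitting the buffer-full condition.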