Bug #49589 Job Buffer Full aborts in mt.cpp
Submitted: 10 Dec 2009 15:35 Modified: 28 Dec 2009 18:26
Reporter: Andrew Hutchings
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: mysql-5.1-telco-7.0 OS: Any
Assigned to: Jonas Oreland CPU Architecture: Any

[10 Dec 2009 15:35] Andrew Hutchings
Description:
When a job buffer full condition occurs, the node fails with abort(). This gives us very little diagnostic information to work with beyond the fact that "job buffer full" caused the failure.

How to repeat:
No test case yet; see kernel/vm/mt.cpp:

static
void
job_buffer_full()
{
  ndbout_c("job buffer full");
  abort();
}

static
thr_job_buffer*
seize_buffer(struct thr_repository* rep, int thr_no, bool prioa)
{
...
        if (unlikely(cnt == 0))
        {
          job_buffer_full();
        }

Suggested fix:
1. Ideally the node would not fail at all, but this may be a needed step.
2. Change to a proper node failure with more diagnostic information in the node log file if possible.
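
To illustrate suggestion 2, here is a minimal standalone sketch of logging some context before terminating. It is not taken from mt.cpp: the JobBufferDiag struct, its fields, and both function names are hypothetical, and a real data node would presumably go through its own error reporting path rather than plain abort().

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>
#include <string>

// Hypothetical snapshot of the state worth logging. These fields are
// assumptions for illustration, not the real thr_repository layout.
struct JobBufferDiag {
  int thr_no;          // thread that failed to seize a buffer
  bool prioa;          // value of the prioa argument at the failing call
  unsigned free_cnt;   // free buffers seen at the time of failure
  unsigned total_cnt;  // total buffers in the pool
};

// Build a one-line message so it can be logged (and tested) separately
// from the act of terminating the process.
std::string describe_job_buffer_full(const JobBufferDiag& d) {
  assert(d.free_cnt <= d.total_cnt);
  char buf[160];
  std::snprintf(buf, sizeof(buf),
                "job buffer full: thr_no=%d prioa=%d free=%u total=%u",
                d.thr_no, d.prioa ? 1 : 0, d.free_cnt, d.total_cnt);
  return std::string(buf);
}

// Log the context first, then terminate (stand-in for a proper node failure).
[[noreturn]] void job_buffer_full_with_diag(const JobBufferDiag& d) {
  std::fprintf(stderr, "%s\n", describe_job_buffer_full(d).c_str());
  std::abort();
}
```

Keeping the message formatting separate from the abort makes it possible to check what would be logged without actually killing the process.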
[21 Dec 2009 11:55] Jonas Oreland
Fix

Attachment: bug49589.tgz (application/x-compressed-tar, text), 2.65 KiB.

[21 Dec 2009 14:30] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/95251

3301 Jonas Oreland	2009-12-21
      ndb - bug#49589 - change Dbtc not to send too many commit/complete which can lead to job-buffer-full specially for ndbmtd with 1-datanode
[21 Dec 2009 14:37] Jonas Oreland
pushed to 7.0.11
[21 Dec 2009 14:37] Bugs System
Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091221143437-5yoz2xr89h6diz7b) (version source revid:jonas@mysql.com-20091221143437-5yoz2xr89h6diz7b) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)
[28 Dec 2009 18:26] Jon Stephens
Documented bugfix in the NDB-7.0.11 changelog as follows:

        Under some circumstances, the DBTC kernel block could send an
        excessive number of commit and complete messages which could
        lead to the job buffer filling up and node failure. This was
        especially likely to occur when using ndbmtd with a single data
        node.
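
The attached patch is not reproduced in this report. As a generic illustration only of the idea in the changelog entry (limiting how many messages are emitted at once so a burst cannot fill a fixed-size buffer), a sketch might look like the following. Msg, ThrottledSender and max_per_round are hypothetical names, and this is not DBTC's actual logic.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Hypothetical message type; stands in for a commit or complete message.
struct Msg { int txn_id; };

// Sends at most max_per_round messages each time flush() is called and keeps
// the rest queued, so a burst cannot fill the receiver's buffer in one go.
class ThrottledSender {
public:
  explicit ThrottledSender(std::size_t max_per_round)
      : max_per_round_(max_per_round) {
    assert(max_per_round_ > 0);
  }

  void enqueue(const Msg& m) { pending_.push_back(m); }

  // Moves up to max_per_round_ messages into 'out' (the stand-in for the
  // job buffer) and returns how many were sent this round.
  std::size_t flush(std::vector<Msg>& out) {
    std::size_t sent = 0;
    while (sent < max_per_round_ && !pending_.empty()) {
      out.push_back(pending_.front());
      pending_.pop_front();
      ++sent;
    }
    return sent;
  }

  std::size_t pending() const { return pending_.size(); }

private:
  std::size_t max_per_round_;
  std::deque<Msg> pending_;
};
```

With a limit of 2 and five queued messages, three calls to flush() are needed to drain the queue, and the messages arrive in their original order.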

Closed.