Bug #53456 ndbd angel does not check for fork() errors
Submitted: 6 May 2010 7:10 Modified: 4 Nov 2010 14:13
Reporter: Hartmut Holzgraefe Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:mysql-5.1-telco-6.3 OS:Any
Assigned to: Magnus Blåudd CPU Architecture:Any
Tags: mysql-cluster-6.3.24 (all cluster)

[6 May 2010 7:10] Hartmut Holzgraefe
Description:
Looking at the angel_run() function in storage/ndb/src/kernel/angel.cc

    274   while (true)
    275   {
    ...
    297     if ((child=fork()) <= 0)
    298       break; // child or error
    ...
    394   }

the only way to leave the while loop without terminating the process
or returning control to angel_run()s caller is a zero or negative
fork() result, zero indicating that the current process is the fork
child and negative values meaning that the fork() failed

After the while(true) loop we have the following checks before the
function returns:
   
    396   if (child >= 0)
    397     g_eventLogger->info("Angel pid: %d ndb pid: %d", getppid(), getpid());
    398   else if (child > 0)
    399     g_eventLogger->info("Ndb pid: %d", getpid());
 
As we are leaving the loop on zero or negative fork() results only
a >0 value of "child" can never happen here. And even if >0 values
were possible the "else if" path would never be entered as any 
value >0 would already be caught by the first "if".

Negative error code values are never checked for, so if the fork()
of a new worker process fails the angel process either just dies
silently or transforms itself from angel to worker so that there
is no active angel anymore and any further node failure will then
no longer lead to a restart even with StopOnError=false

How to repeat:
see code in src/kernel/angel.cc

Suggested fix:
- fix fork() return value checking logic
- check for and handle negative fork() result values
  - log failed fork() attempts and the error code returned
  - e.g. take failed fork() as failed startup and retry,
    maybe after short delay, and count failed fork()
    attempts in failed_startups, or add a separate 
    counter for these
[10 May 2010 18:08] Magnus Blåudd
Will implement a retry_fork function that will retry the fork 10 times with 1 second interval before gving up. The retries as well as the final give up will be logged to stdout.
[11 May 2010 6:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/107912
[31 Aug 2010 6:03] Jonas Oreland
is this pushed ?
[3 Nov 2010 11:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/122644

3922 Magnus Blåudd	2010-11-03 [merge]
      Bug#53456 ndbd angel does not check for fork() errors
       - updated for 7.0+
[3 Nov 2010 11:48] Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:magnus.blaudd@sun.com-20101103110050-28u6qhfgytvb39ur) (version source revid:magnus.blaudd@sun.com-20101103110050-28u6qhfgytvb39ur) (merge vers: 5.1.51-ndb-6.3.39) (pib:21)
[3 Nov 2010 11:48] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:magnus.blaudd@sun.com-20101103113445-vukzn3vw89ioqfo8) (version source revid:magnus.blaudd@sun.com-20101103113445-vukzn3vw89ioqfo8) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[3 Nov 2010 11:50] Magnus Blåudd
Pushed to 6.3.39, 7.0.20 and 7.1.9
[4 Nov 2010 14:13] Jon Stephens
Documented bugfix in the NDB-6.3.39, 7.0.20, and 7.1.9 changelogs, as follows:

        When a data node angel process failed to fork off a new worker
        process (to replace one that had failed), the failure was not
        handled. This meant that the angel process either transformed
        itself into a worker process, or itself failed. In the first
        case, the data node continued to run, but there was no longer
        any angel to restart it in the event of failure, even with
        StopOnError set to 0.

Closed.