| Bug #53456 | ndbd angel does not check for fork() errors | ||
|---|---|---|---|
| Submitted: | 6 May 2010 7:10 | Modified: | 4 Nov 2010 14:13 |
| Reporter: | Hartmut Holzgraefe | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
| Version: | mysql-5.1-telco-6.3 | OS: | Any |
| Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
| Tags: | mysql-cluster-6.3.24 (all cluster) | ||
[10 May 2010 18:08]
Magnus Blåudd
Will implement a retry_fork function that will retry the fork 10 times with 1 second interval before gving up. The retries as well as the final give up will be logged to stdout.
[11 May 2010 6:42]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/107912
[31 Aug 2010 6:03]
Jonas Oreland
is this pushed ?
[3 Nov 2010 11:36]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/122644 3922 Magnus Blåudd 2010-11-03 [merge] Bug#53456 ndbd angel does not check for fork() errors - updated for 7.0+
[3 Nov 2010 11:48]
Bugs System
Pushed into mysql-5.1-telco-6.3 5.1.51-ndb-6.3.39 (revid:magnus.blaudd@sun.com-20101103110050-28u6qhfgytvb39ur) (version source revid:magnus.blaudd@sun.com-20101103110050-28u6qhfgytvb39ur) (merge vers: 5.1.51-ndb-6.3.39) (pib:21)
[3 Nov 2010 11:48]
Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:magnus.blaudd@sun.com-20101103113445-vukzn3vw89ioqfo8) (version source revid:magnus.blaudd@sun.com-20101103113445-vukzn3vw89ioqfo8) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[3 Nov 2010 11:50]
Magnus Blåudd
Pushed to 6.3.39, 7.0.20 and 7.1.9
[4 Nov 2010 14:13]
Jon Stephens
Documented bugfix in the NDB-6.3.39, 7.0.20, and 7.1.9 changelogs, as follows:
When a data node angel process failed to fork off a new worker
process (to replace one that had failed), the failure was not
handled. This meant that the angel process either transformed
itself into a worker process, or itself failed. In the first
case, the data node continued to run, but there was no longer
any angel to restart it in the event of failure, even with
StopOnError set to 0.
Closed.

Description: Looking at the angel_run() function in storage/ndb/src/kernel/angel.cc 274 while (true) 275 { ... 297 if ((child=fork()) <= 0) 298 break; // child or error ... 394 } the only way to leave the while loop without terminating the process or returning control to angel_run()s caller is a zero or negative fork() result, zero indicating that the current process is the fork child and negative values meaning that the fork() failed After the while(true) loop we have the following checks before the function returns: 396 if (child >= 0) 397 g_eventLogger->info("Angel pid: %d ndb pid: %d", getppid(), getpid()); 398 else if (child > 0) 399 g_eventLogger->info("Ndb pid: %d", getpid()); As we are leaving the loop on zero or negative fork() results only a >0 value of "child" can never happen here. And even if >0 values were possible the "else if" path would never be entered as any value >0 would already be caught by the first "if". Negative error code values are never checked for, so if the fork() of a new worker process fails the angel process either just dies silently or transforms itself from angel to worker so that there is no active angel anymore and any further node failure will then no longer lead to a restart even with StopOnError=false How to repeat: see code in src/kernel/angel.cc Suggested fix: - fix fork() return value checking logic - check for and handle negative fork() result values - log failed fork() attempts and the error code returned - e.g. take failed fork() as failed startup and retry, maybe after short delay, and count failed fork() attempts in failed_startups, or add a separate counter for these