Bug #46320 CPCD: checking if processes are running is fragile, and depending on scheduling
Submitted: 21 Jul 2009 11:59 Modified: 10 Feb 2010 4:17
Reporter: Jørgen Austvik Email Updates:
Status: Won't fix Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version: OS:Any
Assigned to: Assigned Account CPU Architecture:Any

[21 Jul 2009 11:59] Jørgen Austvik
Description:
In CPCD::Process::start():

 switch(m_processType){
  case TEMPORARY:{
  ..
  do_exec();
  ..
  if(isRunning()){
    m_status = RUNNING;
    return 0;
  }
  m_status = STOPPED;
  return -1;

ATRT log:

2009-07-20 19:12:25 [ndb_atrt] DEBUG    -- system(atrt-setup.sh tyr16 /export/home2/tmp/ndb/cr_autotest/run-2-dn-mysql-5.1-telco-6.3/run/ndb_api.1/ /export/home2/tmp/ndb/cr_autotest/run-2-dn-mysql-5.1-telco-6.3/run/ndb_api.1)
2009-07-20 19:12:26 [ndb_atrt] ERROR    -- Unable to start process: Failed to start

CPCD log:

2009-07-20 19:12:26 [ndb_cpcd] INFO     -- Starting 2906: 4-ndb_api
2009-07-20 19:12:26 [ndb_cpcd] DEBUG    -- Started temporary 2906 : pid=343
2009-07-20 19:12:27 [ndb_cpcd] DEBUG    -- kill pid: -343 : No such process
2009-07-20 19:12:27 [ndb_cpcd] INFO     -- Process list saved as 'ndb_cpcd.db'
2009-07-20 19:12:27 [ndb_cpcd] DEBUG    -- Sent SIGTERM to pid -326
2009-07-20 19:12:27 [ndb_cpcd] INFO     -- Process list saved as 'ndb_cpcd.db'
2009-07-20 19:12:27 [ndb_cpcd] CRITICAL -- Stopping process with bogus pid: -1 id: 7403

What we see is that the process is started and get PID 343. It then exits before we execute isRunning(). The isRunning( called around Process.cpp:436 returns false-1 (the process is not running any more), and CPCD::startProcess() returns "Failed to start" to the client.

How to repeat:
Use cpcd to start short lived process

Suggested fix:
I think calling isRunning() (at least for temporary processes) is wrong, since the process could be scheduled in and exit before the check.

Furthermore I think it would be good if CPCD returned a better error message, along the lines of "Failed to start '%s' in directory: '%s'. Exit status: %d"
[28 Sep 2009 4:40] jack andrews
i noticed this while porting cpcd to windows.
as i can reproduce this easily, i'll see if
there's an easy fix.
[1 Feb 2010 10:44] jack andrews
what do you think of something like this?

=== modified file 'storage/ndb/src/cw/cpcd/CPCD.hpp'
--- storage/ndb/src/cw/cpcd/CPCD.hpp	2010-01-29 14:42:03 +0000
+++ storage/ndb/src/cw/cpcd/CPCD.hpp	2010-02-01 10:33:39 +0000
@@ -56,7 +56,8 @@
   STOPPED  = 0,
   STARTING = 1,
   RUNNING  = 2,
-  STOPPING = 3
+  STOPPING = 3,
+  STARTED  = 4
 };
 
 enum ProcessType {

=== modified file 'storage/ndb/src/cw/cpcd/Process.cpp'
--- storage/ndb/src/cw/cpcd/Process.cpp	2010-01-29 14:42:03 +0000
+++ storage/ndb/src/cw/cpcd/Process.cpp	2010-02-01 10:33:39 +0000
@@ -659,6 +659,13 @@
     m_status = RUNNING;
     return 0;
   }
+
+  if(m_status == STARTED)
+  {
+    m_status = STOPPED;
+    return 0
+  }
+  
   m_status = STOPPED;
 
   return -1;
[3 Feb 2010 5:40] jack andrews
returning -1 means process starting failed.
but a process can start and finish before 
getting to this code.

--- storage/ndb/src/cw/cpcd/Process.cpp	2010-01-29 14:42:03 +0000
+++ storage/ndb/src/cw/cpcd/Process.cpp	2010-02-03 04:40:07 +0000
@@ -657,11 +657,14 @@ CPCD::Process::start() {
 
   if(isRunning()){
     m_status = RUNNING;
-    return 0;
   }
-  m_status = STOPPED;
+  else
+  {
+    logger.debug("process %d (pid %d) ran for a short time", m_id, m_pid);
+    m_status = STOPPED;
+  }
 
-  return -1;
+  return 0;
 }
 
 void
[3 Feb 2010 10:44] Magnus Blåudd
Looks promising but how do we know if the process failed to start? I.e was not short lived.
[10 Feb 2010 4:17] jack andrews
it appears that this 'feature' [short running processes
are considered failed processes] does not affect the
running of autotest.  

if the feature appears later in autotest or in other
uses of cpcd, we can look at this again.

i had a prelimiary patch.  essentially:
  . on linux, if exec returns in the child process,
    write a file named "[pid or id].failed".
    in CPCD::Process::start(), if isRunning()
    is false, check for existence of .failed
    file, returning -1, otherwise STOPPED.
  . on windows, fix is trivial.
[13 Mar 2014 13:37] Omer Barnir
This bug is not scheduled to be fixed at this time.