Bug #21721 Test suite does not start with NDB, hangs forever; problem around "ndb_mgmd"
Submitted: 18 Aug 2006 14:27 Modified: 14 Sep 2006 3:45
Reporter: Joerg Bruehe
Status: Closed
Category: MySQL Server: Tests Severity: S7 (Test Cases)
Version: 5.1.12-pre OS: Any (all)
Assigned to: Magnus Blåudd CPU Architecture: Any

[18 Aug 2006 14:27] Joerg Bruehe
Description:
This was detected in a test build of the current 5.1 tree, pulled 2006-Aug-16.

The symptom is that the build and the tests run (more or less) fine until the first test run with NDB is reached.
When this run starts, it hangs forever.
Last lines in the log:

=====
Running tests using PS + RBR + NDB
perl ./mysql-test-run.pl --ps-protocol --comment="ps+rowrepl+NDB" --mysqld=--binlog-format=row --tmpdir=/export/spare/mysqldev/tmp/my_build-sunfire100c  --master_port=3334 --slave_port=3350  --timer --force --im-port=3337 --im-mysqld1-port=3339 --im-mysqld2-port=3341 --with-ndbcluster --ndbcluster_port=9350
.....
Removing Stale Files
Installing Master Database
Installing Master Database
Installing Slave Database
Installing Slave Database
Installing Slave Database
Creating IM password file (/export/spare/mysqldev/sunfire100c/test/mysql-5.1.12-beta-solaris8-sparc/mysql-test/var/im.passwd)
Installing Im_mysqld_1 Database
Installing Im_mysqld_2 Database
Installing Master Cluster
=====

In this state, it hangs until the test script is killed.

The test script has no child processes at this moment.
Retrying manually, I could reproduce that hang.

Running it with "--verbose", I could see it loop on calls to "ndbcluster_wait_started()".
Adding debugging output showed that these calls used the "--no-contact" parameter, so it is the loop in "ndb_mgmd_start()".

This part of the script was modified after the last 5.1 build.

How to repeat:
Build and test with NDB on any platform.

Suggested fix:
1) In "mysql-test-run.pl", function "ndb_mgmd_start()", the call
      $pid= mtr_spawn($exe_ndb_mgmd, ...);
   does not check for any errors at all. This seems risky.

2) Some lines down, the loop
      while (ndbcluster_wait_started($cluster, "--no-contact"))
      {
        select(undef, undef, undef, 0.1);
      }
   will never terminate as long as "ndbcluster_wait_started()" returns non-zero.
   Most likely, it needs a counter limit.

3) The value which "ndb_mgmd_start()" returns to "ndbcluster_start()"
      my $pid= ndb_mgmd_start($cluster);
   is neither used nor analyzed.

Of course, these fixes just correct the hang, not the underlying problem.
But the hang prevents automated test sequences, so it must be prevented.
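
The counter limit suggested in point 2 could look roughly like the following. This is only a minimal sketch: "wait_with_timeout()" is a hypothetical helper, not a function from "mysql-test-run.pl", and the check is passed in as a code reference that returns non-zero while the service is not ready, which is the convention "ndbcluster_wait_started()" appears to follow in the loop quoted above.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical helper, not part of mysql-test-run.pl.
# $check is a code ref that returns non-zero while the service is
# not yet ready. Returns 0 once $check reports ready, or 1 if
# $timeout seconds pass first.
sub wait_with_timeout {
  my ($check, $timeout, $interval) = @_;
  my $deadline = time() + $timeout;
  while ($check->()) {
    return 1 if time() >= $deadline;   # give up instead of hanging forever
    sleep($interval);
  }
  return 0;
}

# Example: a stand-in check that becomes ready on the third poll.
my $polls = 0;
my $rc = wait_with_timeout(sub { return ++$polls < 3 ? 1 : 0 }, 5, 0.1);
print "rc=$rc after $polls polls\n";
```

A caller would treat a return value of 1 as a startup failure and abort the test run with an error instead of polling again.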
[18 Aug 2006 15:10] Joerg Bruehe
I forgot the contents of the log file:

=====
> cat var/ndbcluster-9350/ndb_waiter.log

Connecting to mgmsrv at host=localhost:9350
Unable to connect with connect string: nodeid=0,localhost:9350
latest_error=1011, line=472
Connection to host=localhost:9350 failed

NDBT_ProgramExit: 1 - Failed

>
=====

This appears in the file only after I kill "mysql-test-run.pl",
so either Solaris buffered it ("sunfire100c", Solaris 8),
or it was still kept in some running process (not shown by "ps -ft pts/1").
[18 Aug 2006 15:47] Joerg Bruehe
Detected another omission: The start of "ndb_mgmd" writes a log, here is that of "bsd60-64":

var/ndbcluster-9350/master_ndb_mgmd.log
=====
mysql-test-run: *** ERROR(child): failed to execute "/usr/home/mysqldev/bsd60-64/test/mysql-5.1.12-beta-freebsd6.0-x86_64/libexec/ndb_mgmd": No such file or directory
=====

This gave the essential hint: 
"ndb_mgmd" is now in subdirectory "bin" !

I fixed "mysql-test-run.pl" accordingly, for both "ndb_mgmd" and "ndbd", and the test suite started fine ...
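
One way to make such a lookup tolerate a moved binary is to try several candidate directories and fail loudly if none matches. The sketch below only illustrates that idea and is not the committed patch: "find_exe()" is a hypothetical helper, and the base directory "." is a placeholder.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper, not part of mysql-test-run.pl.
# Returns the path of the first executable file named $name found
# in @dirs, or undef if no directory holds one.
sub find_exe {
  my ($name, @dirs) = @_;
  foreach my $dir (@dirs) {
    my $path = File::Spec->catfile($dir, $name);
    return $path if -f $path && -x _;
  }
  return;
}

# Example: look in "bin" first, then fall back to "libexec".
my $basedir = ".";   # placeholder for the installation directory
my @dirs = map { File::Spec->catdir($basedir, $_) } qw(bin libexec);
my $exe = find_exe("ndb_mgmd", @dirs);
print defined $exe ? "found $exe\n" : "ndb_mgmd not found in: @dirs\n";
```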
[18 Aug 2006 16:24] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/10627

ChangeSet@1.2285, 2006-08-18 18:24:38+02:00, joerg@trift2. +1 -0
  mysql-test-run.pl  :  Fix the search path for "ndb_mgmd" and "ndbd".  bug#21721
[18 Aug 2006 16:40] Joerg Bruehe
I fixed the wrong paths for "ndb_mgmd" and "ndbd" in "mysql-test-run.pl",
so the immediate problem is solved.

But the missing error checks, the infinite loop, etc. remain,
and they should be fixed to prevent such hangs in the future.
To make this happen, I just set this report to lower priority but leave it "verified".
[22 Aug 2006 8:17] Jonas Oreland
Lowering priority and removing showstopper,
  given that it no longer blocks build/test
  and is only a bug in mysql-test-run.pl.

Also changing category to "tests"
  (and resetting lead, as I don't know who is lead for these bug reports).
[31 Aug 2006 8:29] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/11145

ChangeSet@1.2296, 2006-08-31 10:28:48+02:00, msvensson@shellback.(none) +1 -0
  Bug#21721 Test suite does not start with NDB, hangs forever; problem around "ndb_mgmd"
   - Wait for ndb_mgmd with timeout
[31 Aug 2006 8:34] Magnus Blåudd
>1) In "mysql-test-run.pl", function "ndb_mgmd_start()", the call
>      $pid= mtr_spawn($exe_ndb_mgmd, ...);
>   does not check for any errors at all. This seems risky.

Added a check in mtr_spawn that aborts if we try to spawn a nonexistent "path to exe".

>2) Some lines down, the loop
>      while (ndbcluster_wait_started($cluster, "--no-contact"))
>      {
>        select(undef, undef, undef, 0.1);
>      }
>   will never terminate as long as "ndbcluster_wait_started()" returns non-zero.
>   Most likely, it needs a counter limit.

Added a function that will wait with timeout.

>3) The value which "ndb_mgmd_start()" returns to "ndbcluster_start()"
>      my $pid= ndb_mgmd_start($cluster);

Removed the "my $pid" part. With the above, it will either be started or not...
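
For illustration, a pre-spawn check of the kind described for point 1 might be sketched as follows. This is not the actual change to mtr_spawn: "spawn_checked()" is a hypothetical stand-in that only shows the idea of refusing to fork when the executable path does not exist.

```perl
use strict;
use warnings;
use POSIX ();

# Hypothetical stand-in for a checked spawn, not the real mtr_spawn.
# Dies before forking if $path is not an executable file; otherwise
# forks, execs $path with @args in the child, and returns the child pid.
sub spawn_checked {
  my ($path, @args) = @_;
  die "cannot spawn \"$path\": no such executable\n"
    unless -f $path && -x _;
  my $pid = fork();
  die "fork failed: $!\n" unless defined $pid;
  if ($pid == 0) {
    exec { $path } $path, @args;
    # Reached only if exec itself failed.
    warn "exec of \"$path\" failed: $!\n";
    POSIX::_exit(127);
  }
  return $pid;
}

# Example: a missing binary is reported immediately instead of
# leaving the caller waiting for a server that never starts.
my $ok = eval { spawn_checked("/nonexistent/ndb_mgmd"); 1 };
print "spawn refused: $@" unless $ok;
```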
[13 Sep 2006 8:51] Timothy Smith
Pushed to 5.1.12
[14 Sep 2006 3:45] Paul DuBois
Test suite change. No changelog entry needed.