| Bug #21721 | Test suite does not start with NDB, hangs forever; problem around "ndb_mgmd" | ||
|---|---|---|---|
| Submitted: | 18 Aug 2006 14:27 | Modified: | 14 Sep 2006 3:45 |
| Reporter: | Joerg Bruehe | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Tests | Severity: | S7 (Test Cases) |
| Version: | 5.1.12-pre | OS: | Any (all) |
| Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
[18 Aug 2006 15:10]
Joerg Bruehe
I forgot the contents of the log file:
=====
> cat var/ndbcluster-9350/ndb_waiter.log
Connecting to mgmsrv at host=localhost:9350
Unable to connect with connect string: nodeid=0,localhost:9350
latest_error=1011, line=472
Connection to host=localhost:9350 failed
NDBT_ProgramExit: 1 - Failed
>
=====
This appears in the file only after I kill "mysql-test-run.pl",
so either Solaris buffered it ("sunfire100c", Solaris 8),
or it was still kept in some running process (not shown by "ps -ft pts/1").
[18 Aug 2006 15:47]
Joerg Bruehe
Detected another omission: The start of "ndb_mgmd" writes a log, here is that of "bsd60-64":
var/ndbcluster-9350/master_ndb_mgmd.log
=====
mysql-test-run: *** ERROR(child): failed to execute "/usr/home/mysqldev/bsd60-64/test/mysql-5.1.12-beta-freebsd6.0-x86_64/libexec/ndb_mgmd": No such file or directory
=====
This gave the essential hint: "ndb_mgmd" is now in subdirectory "bin" !
I fixed "mysql-test-run.pl" accordingly, for both "ndb_mgmd" and "ndbd", and the test suite started fine ...
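The lookup problem described above (a binary that moved from "libexec" to "bin") could be made less fragile by probing several candidate subdirectories instead of hard-coding one. The following is only a minimal sketch: the helper name find_exe() and its argument order are hypothetical and are not taken from "mysql-test-run.pl" or from the committed patch.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper (not part of mysql-test-run.pl): return the path of
# the first executable file called $name found in one of @subdirs below
# $basedir, or undef if no candidate directory contains it.
sub find_exe {
    my ($basedir, $name, @subdirs) = @_;
    for my $sub (@subdirs) {
        my $path = File::Spec->catfile($basedir, $sub, $name);
        return $path if -f $path && -x $path;
    }
    return;    # undef in scalar context: caller must treat this as an error
}
```

A caller would then fail early with a clear message, for example `my $exe = find_exe($basedir, "ndb_mgmd", "libexec", "bin") or die "ndb_mgmd not found\n";`, rather than discovering the missing binary only in a child process log.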
[18 Aug 2006 16:24]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/10627
ChangeSet@1.2285, 2006-08-18 18:24:38+02:00, joerg@trift2. +1 -0
mysql-test-run.pl : Fix the search path for "ndb_mgmd" and "ndbd". bug#21721
[18 Aug 2006 16:40]
Joerg Bruehe
I fixed the wrong paths for "ndb_mgmd" and "ndbd" in "mysql-test-run.pl", so the immediate problem is solved. But still, the missing error checks, infinite loop etc. remain, which should be fixed to prevent such hangs in the future. To make this happen, I just set this report to lower priority but leave it "verified".
[22 Aug 2006 8:17]
Jonas Oreland
Lowering prio and removing showstopper, given that it no longer blocks build/test and is only a bug in mysql-test-run.pl. Also changing category to "tests" (and resetting lead, as I don't know who is lead of these bug reports).
[31 Aug 2006 8:29]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/11145
ChangeSet@1.2296, 2006-08-31 10:28:48+02:00, msvensson@shellback.(none) +1 -0
Bug#21721 Test suite does not start with NDB, hangs forever; problem around "ndb_mgmd"
- Wait for ndb_mgmd with timeout
[31 Aug 2006 8:34]
Magnus Blåudd
>1) In "mysql-test-run.pl", function "ndb_mgmd_start()", the call
> $pid= mtr_spawn($exe_ndb_mgmd, ...);
> does not check for any errors at all. This seems risky.
Added a check in mtr_spawn that will abort if we are trying to spawn a nonexistent "path to exe".
>2) Some lines down, the loop
> while (ndbcluster_wait_started($cluster, "--no-contact"))
> {
> select(undef, undef, undef, 0.1);
> }
> will never terminate as long as "ndbcluster_wait_started()" returns
>non-zero.
> Most likely, it needs a counter limit.
Added a function that will wait with timeout.
>3) The value which "ndb_mgmd_start()" returns to "ndbcluster_start()"
> my $pid= ndb_mgmd_start($cluster);
Removed the "my $pid" part. With the above, it will either be started or not...
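The two fixes described in this comment (refuse to spawn a missing executable, and wait with a timeout instead of looping forever) can be illustrated roughly as follows. This is a sketch under stated assumptions, not the committed patch: the helper names assert_executable() and wait_with_timeout() are hypothetical, and the check callback here returns true on success, whereas "ndbcluster_wait_started()" in the quoted loop returns non-zero while the cluster is not yet started.

```perl
use strict;
use warnings;
use Time::HiRes qw(time sleep);

# Hypothetical pre-spawn check: die early with a clear message instead of
# forking a child that cannot exec the requested program.
sub assert_executable {
    my ($path) = @_;
    die "Cannot spawn \"$path\": no such executable\n"
        unless -f $path && -x $path;
    return $path;
}

# Hypothetical bounded wait: call $check (a code ref returning true once
# the service is up) every $interval seconds, giving up after $timeout
# seconds. Returns 1 on success, 0 on timeout.
sub wait_with_timeout {
    my ($check, $timeout, $interval) = @_;
    my $deadline = time() + $timeout;
    while (1) {
        return 1 if $check->();
        return 0 if time() >= $deadline;
        sleep($interval);
    }
}
```

With such helpers, the quoted loop could become something like `wait_with_timeout(sub { !ndbcluster_wait_started($cluster, "--no-contact") }, 30, 0.1) or die "ndb_mgmd did not start\n";`, where the 30-second limit is an arbitrary illustrative value.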
[13 Sep 2006 8:51]
Timothy Smith
Pushed to 5.1.12
[14 Sep 2006 3:45]
Paul DuBois
Test suite change. No changelog entry needed.

Description:
This was detected in a test build of the current 5.1 tree, pulled 2006-Aug-16.
Symptom is that builds and tests are (rather) ok until it reaches the first test run with NDB. When this run starts, it hangs forever. Last lines in the log:
=====
Running tests using PS + RBR + NDB
perl ./mysql-test-run.pl --ps-protocol --comment="ps+rowrepl+NDB" --mysqld=--binlog-format=row --tmpdir=/export/spare/mysqldev/tmp/my_build-sunfire100c --master_port=3334 --slave_port=3350 --timer --force --im-port=3337 --im-mysqld1-port=3339 --im-mysqld2-port=3341 --with-ndbcluster --ndbcluster_port=9350
.....
Removing Stale Files
Installing Master Database
Installing Master Database
Installing Slave Database
Installing Slave Database
Installing Slave Database
Creating IM password file (/export/spare/mysqldev/sunfire100c/test/mysql-5.1.12-beta-solaris8-sparc/mysql-test/var/im.passwd)
Installing Im_mysqld_1 Database
Installing Im_mysqld_2 Database
Installing Master Cluster
=====
In this state, it hangs until the test script is killed. The test script has no child processes at this moment.
Retrying manually, I could reproduce that hang. Running it "--verbose", I could see it loop with calls to "ndbcluster_wait_started()". Adding debugging output proved it was with the "--no-contact" parameter, so it is the loop in "ndb_mgmd_start()". This part of the script was modified after the last 5.1 build.

How to repeat:
Build and test with NDB on any platform.

Suggested fix:
1) In "mysql-test-run.pl", function "ndb_mgmd_start()", the call
     $pid= mtr_spawn($exe_ndb_mgmd, ...);
   does not check for any errors at all. This seems risky.
2) Some lines down, the loop
     while (ndbcluster_wait_started($cluster, "--no-contact"))
     {
       select(undef, undef, undef, 0.1);
     }
   will never terminate as long as "ndbcluster_wait_started()" returns non-zero.
   Most likely, it needs a counter limit.
3) The value which "ndb_mgmd_start()" returns to "ndbcluster_start()"
     my $pid= ndb_mgmd_start($cluster);
   is neither used nor analyzed.
Of course, these fixes just correct the hang, not the underlying problem. But the hang prevents automated test sequences, so it must be prevented.