Bug #48560 Shutdown reporter shuts down mysqld before validators are done
Submitted: 5 Nov 2009 10:35 Modified: 9 Jan 2015 14:39
Reporter: John Embretsen Email Updates:
Status: Won't fix Impact on me:
None 
Category:Tools: Random Query Generator Severity:S3 (Non-critical)
Version: OS:Any
Assigned to: Bernt Marius Johnsen CPU Architecture:Any

[5 Nov 2009 10:35] John Embretsen
Description:
The Shutdown reporter "initiates a graceful shutdown at the end of the test.", which is needed for stability in certain automated test execution environments such as Pushbuild 2. 

(Quoting Philip Stoev: "the thing that can be turned off is the Shutdown, but it may cause PB2 to hang , depending on the phase of the moon")

The problem is that the shutdown is sometimes executed _after_ other operations. In this case the Shutdown reporter is interfering with the ResultsetComparator (or ResultsetComparatorSimplify) validator.

In order to compare results from two servers, these validators finish off by comparing mysql dumps from each server. However, if the Shutdown reporter is also used, the ResultsetComparator validators try to connect to servers that have already been shut down:

# 10:36:28 Reporters: Deadlock, ErrorLog, Backtrace, Shutdown
# 10:36:28 Validators: ResultsetComparatorSimplify
# 10:36:28 Starting 1 processes, 10000 queries each, duration 600 seconds.
# 10:36:29 Started periodic reporting process...
# 10:46:36 Child process completed successfully.
# 10:46:38 Killing periodic reporting process with pid 27965...
# 10:46:38 Shutting down the server...
# 10:46:38 Waiting for mysqld with pid 26272 to terminate...
*******
# 10:46:45 ... waiting complete.
# 10:46:45 Shutting down the server...
# 10:46:45 Waiting for mysqld with pid 26296 to terminate...
****
# 10:46:49 ... waiting complete.
# 10:46:49 Test completed successfully.
# 10:46:49 gentest.pl exited with exit status 0
# 10:46:49 Dumping server on port 11880...
mysqldump: Got error: 2003: Can't connect to MySQL server on '127.0.0.1' (146) when trying to connect
# 10:46:51 Dumping server on port 11882...
mysqldump: Got error: 2003: Can't connect to MySQL server on '127.0.0.1' (146) when trying to connect
# 10:46:52 Comparing SQL dumps...
# 10:46:52 No differences were found between servers.
Thu Nov  5 10:46:52 2009 [26256] runall.pl will exit with exit status 0

This may be an issue with the Shutdown reporter, the ResultsetComparator validator(s), both, or none of them.

How to repeat:
Example command showing this problem. Replace paths (basedir and vardir) with your own.

perl runall.pl \
--mysqld=--loose-innodb-lock-wait-timeout=5 \
--mysqld=--table-lock-wait-timeout=5 \
--mysqld=--skip-safemalloc \
--mysqld=--log-output=file \
--gendata=conf/partition_pruning.zz \
--grammar=conf/partition_pruning.yy \
--basedir1=/export/home/tmp/qauser/tmp-snapshots/mysql-6.0.14-alpha-solaris10-sparc \
--basedir2=/export/home/mysql-releases/mysql-5.0 \
--vardir1=/export/home/tmp/qauser/tmp-snapshots/mysql-6.0.14-alpha-solaris10-sparc/mysql-test/vardirs/vardir-bzr \
--vardir2=/export/home/tmp/qauser/tmp-snapshots/mysql-6.0.14-alpha-solaris10-sparc/mysql-test/vardirs/vardir-5.0 \
--threads=1 \
--queries=10000 \
--duration=600 \
--validators=ResultsetComparatorSimplify \
--reporters=Deadlock,ErrorLog,Backtrace,Shutdown

Suggested fix:
Implement special handling of the Shutdown reporter so that it is executed after other reporters and validators have finished?
[5 Nov 2009 10:37] John Embretsen
Correction to a sentence above: The problem is that the shutdown is sometimes exexuted _before_ other operations have finished.
:)
[5 Nov 2009 11:12] Philip Stoev
The final dump + compare between two servers is not a Validator or a Reporter, it is a hard-coded part at the end of the runall.pl script.

The general solution for this bug is to remove the "scripts calling scripts" paradigm and to match the starting and the stopping of the server so that they happen within the same logical block.

Currently, the server is started in runall.pl and stopped by a Reporter in gentest.pl , which is not the same logical block.

Also, there is a killall statement at the end of pb2gentest.pl in order to prevent any stray mysqlds from causing PB2 to hang. PB2 expects that all started processes have exited before it declares a test complete, not just the main process.
[5 Nov 2009 12:10] John Embretsen
On closer inspection it seems this has nothing to do with the ResultsetComparator per se, because the decision to compare db dumps is made in runall.pl, based on the number of basedirs specified (or if using replication):

#
# Compare master and slave, or two masters
#

if ($rpl_mode || (defined $basedirs[1])) {
...

Also, the comparison is made using system(diff), which probably won't work well on Windows...
[5 Nov 2009 12:11] John Embretsen
(Sorry, I did not see Philip's comment before sending in my last comment.)
[5 Nov 2009 12:29] Philip Stoev
diff.exe is available on PB2 machines, in a directory that is added to PATH in pb2gentest.pl

But John is right in principle, we should remove our dependency on diff if possible. However, MTR have a long and sad history with this. Furthermore, perl-based diff modules take a huge ammount of memory, so a C implementation would probably be inevitably needed.