MySQL Bugs: #58718: Second rpl test sporadically fails with error 1220

Bug #58718	Second rpl test sporadically fails with error 1220
Submitted:	3 Dec 2010 18:53	Modified:	8 Feb 2011 10:19
Reporter:	Nirbhay Choubey
Status:	Closed
Category:	Tools: MTR / mysql-test-run	Severity:	S3 (Non-critical)
Version:	mysql-5.5, 5.6.1	OS:	Any
Assigned to:	Bjørn Munch	CPU Architecture:	Any
Tags:	regression

Description:
When running 2 replication tests in succession, the second rpl test
sporadically fails giving error 1220.

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
rpl_bug.rpl_1                            [ pass ]  10542

rpl_bug.rpl_2                            [ fail ]
        Test ended at 2010-12-03 19:33:42

CURRENT_TEST: rpl_bug.rpl_2
mysqltest: In included file "./include/show_rpl_debug_info.inc": 
At line 82: query 'SHOW BINLOG EVENTS IN '$master_binlog_name_sql'' failed: 1220: Error when executing command SHOW BINLOG EVENTS: Could not find target log

The result from queries just before the failure was:
< snip >
[on master]

SELECT NOW();
NOW()
2010-12-03 21:33:41
**** SHOW MASTER STATUS on master ****
SHOW MASTER STATUS;
File    master-bin.000001
Position        107
Binlog_Do_DB
Binlog_Ignore_DB

**** SHOW PROCESSLIST on master ****
SHOW PROCESSLIST;
Id      User    Host    db      Command Time    State   Info
10      root    localhost       test    Sleep   301             NULL
11      root    localhost:50117 test    Query   0       NULL    SHOW PROCESSLIST
12      root    localhost:50118 test    Sleep   301             NULL

**** SHOW BINLOG EVENTS on master ****  

Note: the 2nd test fails only when the 1st test's internal check
      passes (and this happens in a *random* fashion).

How to repeat:
perl mtr --suite=bug rpl_1 rpl_2

Testcase for this bug.

Attachment: bug.tar.gz (application/x-gzip, text), 509 bytes.

Some rpl tests need the server restarted before and/or after they have been run. The reason the second test doesn't fail when the "internal check" for the first test fails is that a failed check results in a server restart.

If I try the supplied example, I have to use --nocheck-testcase and then the second test just hangs. So I can't say what is the exact cause in this case.

The fix for Bug #49978 is adding some cleanup to rpl tests. But if a test still needs to restart the server after it's run, it can add this at any point in the test:

call mtr.force_restart();

If a test has to start on a fresh server for some reason, add this to the <test>-master.opt file:

--force-restart

When this happens, it's usually the first test that leaves the DB in a state which affects the next test. There is nothing mtr/mysqltest can do about that.

In this example, experiments show that it's the CHANGE MASTER in the first test that does it; if I comment that out, the second test also works.

In general, each test should if possible reset the state. Bug #49978 is fixing some of that. If that's not possible, a restart may be forced by "call mtr.force_restart();" as mentioned previously.

Changing category to Tests/Replication.

Thank you for the report.

Verified as described. Not repeatable with 5.1.

I don't think there is a bug here. In the test presented by Nirbhay
there is a change in replication setup data, in particular the
following:
 
 1. a different rpl user ('rpl'), used to connect to the master, is created
 2. replication slave threads are stopped

 3. IO thread connection details are changed so that it now uses the
    'rpl' user: CHANGE MASTER

 4. replication is started again
 5. the 'rpl' user is dropped on the master
 6. replication slave is stopped
 7. test file ends

This means that the replication test did not reset the connection data,
so when MTR starts the second test, the IO thread will fail to start
(the slave is still configured to connect as the dropped 'rpl' user).
Had the test writer reset the slave data used to connect to the master
in rpl_1, there would be no issue at all when MTR runs rpl_2.

In fact, this probably has nothing to do with MTR running a second
test case (rpl_2) after rpl_1. For instance, things would go just as
wrong if we added a subtest to rpl_1 after the existing test
instructions without resetting the replication connection data, for
example by adding the following two lines after the last 'stop
slave;':

  start slave;
  -- source include/wait_for_slave_io_to_start.inc

I think that, as Bjorn states, for such functional/structural rpl
changes, the test writer must either reset the slave's state or
force a server restart (which implicitly resets the slave's
data).
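
A minimal sketch of the first option, placed at the end of rpl_1.test after
the final 'stop slave;'. The attachment is not reproduced in this report, so
the details are assumptions: in particular, that the default MTR replication
setup connects as 'root' with an empty password, and that the slave connection
is named 'slave' as in the standard master-slave setup.

```mysqltest
# Restore the connection data changed earlier in the test.
--connection slave

# Assumption: the default MTR setup replicates as 'root' with no password.
# The slave threads are already stopped at this point (step 6 above).
CHANGE MASTER TO MASTER_USER='root', MASTER_PASSWORD='';

# Restart replication and wait until both slave threads are running,
# so the next test starts from a working topology.
--source include/start_slave.inc
```

If restoring the state this way is not practical for a given test, the
alternative is still "call mtr.force_restart();" as described earlier.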

Nirbhay, were you thinking of something more specific that I
failed to spot?

8<8<8<8<8<
<nirbhay> And now I see 'Timeout in include/wait_for_slave_param.inc'
          failure in the 2nd test.
<luis> yes
<nirbhay> Possibly due to after effects of fix for bug#49978.
<luis> possibly
<luis> but in the 1st rpl test, the replication topology is
       effectively broken and there is no way that MTR will
       notice that
<luis> restoring the topology is responsibility of the test
       writer, so that further tests are not affected
<nirbhay> I see.
<luis> the problem here is that the replication setup was broken
       by the tester and there is no way to recover unless the test
       writer resets it or MTR is instructed to restart servers from
       scratch (with defaults)
<nirbhay> I was thinking of, if there is a way for MTR to sense such
       broken topology, and automatically force_restart from the next
       test.
<luis> right, but then you would have to ask bjorn that he implements
       something like checktestcase for MTR  so that it would check 
       SHOW SLAVE STATUS output and if it was not according to the 
       expected, then it would restart the servers
<luis> not sure how feasible it is though..
8<8<8<8<8<

Bjorn,
    Is there a way for MTR to sense a broken topology and force a
    restart for the subsequent test, if it executes a test written in
    a way that breaks the replication topology (as rpl_1.test does)?
    (Instead of making the test writer conform to some correct
    rpl test format.)

To the last comment: no, that would require MTR itself to have much more detailed knowledge than I think it ought to have, and it would have to do that for every test.

What might be possible, is to write some test code for checking topology and then do "call mtr.force_restart()" only if necessary.
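
A rough sketch of what such test code could look like, for example in a common include file sourced at the end of a test. Which values count as a healthy topology is an assumption here (both slave threads running, and the slave connecting as 'root'); a real check would have to match whatever the suite expects.

```mysqltest
# Hypothetical topology check: force a restart only if the slave
# is not in the state the next test is assumed to expect.
--connection slave

let $io_running= query_get_value(SHOW SLAVE STATUS, Slave_IO_Running, 1);
let $sql_running= query_get_value(SHOW SLAVE STATUS, Slave_SQL_Running, 1);
let $master_user= query_get_value(SHOW SLAVE STATUS, Master_User, 1);

# Assumption: 'Yes'/'Yes' and MASTER_USER='root' describe the default setup.
if (`SELECT '$io_running' <> 'Yes' OR '$sql_running' <> 'Yes' OR '$master_user' <> 'root'`)
{
  call mtr.force_restart();
}
```

This keeps the knowledge about the topology in the test code rather than in MTR itself.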

It shouldn't be MTR's responsibility to check the database for any possible inconsistency or change after a test run. If this is something that's not covered by the general "check-testcase" then it would need to be coded in the test itself (or in some common include file used by several tests).

Any test that is thought likely to cause trouble for the following tests, even when it succeeds, can avoid the trouble by forcing a restart, as explained in a previous comment.

Closing to get it off my list