MySQL Bugs: #59037: MASTER_POS_WAIT may return prematurely after CHANGE MASTER TO RELAY_LOG

Bug #59037	MASTER_POS_WAIT may return prematurely after CHANGE MASTER TO RELAY_LOG_POS
Submitted:	19 Dec 2010 13:53	Modified:	13 Dec 2012 18:03
Reporter:	Sven Sandberg
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.1+	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	CHANGE MASTER, disabled, master_pos_wait

Description:
The function MASTER_POS_WAIT waits for the slave SQL thread to execute up to a given position. The position is given in master binlog coordinates.

The statement CHANGE MASTER TO RELAY_LOG_POS sets the read position for the SQL thread. Here, the position is given in slave relay log coordinates.

Suppose the following happens:
 1. CHANGE MASTER TO RELAY_LOG_POS is executed in such a way that the
    position is moved back in the relay log, say from position B to A where
    A < B
 2. Then START SLAVE is executed
 3. Then MASTER_POS_WAIT is called with a master binlog position whose event
    lies between A and B in the relay log; call the corresponding relay log
    position C, where A < C <= B.

Then MASTER_POS_WAIT can return prematurely, before the position C has been reached.
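
The scenario above can be illustrated with a small toy model. This is only a sketch of the suspected mechanism, not the server's actual code: the class `SqlThreadState`, both helper functions, and the relay log offset 338 are hypothetical names and values chosen for illustration. The master position 192 is taken from the test run shown later in this report.

```python
from dataclasses import dataclass


@dataclass
class SqlThreadState:
    """Toy model of the slave SQL thread's two coordinate systems."""
    relay_log_pos: int   # read position in the slave relay log
    master_log_pos: int  # last executed position, in master binlog coordinates


def change_master_relay_log_pos(state: SqlThreadState, new_relay_pos: int) -> None:
    # Modeled bug (assumption): only the relay log coordinate moves; the
    # master coordinate keeps the value it had before the statement.
    state.relay_log_pos = new_relay_pos


def master_pos_wait_would_return(state: SqlThreadState, target_master_pos: int) -> bool:
    # The wait is considered satisfied as soon as the recorded master
    # position has reached the target.
    return state.master_log_pos >= target_master_pos


# The SQL thread had executed everything up to master position 192.
state = SqlThreadState(relay_log_pos=338, master_log_pos=192)

# Step 1: rewind the relay log read position (B = 338 back to A = 4).
change_master_relay_log_pos(state, 4)

# Step 3: the wait compares against the stale master position.
premature = master_pos_wait_would_return(state, 192)
print(premature)
```

In this model the wait is satisfied immediately, even though nothing between the rewound position and the target has been re-executed.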

How to repeat:
# The following test shows that even after MASTER_POS_WAIT,
# table t1 does not exist on slave. It contains a race, so
# it is not guaranteed to always show the bug.

--source include/have_binlog_format_statement.inc
--source include/master-slave.inc

CREATE TABLE t1 (a INT);
--let $master_file= query_get_value(SHOW MASTER STATUS, File, 1)
--let $master_pos= query_get_value(SHOW MASTER STATUS, Position, 1)
--sync_slave_with_master
STOP SLAVE;
DROP TABLE t1;
CHANGE MASTER TO RELAY_LOG_POS = 4;
START SLAVE;
eval SELECT MASTER_POS_WAIT('$master_file', $master_pos, 0);
SHOW TABLES;
--sleep 1
SHOW TABLES;

Suggested fix:
Probably CHANGE MASTER TO RELAY_LOG_POS only updates the SQL thread's slave relay log coordinates, not the SQL thread's master binlog coordinates. The subsequent call to MASTER_POS_WAIT then thinks that the master binlog coordinates that were active before CHANGE MASTER are still active.

There are two possible approaches to fix this bug:

 1. Make CHANGE MASTER TO RELAY_LOG_POS set a flag that indicates the master
    binlog coordinates are invalid, and make MASTER_POS_WAIT wait until this
    flag is cleared. The flag is cleared after the SQL thread has read the
    first event.

 2. Make CHANGE MASTER TO RELAY_LOG_POS read the master position in the
    first event.
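
As an illustration of the second approach, here is a minimal sketch under simplifying assumptions. `RelayEvent`, `SqlThreadState`, the helper function, and all offsets in the sample relay log are hypothetical; the model assumes each relay log event records where it started in the master binlog, which simplifies what a real relay log stores.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RelayEvent:
    relay_pos: int         # offset of the event in the relay log
    master_start_pos: int  # master binlog offset where the event starts
    master_end_pos: int    # master binlog offset just after the event


@dataclass
class SqlThreadState:
    relay_log_pos: int
    master_log_pos: int


def change_master_relay_log_pos(
    state: SqlThreadState, relay_log: List[RelayEvent], new_relay_pos: int
) -> None:
    """Approach 2 (sketch): take the master position from the first event."""
    state.relay_log_pos = new_relay_pos
    # Find the first event at or after the new relay log position.
    first = next((ev for ev in relay_log if ev.relay_pos >= new_relay_pos), None)
    if first is not None:
        # Nothing from this event onward counts as executed any more.
        state.master_log_pos = first.master_start_pos


# Hypothetical relay log with two events.
relay_log = [RelayEvent(4, 4, 106), RelayEvent(106, 106, 192)]
state = SqlThreadState(relay_log_pos=194, master_log_pos=192)

change_master_relay_log_pos(state, relay_log, 4)
wait_satisfied = state.master_log_pos >= 192
print(state.master_log_pos, wait_satisfied)
```

With the master position rewound together with the relay log position, a wait for position 192 is no longer satisfied until the events have been re-executed. The cost of this approach is that CHANGE MASTER itself has to open and read the relay log.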

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127257

3213 Sven Sandberg	2010-12-19
      This part of the test fails sporadically because of BUG#59037.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/127258

3448 Sven Sandberg	2010-12-19 [merge]
      Merged patch that disabled part of rpl_change_master
      due to BUG#59037 from 5.5-bugteam to trunk-bugfixing.

No patch has been pushed for this bug.
A patch that disables part of rpl_change_master.test that fails due to this
bug will soon be pushed to 5.5-bugteam and trunk-bugfixing.

I have not yet been able to repeat this with 5.5.8, but with 5.1.54 it is easily repeatable:

macbook-pro:mysql-test openxs$ ./mtr bug59037
Logging: ./mtr  bug59037
101219 17:28:27 [Warning] Setting lower_case_table_names=2 because file system for /var/folders/dX/dXCzvuSlHX4Op1g-o1jIWk+++TI/-Tmp-/FL7ceeI9De/ is case insensitive
101219 17:28:27 [Note] Plugin 'FEDERATED' is disabled.
101219 17:28:27 [Note] Plugin 'ndbcluster' is disabled.
MySQL Version 5.1.54
Checking supported features...
 - skipping ndbcluster
 - SSL connections supported
 - binaries are debug compiled
Collecting tests...
vardir: /Users/openxs/dbs/5.1/mysql-test/var
Checking leftover processes...
Removing old var directory...
Creating var directory '/Users/openxs/dbs/5.1/mysql-test/var'...
Installing system database...
Using server port 63053

==============================================================================

TEST                                      RESULT   TIME (ms)
------------------------------------------------------------

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
main.bug59037                            [ fail ]
        Test ended at 2010-12-19 17:28:38

CURRENT_TEST: main.bug59037
--- /Users/openxs/dbs/5.1/mysql-test/r/bug59037.result	2010-12-19 18:28:24.000000000 +0300
+++ /Users/openxs/dbs/5.1/mysql-test/r/bug59037.reject	2010-12-19 18:28:38.000000000 +0300
@@ -0,0 +1,19 @@
+stop slave;
+drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
+reset master;
+reset slave;
+drop table if exists t1,t2,t3,t4,t5,t6,t7,t8,t9;
+start slave;
+CREATE TABLE t1 (a INT);
+STOP SLAVE;
+DROP TABLE t1;
+CHANGE MASTER TO RELAY_LOG_POS = 4;
+START SLAVE;
+SELECT MASTER_POS_WAIT('master-bin.000001', 192, 0);
+MASTER_POS_WAIT('master-bin.000001', 192, 0)
+0
+SHOW TABLES;
+Tables_in_test
+SHOW TABLES;
+Tables_in_test
+t1
...

Pushed into mysql-trunk 5.6.1 (revid:alexander.nozdrin@oracle.com-20101222212842-y0t3ibtd32wd9qaw) (version source revid:alexander.nozdrin@oracle.com-20101222212842-y0t3ibtd32wd9qaw) (merge vers: 5.6.1) (pib:24)

Pushed into mysql-5.5 5.5.9 (revid:alexander.nozdrin@oracle.com-20101229113652-km2v993aurv7h79j) (version source revid:alexander.nozdrin@oracle.com-20101229113132-uonlbcc2uopff8yb) (merge vers: 5.5.9) (pib:24)

Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Fixed in trunk. Documented in the 5.7.1 changelog as follows:

        It was possible for the MASTER_POS_WAIT() function to return
        prematurely following a CHANGE MASTER TO statement that updated
        the RELAY_LOG_POS or RELAY_LOG_NAME. This could happen because
        CHANGE MASTER TO did not update the master log position in such
        cases, causing MASTER_POS_WAIT() to read an invalid log position
        and to return immediately.

        To fix this problem, the master log position is flagged as
        invalid until the position is set to a valid value when the SQL
        thread reads the first event, after which it is flagged as
        valid. Functions such as MASTER_POS_WAIT() now defer any
        comparison with the master log position until a valid value can
        be obtained (that is, after the first event following the CHANGE
        MASTER TO statement has been applied).
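
The behavior described in the changelog entry can be sketched as a position guarded by a validity flag. This is a toy model, not the server implementation: the class `MasterPosition` and its methods are hypothetical, and `wait_for` returns a boolean rather than the event count or -1 that MASTER_POS_WAIT() returns.

```python
import threading
from typing import Optional


class MasterPosition:
    """Toy model: master binlog position guarded by a validity flag."""

    def __init__(self, pos: int) -> None:
        self._cond = threading.Condition()
        self._pos = pos
        self._valid = True

    def invalidate(self) -> None:
        # Modeled effect of CHANGE MASTER TO RELAY_LOG_POS / RELAY_LOG_NAME.
        with self._cond:
            self._valid = False

    def event_applied(self, master_end_pos: int) -> None:
        # Modeled SQL thread: record the new position and mark it valid.
        with self._cond:
            self._pos = master_end_pos
            self._valid = True
            self._cond.notify_all()

    def wait_for(self, target: int, timeout: Optional[float] = None) -> bool:
        # Defer the comparison until the position is valid again.
        # Returns False if the timeout expires first.
        with self._cond:
            return self._cond.wait_for(
                lambda: self._valid and self._pos >= target, timeout
            )


pos = MasterPosition(192)
pos.invalidate()

# The stale value 192 is ignored while the flag says "invalid".
returned_early = pos.wait_for(192, timeout=0.05)

# A modeled SQL thread re-applies two events ending at 106 and 192.
applier = threading.Thread(
    target=lambda: [pos.event_applied(p) for p in (106, 192)]
)
applier.start()
reached = pos.wait_for(192, timeout=5.0)
applier.join()
print(returned_early, reached)
```

The first wait times out instead of returning on the stale position; the second returns only after the modeled SQL thread has actually reached the target.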

Closed.