MySQL Bugs: #26622: MASTER_POS_WAIT does not work as documented

Bug #26622	MASTER_POS_WAIT does not work as documented
Submitted:	25 Feb 2007 20:05	Modified:	26 Mar 2008 20:16
Reporter:	Baron Schwartz (Basic Quality Contributor)
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.0.24a-Debian_9-log, 4.1 BK, 5.1 BK, 5.2-falcon	OS:	Linux (Linux, Ubuntu)
Assigned to:	Andrei Elkin	CPU Architecture:	Any
Tags:	qc

Description:
MASTER_POS_WAIT() returns an unexpected value when the server is not a slave: it returns -1 instead of the documented NULL.

How to repeat:
mysql> help master_pos_wait
Name: 'MASTER_POS_WAIT'
Description:
Syntax:
MASTER_POS_WAIT(log_name,log_pos[,timeout])

This function is useful for control of master/slave synchronization. It
blocks until the slave has read and applied all updates up to the
specified position in the master log. The return value is the number of
log events the slave had to wait for to advance to the specified
position. The function returns NULL if the slave SQL thread is not
started, the slave's master information is not initialized, the
arguments are incorrect, or an error occurs. It returns -1 if the
timeout has been exceeded. If the slave SQL thread stops while
MASTER_POS_WAIT() is waiting, the function returns NULL. If the slave
is past the specified position, the function returns immediately.

mysql> show slave status;
Empty set (0.00 sec)

mysql> select master_pos_wait('foo', 98);
+----------------------------+
| master_pos_wait('foo', 98) |
+----------------------------+
|                         -1 | 
+----------------------------+
1 row in set (0.00 sec)
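
For reference, the contract quoted in the help text can be written down as a small classifier. This is only a sketch in C++17: the names `WaitOutcome` and `classify_wait_result` are hypothetical and not part of any MySQL client API, and `std::nullopt` is used here to model SQL NULL.

```cpp
#include <optional>

// Hypothetical names, for illustration only.
enum class WaitOutcome { NotAvailable, TimedOut, CaughtUp };

// `result` models the single value returned by SELECT MASTER_POS_WAIT(...);
// std::nullopt stands for SQL NULL.
WaitOutcome classify_wait_result(const std::optional<long long>& result) {
    if (!result.has_value()) {
        // Documented NULL cases: slave SQL thread not started, master
        // information not initialized, incorrect arguments, or an error.
        return WaitOutcome::NotAvailable;
    }
    if (*result == -1) {
        // Documented meaning of -1: the timeout has been exceeded.
        return WaitOutcome::TimedOut;
    }
    if (*result >= 0) {
        // Number of log events the slave had to wait for.
        return WaitOutcome::CaughtUp;
    }
    // Any other negative value is undocumented; treated as unavailable here.
    return WaitOutcome::NotAvailable;
}
```

Under this reading, the -1 in the transcript above would be classified as a timeout even though no timeout argument was passed, which is why the result is misleading on a server that is not a slave.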

test case

Attachment: rpl_bug26622.test (application/octet-stream, text), 171 bytes.

Thank you for the report.

Verified as described. All versions are affected.

Since the help text states that "the function returns NULL if the slave SQL thread is not started", the result must be NULL, not -1.

For the bug fixer: this patch should fix the problem:
===== sql/rpl_rli.cc 1.12 vs edited =====
*** /tmp/bk_rpl_rli.cc-1.12_F40QVS      2007-06-09 08:29:43 +02:00
--- edited/sql/rpl_rli.cc       2007-09-04 18:50:01 +02:00
***************
*** 611,617 ****
    DBUG_ENTER("st_relay_log_info::wait_for_pos");

    if (!inited)
!     DBUG_RETURN(-1);

    DBUG_PRINT("enter",("log_name: '%s'  log_pos: %lu  timeout: %lu",
                        log_name->c_ptr(), (ulong) log_pos, (ulong) timeout));
--- 611,617 ----
    DBUG_ENTER("st_relay_log_info::wait_for_pos");

    if (!inited)
!     DBUG_RETURN(-2);

    DBUG_PRINT("enter",("log_name: '%s'  log_pos: %lu  timeout: %lu",
                        log_name->c_ptr(), (ulong) log_pos, (ulong) timeout));
and something similar for 5.0.
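
The suggested change only makes sense if the caller of wait_for_pos() turns -2 into NULL and passes -1 through as a timeout. The following C++17 sketch illustrates that convention under that assumption. It is not the server source: `wait_for_pos_sketch` and `master_pos_wait_sketch` are hypothetical stand-ins, and `std::nullopt` models SQL NULL.

```cpp
#include <optional>

// Assumed internal convention (inferred from the patch, not verified here):
//   -2  -> cannot wait, e.g. relay log info not initialized
//   -1  -> timeout exceeded
//   >=0 -> number of events waited for
constexpr long long kCannotWait = -2;

// Hypothetical stand-in for st_relay_log_info::wait_for_pos().
// `wait_result` is whatever the real wait would have produced (-1 or >= 0).
long long wait_for_pos_sketch(bool inited, long long wait_result) {
    if (!inited) {
        return kCannotWait;  // the unpatched code returned -1 at this point
    }
    return wait_result;
}

// Hypothetical stand-in for the SQL-level function.
std::optional<long long> master_pos_wait_sketch(bool inited, long long wait_result) {
    const long long rc = wait_for_pos_sketch(inited, wait_result);
    if (rc == kCannotWait) {
        return std::nullopt;  // surfaces to the client as NULL
    }
    return rc;  // -1 (timeout) or an event count
}
```

With the unpatched return value of -1, the "not initialized" case would be indistinguishable from a timeout at the SQL level, which matches the behaviour reported above.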

Is there any hint as to when this bug will be fixed in the 5.0 branch? We were planning to use 5.0.45 in a production environment with our own way of managing replication, and we depend on MASTER_POS_WAIT working as documented. Otherwise we will have to devise a dirty workaround for this.

We have implemented a workaround for this: a while loop that keeps retrying as long as master_pos_wait returns -1 and exits after n retries.
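
A bounded retry loop of the kind described here might look like the sketch below. It is written in C++17 for illustration and is not the commenter's actual code (their setup uses JDBC). `wait_with_retries` and the `run_wait` callback are hypothetical: the callback stands for whatever issues SELECT MASTER_POS_WAIT(...) and returns its value, with `std::nullopt` for NULL.

```cpp
#include <functional>
#include <optional>

// Returns true once a non-negative result is seen.
// Returns false if NULL is returned or if max_retries attempts all yield -1.
bool wait_with_retries(
        const std::function<std::optional<long long>()>& run_wait,
        int max_retries) {
    for (int attempt = 0; attempt < max_retries; ++attempt) {
        const std::optional<long long> result = run_wait();
        if (!result.has_value()) {
            // NULL: treated here as not retryable (replication not running
            // or an error); the caller has to decide what to do next.
            return false;
        }
        if (*result >= 0) {
            return true;  // the slave reached the requested position
        }
        // -1: a timeout, or on affected versions a server that is not a
        // slave at all; try again until the retry budget is exhausted.
    }
    return false;
}
```

On an affected server that is not a slave, every attempt returns -1, so such a loop always uses up its full retry budget before giving up.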

Still, after master_pos_wait incorrectly returns -1 and we break out of the loop and go on using that DB, all subsequent writes are rejected. We are using Apache Commons Pool with the JDBC driver, and we get this error:

Can't execute the query because you have a conflicting read lock

Could this be a side effect of the erroneous master_pos_wait return value? I know our setup is specific (master-master replication, with two connection pools to access the DBs), but I don't know how to reproduce it using your test case.

We have even tried executing UNLOCK TABLES after aborting the while loop, but with no effect. Is there something we might be doing wrong? The user has sufficient privileges. Running UNLOCK TABLES as the root user through the mysql client has no effect either.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/43517

ChangeSet@1.2595, 2008-03-06 14:49:21+02:00, aelkin@mysql1000.(none) +3 -0
  Bug#26622  MASTER_POS_WAIT does not work as documented
  
  MASTER_POS_WAIT return values are different than expected when the server is not a slave. 
  It returns -1 instead of NULL.
  
  Fixed by correcting st_relay_log_info::wait_for_pos() to return the proper
  value when the rli info is not inited.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/43617

ChangeSet@1.2596, 2008-03-07 21:14:28+02:00, aelkin@mysql1000.(none) +6 -0
  Bug #26622 MASTER_POS_WAIT does not work as documented
  
  Fixes for affected tests. After the fix for st_relay_log_info::wait_for_pos(),
  which handles the widely used select('master-bin.xxxx',pos) invoked by
  mysqltest, four tests turned out to either try synchronizing when the slave
  was stopped or use an incorrect synchronization method, such as calling
  `sync_with_master' while the current connection is to the master itself.
  
  Fixed by correcting the current connection and/or using the correct
  synchronization macro where possible.

Pushed into 5.0.60

Pushed into 5.1.24-rc

Pushed into 6.0.5-alpha

Documented in the 5.0.60, 5.1.24, and 6.0.5 changelogs as follows:

        MASTER_POS_WAIT() did not return NULL when the server was not a slave.

Also noted in the 5.1.23-ndb-6.3.11 changelog.