MySQL Bugs: #13023: With --log-slave-updates Exec_master_log

Bug #13023	With --log-slave-updates Exec_master_log_pos of SQL thread lags IO
Submitted:	6 Sep 2005 22:23	Modified:	18 Oct 2005 9:36
Reporter:	John Chung	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S1 (Critical)
Version:	4.1.12-standard for ibm-linux-gnu on s39	OS:	Linux (SUSE Linux 9)
Assigned to:	Guilhem Bichot	CPU Architecture:	Any

Description:
In an round-robin replicated environment using two MySQL Server, SQL Thread of MASTER (Server-ID 1) Server stops reading relay-log after issuing DML or DDL. All DML and DDL gets to SLAVE (Server-ID 2) server without any problem. But, if I issue DML or DDL on SLAVE (Server-ID 2) Server, the DML and DDL replicates back to MASTER (Server-ID 1) without any problem and it forces SQL Thread of MASTER ( Server-ID 1) to move but SQL Thread at SLAVE (Server-ID 2) gets stuck.

How to repeat:
How I start mysql:

   nohup $MYSQLD --defaults-file=/opt/MYSQL/mysql/data/my.cnf \
--basedir=/opt/MYSQL/mysql \
--datadir=/opt/MYSQL/mysql/data \
--pid-file=/tmp/mysql.pid \
--skip-locking \
--port=3306 \
--socket=/tmp/mysql.sock \
--log \
--warnings \
--log-slow-queries \
--log-queries-not-using-indexes \
--log-error  &

my.cnf of Server-ID 1
[client]
#password       = your_password
port            = 3306
socket          = /tmp/mysql.sock

# Here follows entries for some specific programs

# The MySQL server
[mysqld]
port            = 3306
socket          = /tmp/mysql.sock
basedir         = /opt/MYSQL/mysql
datadir         = /opt/MYSQL/mysql/data
pid-file        = /tmp/mysql.pid
skip-locking
key_buffer = 16M
max_allowed_packet = 16M
table_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M
max_binlog_size=100M
max_connections=110
max_user_connections=110

# Don't listen on a TCP/IP port at all. This can be a security enhancement,
# if all processes that need to connect to mysqld run on the same host.
# All interaction with mysqld must be made via Unix sockets or named pipes.
# Note that using this option without enabling named pipes on Windows
# (via the "enable-named-pipe" option) will render mysqld useless!
#
#skip-networking

# Replication Information
log-bin
log-slave-updates
server-id       = 2
# The replication master for this slave - required
master-host     =   10.9.196.152
master-user     =   replicate
master-password =   rep567sec
master-port     =   3306
master-connect-retry=30
report-host     =   mt2idtrack01

# Point the following paths to different dedicated disks
#tmpdir         = /tmp/
#log-update     = /path-to-dedicated-directory/hostname

# Uncomment the following if you are using BDB tables
#bdb_cache_size = 4M
#bdb_max_lock = 10000

# Uncomment the following if you are using InnoDB tables
innodb_data_home_dir = /opt/MYSQL/mysql/data
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /opt/MYSQL/mysql/data
innodb_log_arch_dir = /opt/MYSQL/mysql/data
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
innodb_buffer_pool_size = 108M
innodb_additional_mem_pool_size = 32M
# Set .._log_file_size to 25 % of buffer pool size
innodb_log_file_size = 5M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
#safe-updates

[isamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[myisamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout

-- After setting up replication do the following:
1. At Server-ID 1
   mysql -uroot -p
   use test;
   create table rep_test (a int);

2. At Server-ID 2
   mysql -uroot -p
   use test;
   insert into rep_test value (1);

   You will be able to see that SQL Thread for Server-ID 1 will be stuck at it's original place.

3. At Server-ID 1
   mysql -uroot -p
   use test;
   insert into rep_test value (2);

   You will be able to see that SQL Thread for Server-ID 2 will be stuck. But, SQL Thread for Server-ID 1 will move 'till the end of Bin-Log.

Suggested fix:
Start issuing DML on both servers to force SQL Thread on both server to move. (This is not good as it will create unnecessary transaction)

Forgot to mention something important. During my testing, IO and SQL thread was never down. Nor, there was any error logged in mysql error log.

Although SQL and IO threads are up and running, SQL Thread does not move forward.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30878

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30899

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30955

fixed in 5.0.15 and 4.1.16:
ChangeSet@1.2473, 2005-10-12 22:29:36+02:00, guilhem@mysql.com
  To force a restart at the end of test, the option file must be non-empty, it's not enough if it exists and is empty.
and
ChangeSet@1.2471, 2005-10-12 13:29:55+02:00, guilhem@mysql.com
  Fix for BUG#13023: "SQL Thread is up but doesn't move forward". Details in slave.cc;

info: the bug was that in a *circular replication* A->B->C->...->A (with possibly C not existing),
*IF* B had --log-slave-updates, then SHOW SLAVE STATUS on A could report an Exec_master_log_pos smaller than the binlog position displayed by SHOW MASTER STATUS on B even though A was fully caught up in terms of data changes. This happened if the last thing which had occured on the A&B setup is an update on A.
The reason was that an UPDATE on A goes to A's binlog, then to B, into B's binlog (because of log-slave-updates) then is sent back to A. A's slave I/O thread ignores the event (because it recognizes it comes from itself in fact, using server ids comparison), and so does not write it to A's relay log, so A's slave SQL thread will never know about this event. So Exec_master_log_pos (as it's maintained by the SQL thread) will never take this event into account.
While this event is of course taken into account by SHOW MASTER STATUS on B as it's in B's binlog. Hence the position mismatch.
A test mysql-test/t/rpl_dual_pos_advance.test has this scenario if needed.

Documented in 4.1.16 and 5.0.15 changelogs:

<listitem>
        <para>
          SQL thread could hang in multi-master circular replication.
          (Bug #13023)
        </para>
      </listitem>

Hi Mike,
No, no, it's really like I wrote before, the symptom is a too low Exec_master_log_pos. Not a hang.
The test I pushed was designed to hang in this case, because that was an easy way to test (it hans because it does a MASTER_POS_WAIT() of a position which never advances, thus the client thread which issues MASTER_POS_WAIT() hangs but not the slave SQL or I/O threads).
Please could you change the changelog entry to rather insist on the wrong Exec_master_log_pos (that's the reason the customer reported the bug).

Mike,

New title for this bug. Just changed from:

SQL Thread is up but doesn't move forward

to:

With --log-slave-updates Exec_master_log_pos of SQL thread lags IO [thread in circular replication]

Removing the words "multi-master" is to eliminate the chance for confusion with multiple masters for one server, at the request of the original bug reporter. This revised wording might do well for the changelog.

New bug 13861 covers the unusual case where START SLAVE UNTIL is used, not resolved by this patch.