Bug #13023 With --log-slave-updates Exec_master_log_pos of SQL thread lags IO
Submitted: 6 Sep 2005 22:23 Modified: 18 Oct 2005 9:36
Reporter: John Chung Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:4.1.12-standard for ibm-linux-gnu on s39 OS:Linux (SUSE Linux 9)
Assigned to: Guilhem Bichot CPU Architecture:Any

[6 Sep 2005 22:23] John Chung
Description:
In an round-robin replicated environment using two MySQL Server, SQL Thread of MASTER (Server-ID 1) Server stops reading relay-log after issuing DML or DDL. All DML and DDL gets to SLAVE (Server-ID 2) server without any problem. But, if I issue DML or DDL on SLAVE (Server-ID 2) Server, the DML and DDL replicates back to MASTER (Server-ID 1) without any problem and it forces SQL Thread of MASTER ( Server-ID 1) to move but SQL Thread at SLAVE (Server-ID 2) gets stuck.

How to repeat:
How I start mysql:

   nohup $MYSQLD --defaults-file=/opt/MYSQL/mysql/data/my.cnf \
--basedir=/opt/MYSQL/mysql \
--datadir=/opt/MYSQL/mysql/data \
--pid-file=/tmp/mysql.pid \
--skip-locking \
--port=3306 \
--socket=/tmp/mysql.sock \
--log \
--warnings \
--log-slow-queries \
--log-queries-not-using-indexes \
--log-error  &

my.cnf of Server-ID 1
[client]
#password       = your_password
port            = 3306
socket          = /tmp/mysql.sock

# Here follows entries for some specific programs

# The MySQL server
[mysqld]
port            = 3306
socket          = /tmp/mysql.sock
basedir         = /opt/MYSQL/mysql
datadir         = /opt/MYSQL/mysql/data
pid-file        = /tmp/mysql.pid
skip-locking
key_buffer = 16M
max_allowed_packet = 16M
table_cache = 64
sort_buffer_size = 512K
net_buffer_length = 8K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
myisam_sort_buffer_size = 8M
max_binlog_size=100M
max_connections=110
max_user_connections=110

# Don't listen on a TCP/IP port at all. This can be a security enhancement,
# if all processes that need to connect to mysqld run on the same host.
# All interaction with mysqld must be made via Unix sockets or named pipes.
# Note that using this option without enabling named pipes on Windows
# (via the "enable-named-pipe" option) will render mysqld useless!
#
#skip-networking

# Replication Information
log-bin
log-slave-updates
server-id       = 2
# The replication master for this slave - required
master-host     =   10.9.196.152
master-user     =   replicate
master-password =   rep567sec
master-port     =   3306
master-connect-retry=30
report-host     =   mt2idtrack01

# Point the following paths to different dedicated disks
#tmpdir         = /tmp/
#log-update     = /path-to-dedicated-directory/hostname

# Uncomment the following if you are using BDB tables
#bdb_cache_size = 4M
#bdb_max_lock = 10000

# Uncomment the following if you are using InnoDB tables
innodb_data_home_dir = /opt/MYSQL/mysql/data
innodb_data_file_path = ibdata1:10M:autoextend
innodb_log_group_home_dir = /opt/MYSQL/mysql/data
innodb_log_arch_dir = /opt/MYSQL/mysql/data
# You can set .._buffer_pool_size up to 50 - 80 %
# of RAM but beware of setting memory usage too high
innodb_buffer_pool_size = 108M
innodb_additional_mem_pool_size = 32M
# Set .._log_file_size to 25 % of buffer pool size
innodb_log_file_size = 5M
innodb_log_buffer_size = 8M
innodb_flush_log_at_trx_commit = 1
innodb_lock_wait_timeout = 50

[mysqldump]
quick
max_allowed_packet = 16M

[mysql]
no-auto-rehash
# Remove the next comment character if you are not familiar with SQL
#safe-updates

[isamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[myisamchk]
key_buffer = 20M
sort_buffer_size = 20M
read_buffer = 2M
write_buffer = 2M

[mysqlhotcopy]
interactive-timeout

-- After setting up replication do the following:
1. At Server-ID 1
   mysql -uroot -p
   use test;
   create table rep_test (a int);

2. At Server-ID 2
   mysql -uroot -p
   use test;
   insert into rep_test value (1);

   You will be able to see that SQL Thread for Server-ID 1 will be stuck at it's original place.

3. At Server-ID 1
   mysql -uroot -p
   use test;
   insert into rep_test value (2);

   You will be able to see that SQL Thread for Server-ID 2 will be stuck. But, SQL Thread for Server-ID 1 will move 'till the end of Bin-Log.

Suggested fix:
Start issuing DML on both servers to force SQL Thread on both server to move. (This is not good as it will create unnecessary transaction)
[6 Sep 2005 22:28] John Chung
Forgot to mention something important. During my testing, IO and SQL thread was never down. Nor, there was any error logged in mysql error log.
[14 Sep 2005 19:19] John Chung
Although SQL and IO threads are up and running, SQL Thread does not move forward.
[10 Oct 2005 21:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30878
[11 Oct 2005 7:27] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30899
[12 Oct 2005 11:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/30955
[12 Oct 2005 22:40] Guilhem Bichot
fixed in 5.0.15 and 4.1.16:
ChangeSet@1.2473, 2005-10-12 22:29:36+02:00, guilhem@mysql.com
  To force a restart at the end of test, the option file must be non-empty, it's not enough if it exists and is empty.
and
ChangeSet@1.2471, 2005-10-12 13:29:55+02:00, guilhem@mysql.com
  Fix for BUG#13023: "SQL Thread is up but doesn't move forward". Details in slave.cc;
[12 Oct 2005 22:48] Guilhem Bichot
info: the bug was that in a *circular replication* A->B->C->...->A (with possibly C not existing),
*IF* B had --log-slave-updates, then SHOW SLAVE STATUS on A could report an Exec_master_log_pos smaller than the binlog position displayed by SHOW MASTER STATUS on B even though A was fully caught up in terms of data changes. This happened if the last thing which had occured on the A&B setup is an update on A.
The reason was that an UPDATE on A goes to A's binlog, then to B, into B's binlog (because of log-slave-updates) then is sent back to A. A's slave I/O thread ignores the event (because it recognizes it comes from itself in fact, using server ids comparison), and so does not write it to A's relay log, so A's slave SQL thread will never know about this event. So Exec_master_log_pos (as it's maintained by the SQL thread) will never take this event into account.
While this event is of course taken into account by SHOW MASTER STATUS on B as it's in B's binlog. Hence the position mismatch.
A test mysql-test/t/rpl_dual_pos_advance.test has this scenario if needed.
[13 Oct 2005 2:03] Mike Hillyer
Documented in 4.1.16 and 5.0.15 changelogs:

<listitem>
        <para>
          SQL thread could hang in multi-master circular replication.
          (Bug #13023)
        </para>
      </listitem>
[13 Oct 2005 6:58] Guilhem Bichot
Hi Mike,
No, no, it's really like I wrote before, the symptom is a too low Exec_master_log_pos. Not a hang.
The test I pushed was designed to hang in this case, because that was an easy way to test (it hans because it does a MASTER_POS_WAIT() of a position which never advances, thus the client thread which issues MASTER_POS_WAIT() hangs but not the slave SQL or I/O threads).
Please could you change the changelog entry to rather insist on the wrong Exec_master_log_pos (that's the reason the customer reported the bug).
[13 Oct 2005 22:17] James Day
Mike,

New title for this bug. Just changed from:

SQL Thread is up but doesn't move forward

to:

With --log-slave-updates Exec_master_log_pos of SQL thread lags IO [thread in circular replication]

Removing the words "multi-master" is to eliminate the chance for confusion with multiple masters for one server, at the request of the original bug reporter. This revised wording might do well for the changelog.

New bug 13861 covers the unusual case where START SLAVE UNTIL is used, not resolved by this patch.