MySQL Bugs: #31190: Exec_Master_Log_Pos in SHOW SLAVE STATUS can be incorrect in some cases

Bug #31190	Exec_Master_Log_Pos in SHOW SLAVE STATUS can be incorrect in some cases
Submitted:	25 Sep 2007 18:31	Modified:	29 Feb 2008 15:51
Reporter:	Harrison Fisk	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.0.30sp1-5.0.46	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	bfsm_2007_10_18, Exec_master_log_pos, replication, stop slave

Description:
In certain circumstances the Exec_master_log_pos value in SHOW SLAVE STATUS can be reported as incorrect.  It reports a very small size (the size of the event it seems) and not the correct value.

The behavior required to reproduce it:

1.  Use transactions
2.  Update a non-transactional table during the transaction
3.  Issue STOP SLAVE at exactly the correct time during log processing

Inside of the transaction log, the end_log_pos is incorrect (as documented).  Normally, this isn't a problem as the slave always keeps track of the beginning of the transaction location to report in SHOW SLAVE STATUS.  However, due to the non-transactional table, it can report the incorrect position in some cases.

Exec_master_log_pos is used for many different things, such as correcting corrupt relay logs, setting up a new slave, etc... so this information *must* be correct.

The show slave status with the incorrect information:

mysql> show slave status\G
*************************** 1. row ***************************
             Slave_IO_State:
                Master_Host: 192.168.1.105
                Master_User: rep
                Master_Port: 3306
              Connect_Retry: 60
            Master_Log_File: hfisk-desktop-bin.000009
        Read_Master_Log_Pos: 289315421
             Relay_Log_File: ubuntu3-relay-bin.001329
              Relay_Log_Pos: 830664
      Relay_Master_Log_File: hfisk-desktop-bin.000009
           Slave_IO_Running: No
          Slave_SQL_Running: No
            Replicate_Do_DB:
        Replicate_Ignore_DB:
         Replicate_Do_Table:
     Replicate_Ignore_Table:
    Replicate_Wild_Do_Table:
Replicate_Wild_Ignore_Table:
                 Last_Errno: 0
                 Last_Error:
               Skip_Counter: 0
        Exec_Master_Log_Pos: 78
            Relay_Log_Space: 1048828
            Until_Condition: None
             Until_Log_File:
              Until_Log_Pos: 0
         Master_SSL_Allowed: No
         Master_SSL_CA_File:
         Master_SSL_CA_Path:
            Master_SSL_Cert:
          Master_SSL_Cipher:
             Master_SSL_Key:
      Seconds_Behind_Master: NULL
1 row in set (0.01 sec)

Error log contents proving that position 78 is incorrect:

070923  3:44:45 [Note] Slave SQL thread initialized, starting replication in log 'hfisk-desktop-bin.000009' at position 288796019, relay log '/usr/local/mysql/data/ubuntu3-relay-bin.001329' position: 528932
070923  3:44:45 [Note] Slave I/O thread: connected to master 'rep@192.168.1.105:3306', replication started in log 'hfisk-desktop-bin.000009' at position 289315421
070923  3:44:48 [Note] Slave I/O thread exiting, read up to log 'hfisk-desktop-bin.000009', position 289315421
070923  3:44:48 [Note] Slave SQL thread exiting, replication stopped in log 'hfisk-desktop-bin.000009' at position 78

Notice that the SQL thread started with:  hfisk-desktop-bin.000009' at position 288796019 and ended with: 'hfisk-desktop-bin.000009' at position 78

I will attach the relay log ubuntu3-relay-bin.001329 for your viewing.

How to repeat:
1.  Setup replication
2.  Create 1 transactional table and 1 non-transaction table, for example:

CREATE TABLE test (
   a int
) ENGINE=memory;

CREATE TABLE test2 (
   a int
) ENGINE=innodb;

3.  Setup a process to insert/truncate from these tables.  The script I used with gypsy was:

n|1|BEGIN;
n|2|INSERT INTO test VALUES (?)|int
n|2|INSERT INTO test2 VALUES (?)|int
n|1|COMMIT;

With truncate occuring occasionally from a batch process.

4.  Create a script to stop replication and check the position, I used:

#/bin/bash
while true; do
        mysqladmin -u root stop-slave
        mysql --vertical -u root -e "show slave status" | grep "Relay_Mast\|Exec"
        FOO=`mysql -e "show slave status\G" | grep "Exec" | awk '{ print $2}'`
        if [ $FOO -lt 1000 ]
        then
                echo "Found one";
                exit
        fi
        sleep 1
        mysqladmin -u root start-slave
        mysql --vertical -u root -e "show slave status" | grep "Relay_Mast\|Exec"
        sleep 1
done

5. Wait a while with both processes running for the script to stop

Suggested fix:
Make the Exec_master_log_pos always correct.  Since Exec_master_log_pos relies on it, this will most likely involve making sure that the end_log_pos is correct for all statements.

Relay log which caused the problem

Attachment: ubuntu3-relay-bin.001329.gz (application/x-gzip, text), 122.83 KiB.

It might be this one is a duplicate for bug#12691. There is a note on bug#12691 page explaining a feature in behaviour of 
"Exec_master_log_pos is not getting corrupted although the couter indeed behaves sometimes non-monotonically ...".
As the patch for the earlier bug 
" [27 Nov 2007 11:48] Bugs System Pushed into 5.0.54 "
we indeed may not reproduce the current, in either version.

I am experiencing the same problem.. It's not often, but it happens. We have noticed this on 4.1.22, 5.0.32 (enterprise), 5.0.50 (enterprise).. 

Today for example, it happened again. We stopped replication with "slave stop;", did a "show slave status\G" and issued a "change master..." And the position was wrong. Now, when checking the logs, relay logs and masters binlog, in fact the exec_master_log_pos reported on the slave does not exist on the master. If i would restart the slave before issuing a "change master" i guess there would be no problem, since the slave would continue reading it's relaylogs and so on, but since we wanted to switch from using a ssh tunnel between the master and the slave to our new wan connection, we needed to change host. 

<snip from mysqld.log>
080227  4:27:50 [Note] Slave I/O thread: connected to master 'rep@127.0.0.1:3351',  replication started in log 'x-bin.001507' at position 237826070
080227  4:37:52 [Note] Slave I/O thread exiting, read up to log 'x-bin.001507', position 281271771
080227  4:37:52 [Note] Slave SQL thread exiting, replication stopped in log 'x-bin.001507' at position 1756

080227  4:41:57 [Note] Slave SQL thread initialized, starting replication in log 'x-bin.001507' at position 1756, relay log '/db/x/log/relaylog/x-relay.000001' position: 98
080227  4:41:58 [Note] Slave I/O thread: connected to master 'rep@192.168.30.5:3306',  replication started in log 'x-bin.001507' at position 1756
080227  4:41:58 [ERROR] Error reading packet from server: error reading log entry ( server_errno=1236)
080227  4:41:58 [ERROR] Got fatal error 1236: 'error reading log entry' from master when reading data from binary log
080227  4:41:58 [Note] Slave I/O thread exiting, read up to log 'x-bin.001507', position 1756
080227  4:44:33 [Note] Error reading relay log event: slave SQL thread was killed
</snip from mysqld.log>

<snip from binlog>

[root@host binlog]# mysqlbinlog --start-position=1756 x-bin.001507
/*!40019 SET @@session.max_insert_delayed_threads=0*/;
/*!50003 SET @OLD_COMPLETION_TYPE=@@COMPLETION_TYPE,COMPLETION_TYPE=0*/;
ERROR: Error in Log_event::read_log_event(): 'Event too big', data_len: 1751607660, event_type: 34
# End of log file
ROLLBACK /* added by mysqlbinlog */;
/*!50003 SET COMPLETION_TYPE=@OLD_COMPLETION_TYPE*/;

</snip from binlog>

This could be serious if you have a slave that you really depend upon.. In this case, we could just roll back to a snapshot of this backup database (we do binary snapshots in the SAN for this backup database every hour), but say that it was in production.. - it could cause a big mess.

I have not been able to repeat this under 5.0.54 or above.  Previously, I was able to repeat it in under 5 minutes regularly, and I let a test run for 4 hours without repeating it.  

So it seems to me that it is indeed fixed, or at least the issue my test case was revealing was fixed.

See bug #12691: Exec_master_log_pos corrupted with SQL_SLAVE_SKIP_COUNTER