MySQL Bugs: #73110: mysqlrpladmin does not check errant transactions

Bug #73110	mysqlrpladmin does not check errant transactions
Submitted:	25 Jun 2014 9:38	Modified:	16 Jul 2014 4:52
Reporter:	Stephane Combaudon	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Utilities	Severity:	S1 (Critical)
Version:	1.4.3	OS:	Any
Assigned to:	Paulo Jesus	CPU Architecture:	Any

Description:
Although mysqlrpladmin is supposed to check errant transactions if --force is not used, it doesn't do so, making it dangerous to use for any GTID-based failover/switchover.

How to repeat:
1) Set up a master and 2 slaves, using GTID replication.
Here is my config:
$ mysqlrpladmin --master=root@localhost:13001 --discover-slaves-login=root --quiet health
# Discovering slave at localhost:13002
# Found slave: localhost:13002
# Discovering slave at localhost:13003
# Found slave: localhost:13003
+------------+--------+---------+--------+------------+---------+
| host       | port   | role    | state  | gtid_mode  | health  |
+------------+--------+---------+--------+------------+---------+
| localhost  | 13001  | MASTER  | UP     | ON         | OK      |
| localhost  | 13002  | SLAVE   | UP     | ON         | OK      |
| localhost  | 13003  | SLAVE   | UP     | ON         | OK      |
+------------+--------+---------+--------+------------+---------+

2) Create an errant transaction on localhost:13003 and remove the corresponding entry in the binlogs:
mysql> CREATE DATABASE test2;

mysql> flush logs;
Query OK, 0 rows affected (0,02 sec)

mysql> show binary logs;
+------------------+-----------+
| Log_name         | File_size |
+------------------+-----------+
| mysql-bin.000001 |     69309 |
| mysql-bin.000002 |   1237667 |
| mysql-bin.000003 |       617 |
| mysql-bin.000004 |       231 |
+------------------+-----------+

sb3> purge binary logs to 'mysql-bin.000004';

3) Switch over to localhost:13003 using mysqlrpladmin:
$ mysqlrpladmin --master=root@127.0.0.1:13001 --new-master=root@127.0.0.1:13003 --discover-slaves-login=root --demote-master --rpl-user=repl:repl --verbose switchover
# Discovering slaves for master at 127.0.0.1:13001
# Discovering slave at localhost:13002
# Found slave: localhost:13002
# Discovering slave at localhost:13003
# Found slave: localhost:13003
# Checking privileges.
# WARNING: You may be mixing host names and IP addresses. This may result in negative status reporting if your DNS services do not support reverse name lookup.
# Performing switchover from master at 127.0.0.1:13001 to slave at 127.0.0.1:13003.
# Checking candidate slave prerequisites.
# GTID_MODE=ON is set for all servers.
# Checking eligibility of slave 127.0.0.1:13003 for candidate.
#   Slave connected to master ... Ok
#   GTID_MODE=ON ... Ok
#   Logging filters agree ... Ok
#   Replication user exists ... Ok
# Checking slaves configuration to master.
# When the master_info_repository variable is set to FILE, the --rpl-user option may be used only if the user specified matches what is shown in the SLAVE STATUS output unless the --force option is used.
# Creating replication user if it does not exist.
# Blocking writes on master.
# LOCK STRING: FLUSH TABLES WITH READ LOCK
# Waiting for slaves to catch up to old master.
# Slave localhost:13002:
# QUERY = SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('53c30a4c-fc4b-11e3-b006-9cebe8067a3f:1-2', 300)
# Return Code = 0
# Slave localhost:13003:
# QUERY = SELECT WAIT_UNTIL_SQL_THREAD_AFTER_GTIDS('53c30a4c-fc4b-11e3-b006-9cebe8067a3f:1-2', 300)
# Return Code = 0
# Stopping slaves.
# Performing STOP on all slaves.
#   Executing stop on slave localhost:13002 Ok
#   Executing stop on slave localhost:13003 Ok
# UNLOCK STRING: UNLOCK TABLES
# Demoting old master to be a slave to the new master.
# Switching slaves to new master.
# Executing CHANGE MASTER on localhost:13002.
# CHANGE MASTER TO MASTER_HOST = '127.0.0.1', MASTER_USER = 'repl', MASTER_PASSWORD = 'repl', MASTER_PORT = 13003, MASTER_AUTO_POSITION=1
# Executing CHANGE MASTER on 127.0.0.1:13001.
# CHANGE MASTER TO MASTER_HOST = '127.0.0.1', MASTER_USER = 'repl', MASTER_PASSWORD = 'repl', MASTER_PORT = 13003, MASTER_AUTO_POSITION=1
# Starting all slaves.
# Performing START on all slaves.
#   Executing start on slave localhost:13002 Ok
#   Executing start on slave 127.0.0.1:13001 Ok
# Checking slaves for errors.
# localhost:13002 status: Ok 
# 127.0.0.1:13001 status: Ok 
# Switchover complete.
# Attempting to contact 127.0.0.1 ... Success
# Attempting to contact localhost ... Success
# Attempting to contact 127.0.0.1 ... Success
#
# Replication Topology Health:
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------+-----------------+------------+-------------+--------------+------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+---------------+
| host       | port   | role    | state  | gtid_mode  | health                                                                                                                                                                                                                                                                                              | version     | master_log_file   | master_log_pos  | IO_Thread  | SQL_Thread  | Secs_Behind  | Remaining_Delay  | IO_Error_Num  | IO_Error                                                                                                                                                                                                                        | SQL_Error_Num  | SQL_Error  | Trans_Behind  |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------+-----------------+------------+-------------+--------------+------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+---------------+
| 127.0.0.1  | 13003  | MASTER  | UP     | ON         | OK                                                                                                                                                                                                                                                                                                  | 5.6.19-log  | mysql-bin.000004  | 570             |            |             |              |                  |               |                                                                                                                                                                                                                                 |                |            |               |
| 127.0.0.1  | 13001  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  | 5.6.19-log  |                   | 4               | No         | Yes         | 0            | No               | 1236          | Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'  | 0              |            | 1             |
| localhost  | 13002  | SLAVE   | UP     | ON         | IO thread is not running., Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.', Slave has 1 transactions behind master.  | 5.6.19-log  |                   | 4               | No         | Yes         | 0            | No               | 1236          | Got fatal error 1236 from master when reading data from binary log: 'The slave is connecting using CHANGE MASTER TO MASTER_AUTO_POSITION = 1, but the master has purged binary logs containing GTIDs that the slave requires.'  | 0              |            | 1             |
+------------+--------+---------+--------+------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+-------------+-------------------+-----------------+------------+-------------+--------------+------------------+---------------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------+------------+---------------+
# ...done.

Now replication is broken

Hello. We need a bit more information to help diagnose this issue. 

- What version of the server are you using? 
- I see you stated you are using Utilities 1.4.3. Which platform are you using to run the utilities? The servers?

We define errant transactions as transactions that are active in the topology on one or more slaves but not on the master. Purging the binary log is an aberration that does not fall into the definition of an errant transaction. I think what you are describing is a lost or dropped transaction which may require additional features to detect and report. Regardless, we will attempt to reproduce the problem once we get the above information.

Hello Charles,

MySQL version is 5.6.19, and everything (all 3 servers and MySQL Utilities) is running on a single Ubuntu 14.04 box.

Purging the binary log is not an aberration at all, as it emulates the following scenario:
1) you write a transaction on a slave for some reason
2) because of expire_logs_days, the binary log containing the transaction is purged after some time
3) you promote the slave to be the new master

If the tool you use for slave promotion is not able to identify the offending transaction, it will create a replication error. So it has to be detected.

FWIW, I've also tested the same scenario but without purging the binary log (thus really creating what you call an errant transaction), and it's not detected.

Confirm the scenario presented. Errant transactions are not checked on switchover, only failover. Thus, this functionality is missing.

Quickly looking at the code, mysqlrpladmin calls the find_errant_transactions function like mysqlfailover does.

The check works well with mysqlfailover, not with mysqlrpladmin.

I'm not familiar enough with the code to see where it fails though.

Fixed as of the upcoming MySQL Utilities/Fabric 1.4.4 release, and here's the changelog entry:

The check for an errant transaction was not performed for
mysqlrpladmin's switchover command, it only performed for the failover
command. Both commands now check for errant transactions. 

The errant transaction check requires all servers in the topology to have
GTID enabled, otherwise a warning is issued indicating that the check
cannot be performed for the switchover command.

Thank you for the bug report.