MySQL Bugs: #83684: mysqlfailover fails to run CHANGE MASTER against 5.7.16 server

Bug #83684	mysqlfailover fails to run CHANGE MASTER against 5.7.16 server
Submitted:	4 Nov 2016 1:43	Modified:	28 Nov 2016 20:01
Reporter:	Owen Owen	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Utilities	Severity:	S2 (Serious)
Version:	1.6.4	OS:	CentOS (7.2)
Assigned to:		CPU Architecture:	Any

Description:
In a 3 node cluster with 1 master and 2 slaves, all running MySQL 5.7.16 on CentOS 7.2 - mysqlfailover does not failover.

mysqlfailover log:
2016-11-03 20:55:31 PM INFO Master may be down. Waiting for 3 seconds.
2016-11-03 20:55:58 PM INFO Failed to reconnect to the master after 3 attemps.
2016-11-03 20:55:58 PM CRITICAL Master is confirmed to be down or unreachable.
2016-11-03 20:55:58 PM INFO Failover starting in 'auto' mode...
2016-11-03 20:55:58 PM INFO Candidate slave 172.17.240.24:3307 will become the new master.
2016-11-03 20:55:58 PM INFO Checking slaves status (before failover).
2016-11-03 20:55:58 PM INFO Preparing candidate for failover.
2016-11-03 20:55:58 PM INFO Creating replication user if it does not exist.
2016-11-03 20:55:58 PM INFO Stopping slaves.
2016-11-03 20:55:58 PM INFO Performing STOP on all slaves.
2016-11-03 20:55:58 PM INFO Switching slaves to new master.
2016-11-03 20:55:58 PM INFO Unregistering instance on master.

Corresponding MySQL log.
2016-11-04T00:55:58.608175Z 67 [Note] Error reading relay log event for channel '': slave SQL thread was killed
2016-11-04T00:55:58.608507Z 66 [ERROR] Slave I/O for channel '': error reconnecting to master 'repl@172.17.48.26:3307' - retry-time: 60 retries: 1, Error_code: 2003
2016-11-04T00:55:58.608517Z 66 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
2016-11-04T00:55:58.608523Z 66 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql-bin.000006', position 194
2016-11-04T00:55:58.619206Z 502 [Note] Aborted connection 502 to db: 'unconnected' user: 'root' host: '172.17.176.18' (Got an error reading communication packets)

In my observation, what happens is that the server.py supports_gtid() method attempts to use exec_query to run SELECT @@GLOBAL.GTID_MODE. However, the connection is aborted. (See the last line of MySQL error log) Since the connection is aborted, supports_gtid() ends up returning "NO." This means that replication.py make_change_master ends up using MASTER_LOG_FILE = instead of MASTER_AUTO_POSITION=1. This causes line 2208 of replication.py to throw an exception because since the server is using GTID, MASTER_LOG_FILE in CHANGE MASTER fails.

The end result is failover with mysqlfailover and 5.7.16 is broken. I can reproduce 100% of the time on servers with 5.7.16 but servers with 5.7.15 have no issues.

How to repeat:
Create a 3 node cluster with 1 master and 2 slaves on CentOS 7.2 and MySQL 5.7.16. Use iptables to cutoff the master from the slaves. Observe mysqlfailover not work.

Suggested fix:
Find out why MySQL aborts the connection that is used to find out if GTID is on.

I modified replication.py to always use MASTER_AUTO_POSITION=1 and the issue persisted.  It seems that something is killing the slaves connection before it is used.

I added the following block to server.py exec_query() after the assert statement and it seems to work around the issue.

if not self.db_conn.is_connected():
    logging.log(logging.INFO, "Re-establishing connection to execute query {0}: {1}".format(query_str, self.host))
    self.db_conn = self.get_connection()

Please specify the version of MySQL Connector/Python you are using.

mysql-connector-python-2.1.4-1.el7.x86_64

Hello  Owen,

Thank you for your feedback.
Could you please try with mysql-connector-python-2.1.3 version?

Thanks,
Chiranjeevi.

I redid my test with mysql-connector-python-2.1.3 and failover worked correctly.

Hello  Owen,

Thank you for your feedback.
This is most likely duplicate of Bug #83864, please see Bug #83864.

Thanks,
Chiranjeevi.

I do not have access to view Bug #83864