Bug #83684 mysqlfailover fails to run CHANGE MASTER against 5.7.16 server
Submitted: 4 Nov 2016 1:43 Modified: 28 Nov 2016 20:01
Reporter: Owen Owen Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Utilities Severity:S2 (Serious)
Version:1.6.4 OS:CentOS (7.2)
Assigned to: CPU Architecture:Any

[4 Nov 2016 1:43] Owen Owen
Description:
In a 3 node cluster with 1 master and 2 slaves, all running MySQL 5.7.16 on CentOS 7.2 - mysqlfailover does not failover.

mysqlfailover log:
2016-11-03 20:55:31 PM INFO Master may be down. Waiting for 3 seconds.
2016-11-03 20:55:58 PM INFO Failed to reconnect to the master after 3 attemps.
2016-11-03 20:55:58 PM CRITICAL Master is confirmed to be down or unreachable.
2016-11-03 20:55:58 PM INFO Failover starting in 'auto' mode...
2016-11-03 20:55:58 PM INFO Candidate slave 172.17.240.24:3307 will become the new master.
2016-11-03 20:55:58 PM INFO Checking slaves status (before failover).
2016-11-03 20:55:58 PM INFO Preparing candidate for failover.
2016-11-03 20:55:58 PM INFO Creating replication user if it does not exist.
2016-11-03 20:55:58 PM INFO Stopping slaves.
2016-11-03 20:55:58 PM INFO Performing STOP on all slaves.
2016-11-03 20:55:58 PM INFO Switching slaves to new master.
2016-11-03 20:55:58 PM INFO Unregistering instance on master.

Corresponding MySQL log.
2016-11-04T00:55:58.608175Z 67 [Note] Error reading relay log event for channel '': slave SQL thread was killed
2016-11-04T00:55:58.608507Z 66 [ERROR] Slave I/O for channel '': error reconnecting to master 'repl@172.17.48.26:3307' - retry-time: 60  retries: 1, Error_code: 2003
2016-11-04T00:55:58.608517Z 66 [Note] Slave I/O thread killed during or after a reconnect done to recover from failed read
2016-11-04T00:55:58.608523Z 66 [Note] Slave I/O thread exiting for channel '', read up to log 'mysql-bin.000006', position 194
2016-11-04T00:55:58.619206Z 502 [Note] Aborted connection 502 to db: 'unconnected' user: 'root' host: '172.17.176.18' (Got an error reading communication packets)

In my observation, what happens is that the server.py supports_gtid() method attempts to use exec_query to run SELECT @@GLOBAL.GTID_MODE.  However, the connection is aborted.  (See the last line of MySQL error log)  Since the connection is aborted, supports_gtid() ends up returning "NO."  This means that replication.py make_change_master ends up using MASTER_LOG_FILE = instead of MASTER_AUTO_POSITION=1.  This causes line 2208 of replication.py to throw an exception because since the server is using GTID, MASTER_LOG_FILE in CHANGE MASTER fails.

The end result is failover with mysqlfailover and 5.7.16 is broken.  I can reproduce 100% of the time on servers with 5.7.16 but servers with 5.7.15 have no issues.

How to repeat:
Create a 3 node cluster with 1 master and 2 slaves on CentOS 7.2 and MySQL 5.7.16.  Use iptables to cutoff the master from the slaves.  Observe mysqlfailover not work.

Suggested fix:
Find out why MySQL aborts the connection that is used to find out if GTID is on.
[4 Nov 2016 2:03] Owen Owen
I modified replication.py to always use MASTER_AUTO_POSITION=1 and the issue persisted.  It seems that something is killing the slaves connection before it is used.
[4 Nov 2016 2:28] Owen Owen
I added the following block to server.py exec_query() after the assert statement and it seems to work around the issue.

if not self.db_conn.is_connected():
    logging.log(logging.INFO, "Re-establishing connection to execute query {0}: {1}".format(query_str, self.host))
    self.db_conn = self.get_connection()
[21 Nov 2016 19:31] Chuck Bell
Please specify the version of MySQL Connector/Python you are using.
[21 Nov 2016 19:34] Owen Owen
mysql-connector-python-2.1.4-1.el7.x86_64
[22 Nov 2016 11:53] Chiranjeevi Battula
Hello  Owen,

Thank you for your feedback.
Could you please try with mysql-connector-python-2.1.3 version?

Thanks,
Chiranjeevi.
[22 Nov 2016 19:46] Owen Owen
I redid my test with mysql-connector-python-2.1.3 and failover worked correctly.
[24 Nov 2016 7:04] Chiranjeevi Battula
Hello  Owen,

Thank you for your feedback.
This is most likely duplicate of Bug #83864, please see Bug #83864.

Thanks,
Chiranjeevi.
[28 Nov 2016 20:01] Owen Owen
I do not have access to view Bug #83864