MySQL Bugs: #76856: mysqlfailover not fail over to slave for continue master failing

Bug #76856	mysqlfailover not fail over to slave for continue master failing
Submitted:	27 Apr 2015 17:02	Modified:	2 Jan 2017 9:55
Reporter:	Benjamin Lin
Status:	No Feedback
Category:	MySQL Utilities	Severity:	S3 (Non-critical)
Version:	MySQL 5.6.23 Enterprise	OS:	Linux (CentOS 6)
Assigned to:		CPU Architecture:	Any

Description:
When I use the mysqlfailover tool and specify multiple candidates on the command line, it fails over only once, when the first master fails. When I test further and force the new master (promoted after the first failure) to fail as well, the tool stops with the following error:

Failed to reconnect to the master after 3 attemps.

Failover starting in 'auto' mode...
2015-04-27 12:03:44 PM CRITICAL Can't connect to MySQL server on 'mysqlslave2:3307' (111 Connection refused)
ERROR: Can't connect to MySQL server on 'mysqlslave2:3307' (111 Connection refused)

In daemon mode, the log shows:
2015-04-27 12:46:17 PM INFO Master may be down. Waiting for 3 seconds.
2015-04-27 12:46:32 PM INFO Failed to reconnect to the master after 3 attemps.
2015-04-27 12:46:32 PM CRITICAL Master is confirmed to be down or unreachable.
2015-04-27 12:46:32 PM INFO Failover starting in 'auto' mode...
2015-04-27 12:46:32 PM INFO Unregistering instance on master

However, if I don't specify the --candidates option, the tool keeps failing over to a healthy slave each time the current master fails: the first master fails and failover succeeds, then the new master fails too and failover succeeds again.

How to repeat:
Here are my command options:
mysqlfailover --master=root:VsonSql@mysqlmaster:3306 --discover-slaves-login=root:xxx --candidates=root:xxx@mysqlslave1:3307,root:xxx@mysqlslave2:3307,root:xxx@mysqlslave2:3307

I tested it both with and without --daemon=start.
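The --candidates value above lists mysqlslave2:3307 twice. Whether that is a transcription slip or has any bearing on the failure is not known from this report. As a sanity check before rerunning the test, a small helper along these lines could flag repeated endpoints. It is only a sketch: the function names are made up for illustration, the 3306 default port is an assumption, and the socket form of a connection string is not handled.

```python
from collections import Counter


def parse_candidates(value):
    """Split a --candidates value into (user, host, port) tuples.

    Simplified: handles user[:password]@host[:port] only, not sockets.
    """
    parsed = []
    for item in value.split(","):
        # Split on the last '@' so a password containing '@' is tolerated.
        credentials, _, endpoint = item.strip().rpartition("@")
        user = credentials.split(":", 1)[0]
        host, _, port = endpoint.partition(":")
        # Assumption: fall back to the usual MySQL port when none is given.
        parsed.append((user, host, int(port) if port else 3306))
    return parsed


def duplicate_endpoints(value):
    """Return host:port endpoints listed more than once, in first-seen order."""
    counts = Counter((host, port) for _, host, port in parse_candidates(value))
    return [f"{host}:{port}" for (host, port), n in counts.items() if n > 1]


candidates = (
    "root:xxx@mysqlslave1:3307,"
    "root:xxx@mysqlslave2:3307,"
    "root:xxx@mysqlslave2:3307"
)
print(duplicate_endpoints(candidates))  # ['mysqlslave2:3307']
```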

I have observed similar behavior. In my case, if I specify "auto" mode with a candidates list, the tool does not fail over at all when the master is shut down. It registers that the master went down, but it never picks another candidate. If I remove the candidates list, "auto" failover selects a slave from the registry of detected slaves and fails over to it successfully. The candidates list works properly in "elect" mode, but not in "auto" mode.
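For reference, here is how I understand the two modes are meant to differ according to the mysqlfailover documentation: "auto" tries the candidates list first and then falls back to the other discovered slaves, while "elect" considers the candidates only. The sketch below is a simplified model of that expectation, not the tool's actual code; pick_new_master, is_viable and the "up" set are hypothetical names used for illustration. Under this model an unreachable candidate would simply be skipped in "auto" mode, which is not what the reports above describe.

```python
def pick_new_master(mode, candidates, discovered, is_viable):
    """Return the server to promote, or None if no server qualifies.

    Simplified model of the documented modes:
      auto  - try candidates first, then the remaining discovered slaves
      elect - try candidates only
    """
    if mode not in ("auto", "elect"):
        raise ValueError(f"unsupported mode: {mode}")
    pool = list(candidates)
    if mode == "auto":
        # Fall back to discovered slaves that were not already candidates.
        pool += [s for s in discovered if s not in candidates]
    for server in pool:
        if is_viable(server):
            return server
    return None


up = {"mysqlslave1:3307"}  # hypothetical: only this slave is reachable
candidates = ["mysqlslave2:3307"]
discovered = ["mysqlslave1:3307", "mysqlslave2:3307"]


def viable(server):
    return server in up


print(pick_new_master("auto", candidates, discovered, viable))   # mysqlslave1:3307
print(pick_new_master("elect", candidates, discovered, viable))  # None
```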

I have the same problem, but I am using only the --discover-slaves-login option, without a candidates list:

mysqlfailover --master=manager:password@acs-sql-101 --discover-slaves-login=manager:password --daemon=start --log=/var/log/mysql-failover --rpl-user=repl:password --verbose

Version: MySQL Utilities mysqlfailover version 1.5.6

The log output:

Before running "service mysql stop":

2016-08-17 19:52:01 PM INFO Discovering slaves for master at acs-sql-101:3306
2016-08-17 19:52:01 PM INFO Discovering slave at acs-sql-201:3306
2016-08-17 19:52:01 PM INFO Discovering slave at acs-sql-202:3306
2016-08-17 19:52:01 PM INFO Master Information
2016-08-17 19:52:01 PM INFO Binary Log File: acs-sql-101.000003, Position: 3783556, Binlog_Do_DB: N/A, Binlog_Ignore_DB: N/A
2016-08-17 19:52:01 PM INFO GTID Executed Set: b4ef9624-0627-11e6-bd65-daea6468c513:1-744906
2016-08-17 19:52:01 PM INFO Getting health for master: acs-sql-101:3306.
2016-08-17 19:52:01 PM INFO Health Status:
2016-08-17 19:52:01 PM INFO host: acs-sql-101, port: 3306, role: MASTER, state: UP, gtid_mode: ON, health: OK, version: 5.7.14-log, master_log_file: acs-sql-101.000003, master_log_pos: 3783556, IO_Thread: , SQL_Thread: , Secs_Behind: , Remaining_Delay: , IO_Error_Num: , IO_Error: , SQL_Error_Num: , SQL_Error: , Trans_Behind: 
2016-08-17 19:52:01 PM INFO host: acs-sql-201, port: 3306, role: SLAVE, state: UP, gtid_mode: ON, health: OK, version: 5.7.14-log, master_log_file: acs-sql-101.000003, master_log_pos: 3783556, IO_Thread: Yes, SQL_Thread: Yes, Secs_Behind: 0, Remaining_Delay: No, IO_Error_Num: 0, IO_Error: , SQL_Error_Num: 0, SQL_Error: , Trans_Behind: 0
2016-08-17 19:52:01 PM INFO host: acs-sql-202, port: 3306, role: SLAVE, state: UP, gtid_mode: ON, health: OK, version: 5.7.14-log, master_log_file: acs-sql-101.000003, master_log_pos: 3783556, IO_Thread: Yes, SQL_Thread: Yes, Secs_Behind: 0, Remaining_Delay: No, IO_Error_Num: 0, IO_Error: , SQL_Error_Num: 0, SQL_Error: , Trans_Behind: 0

After the stop command:

2016-08-17 19:52:25 PM INFO Master may be down. Waiting for 3 seconds.
2016-08-17 19:52:40 PM INFO Failed to reconnect to the master after 3 attemps.
2016-08-17 19:52:40 PM CRITICAL Master is confirmed to be down or unreachable.
2016-08-17 19:52:40 PM INFO Failover starting in 'auto' mode...
2016-08-17 19:52:40 PM INFO Checking eligibility of slave acs-sql-201:3306 for candidate.
2016-08-17 19:52:40 PM INFO GTID_MODE=ON ... Ok
2016-08-17 19:52:40 PM INFO Replication user exists ... Ok
2016-08-17 19:52:40 PM INFO Unregistering instance on master.

Before this test, I ran the same test without the --rpl-user option, which showed:

2016-08-17 19:12:30 PM INFO Failover starting in 'auto' mode...
2016-08-17 19:12:30 PM INFO Unregistering instance on master.
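In every log excerpt in this thread, the last entry after "Failover starting" is "Unregistering instance on master" and nothing follows it. A script like the one below could check a log file for that pattern when rerunning the test. It is a rough sketch that assumes the line layout seen in the excerpts here (date, 24-hour time, an AM/PM marker, level, message); parse_line and failover_stalled are hypothetical helper names.

```python
import re
from datetime import datetime

# Layout assumed from the excerpts: "<date> <time> <AM|PM> <LEVEL> <message>"
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?:AM|PM) (\w+) (.*)$"
)


def parse_line(line):
    """Return (timestamp, level, message), or None for non-matching lines."""
    match = LINE_RE.match(line.strip())
    if not match:
        return None
    stamp, level, message = match.groups()
    # The hour is already 24-hour in these logs, so AM/PM is ignored.
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), level, message


def failover_stalled(lines):
    """True if failover started and the log ends at the unregistering step."""
    entries = [e for e in map(parse_line, lines) if e]
    started = any(msg.startswith("Failover starting") for _, _, msg in entries)
    return (
        started
        and bool(entries)
        and entries[-1][2].startswith("Unregistering instance on master")
    )


log = """\
2016-08-17 19:12:30 PM INFO Failover starting in 'auto' mode...
2016-08-17 19:12:30 PM INFO Unregistering instance on master.
""".splitlines()
print(failover_stalled(log))  # True
```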

Hello Benjamin,

Thank you for the bug report.
Could you please try with Connector/Python 2.1.3 (not 2.1.4), MySQL Utilities 1.6.4, and MySQL 5.7 GA, and let us know whether you still see the issue? If so, please provide complete, repeatable steps (a sample test case, configuration file, etc.; mark them as private if you prefer) so that we can confirm the issue at our end.

Thanks,
Chiranjeevi.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".