MySQL Bugs: #74555: MySQL Fabric hangs on network out on a master or slave

Bug #74555	MySQL Fabric hangs on network out on a master or slave
Submitted:	24 Oct 2014 17:32	Modified:	12 Dec 2014 23:06
Reporter:	cindy .
Status:	Closed
Category:	MySQL Fabric	Severity:	S1 (Critical)
Version:	1.5.2	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
Taking the network down on a master or slave in a MySQL Fabric group hangs Fabric for more than 17 minutes before it fails over and promotes a slave to the new master.

Fabric detects the failure immediately, as the error log shows:

[DEBUG] 1414170168.675818 - Executor-2 - Executing _report_failure
[DEBUG] 1414170168.675931 - Executor-2 - Statement (SELECT server_uuid, server_address, mode, status, weight, group_id FROM servers WHERE server_uuid = %s, Params(('3da4beeb-5af3-11e4-9943-024271b2de0a',)).
[WARNING] 1414170168.676645 - Executor-2 - Reported issue (FAULTY) for server (3da4beeb-5af3-11e4-9943-024271b2de0a).
[DEBUG] 1414170168.676775 - Executor-2 - Statement (INSERT INTO error_log (server_uuid, reported, reporter, error) VALUES(%s, %s, %s, %s), Params(('3da4beeb-5af3-11e4-9943-024271b2de0a', datetime.datetime(2014, 10, 24, 17, 2, 48), 'FailureDetector(tv_f_seg1)', 'FAULTY')).
[DEBUG] 1414170168.677313 - Executor-2 - Statement (UPDATE servers SET status = %s WHERE server_uuid = %s, Params((0, '3da4beeb-5af3-11e4-9943-024271b2de0a')).
[DEBUG] 1414170168.677813 - Executor-2 - Triggering event SERVER_LOST in handler <mysql.fabric.events.Handler object at 0x1df0e50>
[DEBUG] 1414170168.677898 - Executor-2 - Triggering event SERVER_LOST
[DEBUG] 1414170257.891030 - Thread-22 - purged 1 expired clients

However, it fails to promote the online slave to master, and the slave is still reported as live:

            Slave_IO_Running: Yes
            Slave_SQL_Running: Yes

... until Fabric finally returns after 17 minutes.

How to repeat:
Run ifconfig eth0 down on a master or slave.

Run mysqlfabric group health tv_f_seg1 on the fabric server.

It eventually marks the down server as FAULTY, promotes the slave, and returns:

[root]# time mysqlfabric group health tv_f_seg1
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
3da4beeb-5af3-11e4-9943-024271b2de0a        0    FAULTY              0                 0              0               0    False     False
90022134-5ac8-11e4-982d-0a81eb8569e2        1 SECONDARY              0                 1              0               0    False     False

issue
-----

real	17m34.483s
user	0m0.108s
sys	0m0.012s

Suggested fix:
Honor unreachable_timeout in [servers] set in fabric.cfg.
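
As a rough illustration of what honoring that setting could look like, the sketch below reads unreachable_timeout from a [servers] section and uses it to bound a plain TCP reachability check. This is not Fabric's code: the value 5, the helper names load_unreachable_timeout and is_reachable, and the idea of checking reachability with a bare TCP connect are all assumptions made for the example.

```python
import configparser
import socket

# Hypothetical fabric.cfg fragment; the value 5 (seconds) is an assumption.
SAMPLE_CFG = """
[servers]
unreachable_timeout = 5
"""


def load_unreachable_timeout(cfg_text, default=5.0):
    """Return [servers] unreachable_timeout as a float, or default if absent."""
    parser = configparser.ConfigParser()
    parser.read_string(cfg_text)
    return parser.getfloat("servers", "unreachable_timeout", fallback=default)


def is_reachable(host, port, timeout):
    """Try a TCP connect that gives up after `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, unreachable networks, and timeouts.
        return False
```

With a bounded connect like this, a check against a host whose interface is down would give up after the configured number of seconds instead of waiting for the operating system's TCP timeout.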

Verified as described.

The problem here is that any function that tracks whether a server is reachable (e.g. the built-in failure detector) should kill all connections to a server once it is considered faulty, thereby indirectly aborting and unblocking any command that is accessing the server.

This is not happening, though, so commands may stay blocked until the TCP/IP or MySQL connection times out.
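
The idea of unblocking commands by killing connections can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the actual Fabric fix: ConnectionRegistry, blocked_command, and demo are hypothetical names, and a local socket pair stands in for a MySQL connection to a server that has stopped answering.

```python
import socket
import threading
import time


class ConnectionRegistry:
    """Hypothetical registry of open connections, keyed by server UUID."""

    def __init__(self):
        self._lock = threading.Lock()
        self._connections = {}

    def register(self, server_uuid, sock):
        with self._lock:
            self._connections.setdefault(server_uuid, []).append(sock)

    def kill_all(self, server_uuid):
        """Shut down every connection to a server so blocked readers wake up."""
        with self._lock:
            socks = self._connections.pop(server_uuid, [])
        for sock in socks:
            try:
                sock.shutdown(socket.SHUT_RDWR)
            except OSError:
                pass  # already closed or disconnected
        return len(socks)


def blocked_command(sock, result):
    """Stand-in for a command waiting on a server that never answers."""
    try:
        data = sock.recv(1024)  # blocks until data arrives or the socket is shut down
        result.append("aborted" if data == b"" else "data")
    except OSError:
        result.append("aborted")


def demo():
    registry = ConnectionRegistry()
    client, silent_peer = socket.socketpair()  # the peer never sends anything
    uuid = "3da4beeb-5af3-11e4-9943-024271b2de0a"
    registry.register(uuid, client)

    result = []
    worker = threading.Thread(target=blocked_command, args=(client, result))
    worker.start()
    time.sleep(0.2)                   # give the worker time to block in recv()
    killed = registry.kill_all(uuid)  # what a failure detector could do on FAULTY
    worker.join(timeout=5)
    unblocked = not worker.is_alive()
    client.close()
    silent_peer.close()
    return killed, unblocked, result
```

Calling demo() starts a worker that blocks reading from the silent peer, then shuts the connection down from the main thread. The worker returns promptly instead of waiting for a transport-level timeout, which is the effect the comment above asks the failure detector to produce.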

Posted by developer:
 
Fixed as of the upcoming MySQL Fabric 1.6.1 release, and here's the changelog entry:

Triggering a network down on a master or slave in a MySQL Fabric group
would hang Fabric for an extended amount of time before it failed over and
promoted a slave to a new master.

Thank you for the bug report.