Description:
Taking the network down on a master or slave in a MySQL Fabric group hangs Fabric for more than 17 minutes before it fails over and promotes a slave to be the new master.
Fabric detects the failure immediately, as its log shows:
[DEBUG] 1414170168.675818 - Executor-2 - Executing _report_failure
[DEBUG] 1414170168.675931 - Executor-2 - Statement (SELECT server_uuid, server_address, mode, status, weight, group_id FROM servers WHERE server_uuid = %s, Params(('3da4beeb-5af3-11e4-9943-024271b2de0a',)).
[WARNING] 1414170168.676645 - Executor-2 - Reported issue (FAULTY) for server (3da4beeb-5af3-11e4-9943-024271b2de0a).
[DEBUG] 1414170168.676775 - Executor-2 - Statement (INSERT INTO error_log (server_uuid, reported, reporter, error) VALUES(%s, %s, %s, %s), Params(('3da4beeb-5af3-11e4-9943-024271b2de0a', datetime.datetime(2014, 10, 24, 17, 2, 48), 'FailureDetector(tv_f_seg1)', 'FAULTY')).
[DEBUG] 1414170168.677313 - Executor-2 - Statement (UPDATE servers SET status = %s WHERE server_uuid = %s, Params((0, '3da4beeb-5af3-11e4-9943-024271b2de0a')).
[DEBUG] 1414170168.677813 - Executor-2 - Triggering event SERVER_LOST in handler <mysql.fabric.events.Handler object at 0x1df0e50>
[DEBUG] 1414170168.677898 - Executor-2 - Triggering event SERVER_LOST
[DEBUG] 1414170257.891030 - Thread-22 - purged 1 expired clients
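The log timestamps are Unix epoch seconds, so the timing can be read directly from the excerpt. The sketch below uses only values copied from those lines; the constant and function names are illustrative and are not part of Fabric. It shows that the failure report and the SERVER_LOST trigger are about 2 ms apart, and that the next entry appears roughly 89 seconds later. Treating the epoch values as UTC is an assumption, although it agrees with the datetime written to error_log.

```python
from datetime import datetime, timezone

# Epoch timestamps copied from the log excerpt above (names are illustrative).
FAILURE_REPORTED = 1414170168.675818  # "Executing _report_failure"
SERVER_LOST = 1414170168.677898       # "Triggering event SERVER_LOST"
NEXT_ENTRY = 1414170257.891030        # "purged 1 expired clients"


def to_utc(epoch_seconds):
    """Convert a log timestamp (epoch seconds) to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(epoch_seconds, tz=timezone.utc)


def gap_seconds(start, end):
    """Elapsed seconds between two log timestamps."""
    return end - start


if __name__ == "__main__":
    print("failure reported at:", to_utc(FAILURE_REPORTED).isoformat())
    print("report -> SERVER_LOST: %.6f s" % gap_seconds(FAILURE_REPORTED, SERVER_LOST))
    print("SERVER_LOST -> next entry: %.2f s" % gap_seconds(SERVER_LOST, NEXT_ENTRY))
```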
However, it fails to promote the online slave to master, and the slave is still reported as live:
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
... until Fabric finally returns after 17 minutes.
How to repeat:
Run ifconfig eth0 down on a master or slave.
Run mysqlfabric group health tv_f_seg1 on the fabric server.
It will eventually mark the down server as FAULTY, promote the slave, and return:
[root]# time mysqlfabric group health tv_f_seg1
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1
uuid                                 is_alive status    is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
3da4beeb-5af3-11e4-9943-024271b2de0a 0        FAULTY    0              0                 0              0               False    False
90022134-5ac8-11e4-982d-0a81eb8569e2 1        SECONDARY 0              1                 0              0               False    False
issue
-----
real 17m34.483s
user 0m0.108s
sys 0m0.012s
Suggested fix:
Honor the unreachable_timeout value set in the [servers] section of fabric.cfg.
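To illustrate what honoring such a timeout would mean, here is a minimal Python sketch of a reachability probe that gives up after a bounded number of seconds instead of waiting on the operating system's TCP defaults. It is not Fabric's actual code: the function name is_reachable and its parameters are hypothetical, and a real check would likely go through the MySQL connector rather than a raw socket.

```python
import socket


def is_reachable(host, port, timeout_seconds):
    """Return True if a TCP connection to (host, port) succeeds within the timeout.

    Hypothetical helper: timeout_seconds stands in for the configured
    unreachable_timeout value.
    """
    try:
        # create_connection applies the timeout to the connect attempt itself,
        # so an unreachable host fails after timeout_seconds rather than hanging.
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        # Covers timeouts, refused connections, and unreachable networks.
        return False
```

A caller could then evaluate something like is_reachable(server_host, 3306, 5.0) before deciding whether to report a server as FAULTY, where server_host and the 5-second value are placeholders.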