Bug #75886 | successed to failover automatically,but failed to failover manually | ||
---|---|---|---|
Submitted: | 13 Feb 2015 3:51 | Modified: | 9 Apr 2015 12:31 |
Reporter: | lao zhao | Email Updates: | |
Status: | Won't fix | Impact on me: | |
Category: | MySQL Fabric: High Availability | Severity: | S3 (Non-critical) |
Version: | mysqlfabric 1.5.3 | OS: | Linux (CentOS release 6.4 (Final)) |
Assigned to: | | CPU Architecture: | Any |
Tags: | usability |
[13 Feb 2015 3:51]
lao zhao
[13 Feb 2015 7:03]
lao zhao
# mysqlfabric group activate my_group

No problem here: the group fails over automatically.

# mysqlfabric group health my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

uuid                                 is_alive status    is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 0        FAULTY    0              0                 0              0               False    False
b445213a-a538-11e4-bd92-40a8f01f6f10 1        SECONDARY 0              0                 0              0               False    False
b77574f3-b25b-11e4-933c-40a8f01f6f10 1        SECONDARY 0              0                 0              0               False    False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 1        PRIMARY   0              0                 0              0               False    False

issue
-----

# mysqlfabric group lookup_servers my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

server_uuid                          address           status    mode       weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306 FAULTY    READ_WRITE 1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY READ_ONLY  1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY READ_ONLY  1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 PRIMARY   READ_WRITE 1.0

Now 10.58.46.143:7308 is the master.
[16 Feb 2015 13:45]
Mats Kindahl
Hi Shenju,

Thank you for the bug report. In the first case, where you do a fail-over manually, the servers have for some reason failed to connect to the master (it looks like a privilege problem), but in the second case the SQL and I/O threads are running with no errors. It would be good if you could attach the Fabric log file so that we can figure out why you get the errors in the first case.
[3 Mar 2015 8:08]
lao zhao
fabric_failed.log
Attachment: fabric_failed.log (application/octet-stream, text), 38.49 KiB.
[3 Mar 2015 8:09]
lao zhao
I'm sorry for not following up on this report for so long. I repeated the test.

# mysqlfabric group deactivate my_group
# mysqladmin shutdown   (on the primary node, 10.58.46.143:7306)

# mysqlfabric group lookup_servers my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

server_uuid                          address           status    mode       weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306 PRIMARY   READ_WRITE 1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY READ_ONLY  1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY READ_ONLY  1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 SECONDARY READ_ONLY  1.0

# mysqlfabric group health my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

uuid                                 is_alive status    is_not_running is_not_configured io_not_running sql_not_running io_error                                                                                  sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- ----------------------------------------------------------------------------------------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 0        FAULTY    0              0                 0              0               False                                                                                     False
b445213a-a538-11e4-bd92-40a8f01f6f10 1        SECONDARY 0              0                 1              0               error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 1  False
b77574f3-b25b-11e4-933c-40a8f01f6f10 1        SECONDARY 0              0                 1              0               error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 1  False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 1        SECONDARY 0              0                 1              0               error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 1  False

issue
-----

# mysqlfabric group promote my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

GroupError: There is no valid candidate that can be automatically chosen in group (my_group). Please, choose one manually.

# mysqlfabric group promote my_group --slave_id=b445213a-a538-11e4-bd92-40a8f01f6f10
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: Server (b445213a-a538-11e4-bd92-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 8", 'io_not_running': True, 'sql_not_running': False, 'is_not_configured': False, 'is_not_running': False}).

I think the manual failover failed because the I/O thread on every slave is stopped. But since the current primary has failed, a stopped I/O thread is expected. See the attached log (fabric_failed.log) for the corresponding details.

# mysqlfabric group activate my_group

# mysqlfabric group lookup_servers my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

server_uuid                          address           status    mode       weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306 FAULTY    READ_WRITE 1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY READ_ONLY  1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY READ_ONLY  1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 PRIMARY   READ_WRITE 1.0

# mysqlfabric group health my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

uuid                                 is_alive status    is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 0        FAULTY    0              0                 0              0               False    False
b445213a-a538-11e4-bd92-40a8f01f6f10 1        SECONDARY 0              0                 0              0               False    False
b77574f3-b25b-11e4-933c-40a8f01f6f10 1        SECONDARY 0              0                 0              0               False    False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 1        PRIMARY   0              0                 0              0               False    False

issue
-----

Automatic failover was fast and elected 10.58.46.143:7308 as the new primary. I think automatic and manual failover take different code paths and apply different conditions when judging candidates. It should not be a privilege problem, because automatic and manual failover use exactly the same privileges.
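The situation in the health output above (one server not alive, every remaining slave with `io_not_running` set) can be spotted mechanically. The following is only a rough sketch, assuming the whitespace-separated column layout shown in this report and assuming that `sql_error` is a single token; `ServerHealth`, `parse_group_health` and `stranded_slaves` are hypothetical helper names, not part of Fabric.

```python
from typing import List, NamedTuple


class ServerHealth(NamedTuple):
    uuid: str
    is_alive: bool
    status: str
    io_not_running: bool
    sql_not_running: bool
    io_error: str


def parse_group_health(text: str) -> List[ServerHealth]:
    """Parse the server rows of `mysqlfabric group health` output."""
    rows = []
    for line in text.splitlines():
        parts = line.split(None, 7)
        # Data rows start with a 36-character UUID; this skips the header,
        # the dashed separator and the "Fabric UUID"/"Time-To-Live" lines.
        if len(parts) < 8 or len(parts[0]) != 36 or parts[0].count("-") != 4:
            continue
        uuid, alive, status, _not_run, _not_conf, io_nr, sql_nr, rest = parts
        # Assumption: sql_error is one token, so everything before the last
        # token of the remainder is the io_error message.
        io_error = rest.rsplit(None, 1)[0]
        rows.append(ServerHealth(uuid, alive == "1", status,
                                 io_nr == "1", sql_nr == "1", io_error))
    return rows


def stranded_slaves(rows: List[ServerHealth]) -> List[str]:
    """UUIDs of live servers whose I/O thread stopped while a server is down."""
    if all(r.is_alive for r in rows):
        return []
    return [r.uuid for r in rows if r.is_alive and r.io_not_running]


# Rows taken from the deactivated-group test in this report.
SAMPLE = """\
uuid is_alive status is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 0 FAULTY 0 0 0 0 False False
b445213a-a538-11e4-bd92-40a8f01f6f10 1 SECONDARY 0 0 1 0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 1 False
b77574f3-b25b-11e4-933c-40a8f01f6f10 1 SECONDARY 0 0 1 0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 1 False
"""

print(stranded_slaves(parse_group_health(SAMPLE)))
```

A non-empty result here only says the slaves lost their master; it does not by itself prove the master is permanently dead.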
[3 Mar 2015 16:02]
lao zhao
With the group deactivated, I wanted to try the following:

step1: mysqlfabric server set_status failed_old_master_server_id FAULTY
step2: mysqlfabric group promote my_group

But it failed. The current state, with the group deactivated:

# mysqlfabric group lookup_groups
group_id description failure_detector master_uuid
-------- ----------- ---------------- ------------------------------------
my_group None        0                ddfec00f-a540-11e4-bdc7-40a8f01f6f10

# mysqlfabric group lookup_servers my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

server_uuid                          address           status    mode       weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306 FAULTY    READ_WRITE 1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY READ_ONLY  1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY READ_ONLY  1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 PRIMARY   READ_WRITE 1.0

# mysqlfabric group health my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

uuid                                 is_alive status    is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 0        FAULTY    0              0                 0              0               False    False
b445213a-a538-11e4-bd92-40a8f01f6f10 1        SECONDARY 0              0                 0              0               False    False
b77574f3-b25b-11e4-933c-40a8f01f6f10 1        SECONDARY 0              0                 0              0               False    False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 1        PRIMARY   0              0                 0              0               False    False

issue
-----

Then I ran mysqladmin shutdown on the current primary, 10.58.46.143:7308.

# mysqlfabric group health my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

uuid                                 is_alive status    is_not_running is_not_configured io_not_running sql_not_running io_error                                                                                  sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- ----------------------------------------------------------------------------------------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 0        FAULTY    0              0                 0              0               False                                                                                     False
b445213a-a538-11e4-bd92-40a8f01f6f10 1        SECONDARY 0              0                 1              0               error reconnecting to master 'user_fabric@10.58.46.143:7308' - retry-time: 60 retries: 2  False
b77574f3-b25b-11e4-933c-40a8f01f6f10 1        SECONDARY 0              0                 1              0               error reconnecting to master 'user_fabric@10.58.46.143:7308' - retry-time: 60 retries: 2  False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 0        FAULTY    0              0                 0              0               False                                                                                     False

issue
-----

# mysqlfabric group lookup_servers my_group
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

server_uuid                          address           status    mode       weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306 FAULTY    READ_WRITE 1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY READ_ONLY  1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY READ_ONLY  1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 PRIMARY   READ_WRITE 1.0

# mysqlfabric server set_status ddfec00f-a540-11e4-bdc7-40a8f01f6f10 FAULTY
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: If you want to set a server (ddfec00f-a540-11e4-bdc7-40a8f01f6f10) to faulty, please, use the threat.report_faulty interface.

# mysqlfabric server set_status ddfec00f-a540-11e4-bdc7-40a8f01f6f10 FAULTY --update_only
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: If you want to set a server (ddfec00f-a540-11e4-bdc7-40a8f01f6f10) to faulty, please, use the threat.report_faulty interface.
[24 Mar 2015 8:50]
Mats Kindahl
Thanks shenju,

There is a reason for the difference: when doing an automatic fail-over, we do not want to accidentally fail over from a master that is fully functional, so the failure detector contacts the master a few times before declaring it dead, and then executes a fail-over. The key point here is that once the server is deemed dead, the failure detector does the equivalent of a:

mysqlfabric threat report_failure <server>

The promote and demote operations are, on the other hand, expected to be done on fully functional servers, so before selecting a candidate to promote, the status of the slaves and the master is checked and an attempt is made to synchronize the slaves with the master. Since the master has failed, but is not marked faulty at this point, the command will fail.

The workaround is to use the above command instead to mark the failed master as faulty and trigger a fail-over, but I think the main issue here is that the commands are not very clear. It should be possible to execute a promote even if the master is faulty, perhaps requiring a special option to force health checks and trigger a fail-over if the master is deemed dead. Note that using this option could then make the promote take a long time.

Note also that the "group health" command just does a cursory check of the servers, so even if it reports a server as faulty, it could be a transient failure, so it should not trigger a fail-over.
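To make the described sequence concrete, here is a minimal sketch of a "probe a few times, then report" loop. It is not Fabric's actual failure detector: the function names, the number of attempts, the sleep interval and the `is_alive` probe are all assumptions chosen for illustration. The only piece taken from this thread is the `mysqlfabric threat report_failure <server>` command line.

```python
import subprocess
import time
from typing import Callable, List


def report_failure_command(server_uuid: str) -> List[str]:
    """Command line quoted in this thread for marking a server as failed."""
    return ["mysqlfabric", "threat", "report_failure", server_uuid]


def detect_and_report(
    server_uuid: str,
    is_alive: Callable[[], bool],
    attempts: int = 3,        # assumed value, not Fabric's real setting
    interval: float = 1.0,    # assumed seconds between probes
    run: Callable[[List[str]], object] = subprocess.call,
    sleep: Callable[[float], object] = time.sleep,
) -> bool:
    """Probe the server several times; report it only if every probe fails.

    Returns True if the failure was reported, False if any probe succeeded
    (the outage is then treated as transient and nothing is reported).
    """
    for attempt in range(attempts):
        if is_alive():
            return False
        if attempt + 1 < attempts:
            sleep(interval)
    run(report_failure_command(server_uuid))
    return True
```

The `is_alive` probe is left to the caller (for example, a connection attempt to the master) so that a single failed check, like the cursory one done by "group health", never triggers a fail-over on its own.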
[9 Apr 2015 12:31]
lao zhao
Thank you very much, this is the correct way.
[6 Jul 2017 19:19]
Bugs System
Status updated to 'Won't fix' (Fabric is now covered under Oracle Lifetime Sustaining Support)