MySQL Bugs: #75886: Automatic failover succeeded, but manual failover failed

Bug #75886	Automatic failover succeeded, but manual failover failed
Submitted:	13 Feb 2015 3:51	Modified:	9 Apr 2015 12:31
Reporter:	lao zhao
Status:	Won't fix
Category:	MySQL Fabric: High Availability	Severity:	S3 (Non-critical)
Version:	mysqlfabric 1.5.3	OS:	Linux (CentOS release 6.4 (Final))
Assigned to:		CPU Architecture:	Any
Tags:	usability

Description:
mysqlfabric 1.5.3
mysql server: mysql5.6.15-log

Automatic failover succeeded, but manual failover failed.

How to repeat:
# mysqlfabric group deactivate my_group

Current topology:
10.58.46.143:7306(master)
10.58.46.143:7307(slave)
10.58.46.143:7308(slave)
10.58.46.143:7309(slave)

Shut down the MySQL master.

# mysqlfabric group lookup_servers my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306   PRIMARY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 SECONDARY  READ_ONLY    1.0

# mysqlfabric group health my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running                                                                                  io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- ----------------------------------------------------------------------------------------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0                                                                                     False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 2     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 2     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 2     False

issue
-----

Manual failover:

# mysqlfabric group promote my_group --slave_id=b445213a-a538-11e4-bd92-40a8f01f6f10
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: Server (b445213a-a538-11e4-bd92-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 19", 'io_not_running': True, 'sql_not_running': False, 'is_not_configured': False, 'is_not_running': False}).

# mysqlfabric group promote my_group --slave_id=ddfec00f-a540-11e4-bdc7-40a8f01f6f10
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: Server (ddfec00f-a540-11e4-bdc7-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 20", 'io_not_running': True, 'sql_not_running': False, 'is_not_configured': False, 'is_not_running': False}).

# mysqlfabric group promote my_group --slave_id=b77574f3-b25b-11e4-933c-40a8f01f6f10
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: Server (b77574f3-b25b-11e4-933c-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 20", 'io_not_running': True, 'sql_not_running': False, 'is_not_configured': False, 'is_not_running': False}).

# mysqlfabric group promote my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

GroupError: There is no valid candidate that can be automatically chosen in group (my_group). Please, choose one manually.

All manual failover attempts failed.

# mysqlfabric group demote my_group   
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

DatabaseError: 2003: Can't connect to MySQL server on '10.58.46.143:7306' (111 Connection refused)

On slave 10.58.46.143:7309 (slave_id=b77574f3-b25b-11e4-933c-40a8f01f6f10), executed: stop slave;

# mysqlfabric group promote my_group --slave_id=b77574f3-b25b-11e4-933c-40a8f01f6f10
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: Server (b77574f3-b25b-11e4-933c-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 986", 'io_not_running': True, 'sql_not_running': True, 'is_not_configured': False, 'is_not_running': False}).

# mysqlfabric group promote my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

GroupError: There is no valid candidate that can be automatically chosen in group (my_group). Please, choose one manually.

On slave 10.58.46.143:7309 (slave_id=b77574f3-b25b-11e4-933c-40a8f01f6f10), executed: reset slave all;

# mysqlfabric group promote my_group --slave_id=b77574f3-b25b-11e4-933c-40a8f01f6f10
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

ServerError: Server (b77574f3-b25b-11e4-933c-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': False, 'io_not_running': False, 'sql_not_running': False, 'is_not_configured': True, 'is_not_running': False}).

# mysqlfabric group promote my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

GroupError: There is no valid candidate that can be automatically chosen in group (my_group). Please, choose one manually.

# mysqlfabric group health my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running                                                                                    io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- ------------------------------------------------------------------------------------------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0                                                                                       False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 996     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 1              0               0                                                                                       False     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 996     False

issue
-----

# mysqlfabric group lookup_servers my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306   PRIMARY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 SECONDARY  READ_ONLY    1.0

# mysqlfabric group activate my_group

No problem here: the group fails over automatically.

# mysqlfabric group health my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0    False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              0               0    False     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 0              0               0    False     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        1   PRIMARY              0                 0              0               0    False     False

issue
-----

# mysqlfabric group lookup_servers my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306    FAULTY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308   PRIMARY READ_WRITE    1.0

Now 10.58.46.143:7308 is the master.

Hi Shenju,

Thank you for the bug report. In the first case, where you do a fail-over manually, the servers have for some reason failed to connect to the master (it looks like a privilege problem), but in the second case the SQL and I/O threads are running with no errors.

It would be good if you could attach the Fabric log file so that it is possible to figure out why you have the errors in the first case.

fabric_failed.log

Attachment: fabric_failed.log (application/octet-stream, text), 38.49 KiB.

I'm sorry that I did not follow up on this topic for quite a long time. I repeated the test.

# mysqlfabric group deactivate my_group

Shut down the primary node 10.58.46.143:7306 with mysqladmin shutdown.

# mysqlfabric group lookup_servers my_group 
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e 
Time-To-Live: 1 

                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306   PRIMARY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308 SECONDARY  READ_ONLY    1.0

# mysqlfabric group health my_group 
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e 
Time-To-Live: 1 

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running                                                                                  io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- ----------------------------------------------------------------------------------------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0                                                                                     False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 1     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 1     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60  retries: 1     False

issue 
-----

# mysqlfabric group promote my_group 
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e 
Time-To-Live: 1 

GroupError: There is no valid candidate that can be automatically chosen in group (my_group). Please, choose one manually. 

# mysqlfabric group promote my_group --slave_id=b445213a-a538-11e4-bd92-40a8f01f6f10 
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e 
Time-To-Live: 1 

ServerError: Server (b445213a-a538-11e4-bd92-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 8", 'io_not_running': True, 'sql_not_running': False, 'is_not_configured': False, 'is_not_running': False}).

I think the manual failover failed because the I/O thread on every slave is interrupted. But since the current primary has failed, an interrupted I/O thread is expected.
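
The rejection reason in the ServerError above is printed as a Python dictionary, so the flags that block the promotion can be read off programmatically. A minimal sketch, assuming Python 3 and the message format shown in this report; failing_flags is a hypothetical helper, not part of Fabric:

```python
import ast
import re


def failing_flags(error_message):
    """Return the names of the health flags that are set in a
    'not a valid candidate slave' message (hypothetical helper)."""
    # The reason is printed as "({...})"; capture the dictionary part.
    match = re.search(r"\((\{.*\})\)", error_message)
    if match is None:
        raise ValueError("no reason dictionary found in message")
    reason = ast.literal_eval(match.group(1))
    # A flag counts as set when it is True or carries an error string.
    return sorted(name for name, value in reason.items() if value)


message = '''ServerError: Server (b445213a-a538-11e4-bd92-40a8f01f6f10) is not a valid candidate slave due to the following reason: ({'sql_error': False, 'io_error': u"error reconnecting to master 'user_fabric@10.58.46.143:7306' - retry-time: 60 retries: 8", 'io_not_running': True, 'sql_not_running': False, 'is_not_configured': False, 'is_not_running': False}).'''

print(failing_flags(message))  # ['io_error', 'io_not_running']
```

Only the I/O thread flags are set here, which is what one would expect on a slave whose master has just been shut down.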

See the attachment (fabric_failed.log) for the corresponding log information.

# mysqlfabric group activate my_group 

# mysqlfabric group lookup_servers my_group 
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e 
Time-To-Live: 1 

                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306    FAULTY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308   PRIMARY READ_WRITE    1.0

# mysqlfabric group health my_group 
Fabric UUID: 5ca1ab1e-a007-feed-f00d-cab3fe13249e 
Time-To-Live: 1 

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0    False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              0               0    False     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 0              0               0    False     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        1   PRIMARY              0                 0              0               0    False     False

issue 
----- 

Automatic failover was fast and elected 10.58.46.143:7308 as the primary node.

Personally, I think the code paths for automatic and manual failover are different, and the conditions they check seem to differ as well.

It should not be a privilege problem, because automatic and manual failover use exactly the same privileges.

In deactivated mode, I wanted to try the following steps:
step 1: mysqlfabric server set_status failed_old_master_server_id FAULTY
step 2: mysqlfabric group promote my_group
But this failed.

The current state, in deactivated mode:
# mysqlfabric group lookup_groups
group_id description failure_detector                          master_uuid
-------- ----------- ---------------- ------------------------------------
my_group        None                0 ddfec00f-a540-11e4-bdc7-40a8f01f6f10

# mysqlfabric group lookup_servers my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1
                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306    FAULTY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308   PRIMARY READ_WRITE    1.0

# mysqlfabric group health my_group        
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1
                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- -------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0    False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              0               0    False     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 0              0               0    False     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        1   PRIMARY              0                 0              0               0    False     False

issue
-----

Shut down the current primary, 10.58.46.143:7308, with mysqladmin shutdown.

# mysqlfabric group health my_group 
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                                uuid is_alive    status is_not_running is_not_configured io_not_running sql_not_running                                                                                  io_error sql_error
------------------------------------ -------- --------- -------------- ----------------- -------------- --------------- ----------------------------------------------------------------------------------------- ---------
1d2a312e-a539-11e4-bd95-40a8f01f6f10        0    FAULTY              0                 0              0               0                                                                                     False     False
b445213a-a538-11e4-bd92-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7308' - retry-time: 60  retries: 2     False
b77574f3-b25b-11e4-933c-40a8f01f6f10        1 SECONDARY              0                 0              1               0 error reconnecting to master 'user_fabric@10.58.46.143:7308' - retry-time: 60  retries: 2     False
ddfec00f-a540-11e4-bdc7-40a8f01f6f10        0    FAULTY              0                 0              0               0                                                                                     False     False

issue
-----

# mysqlfabric group lookup_servers my_group
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1

                         server_uuid           address    status       mode weight
------------------------------------ ----------------- --------- ---------- ------
1d2a312e-a539-11e4-bd95-40a8f01f6f10 10.58.46.143:7306    FAULTY READ_WRITE    1.0
b445213a-a538-11e4-bd92-40a8f01f6f10 10.58.46.143:7307 SECONDARY  READ_ONLY    1.0
b77574f3-b25b-11e4-933c-40a8f01f6f10 10.58.46.143:7309 SECONDARY  READ_ONLY    1.0
ddfec00f-a540-11e4-bdc7-40a8f01f6f10 10.58.46.143:7308   PRIMARY READ_WRITE    1.0

# mysqlfabric server set_status  ddfec00f-a540-11e4-bdc7-40a8f01f6f10 FAULTY
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1
ServerError: If you want to set a server (ddfec00f-a540-11e4-bdc7-40a8f01f6f10) to faulty, please, use the threat.report_faulty interface.

# mysqlfabric server set_status  ddfec00f-a540-11e4-bdc7-40a8f01f6f10 FAULTY --update_only
Fabric UUID:  5ca1ab1e-a007-feed-f00d-cab3fe13249e
Time-To-Live: 1
ServerError: If you want to set a server (ddfec00f-a540-11e4-bdc7-40a8f01f6f10) to faulty, please, use the threat.report_faulty interface.

Thanks shenju,

There is a reason for the difference: when doing an automatic fail-over, we do not want to accidentally promote a server that is fully functional, so the failure detector contacts the master a few times before declaring it dead, and then executes a fail-over.

The key point here is that once the server is deemed dead, the failure detector does the equivalent of:

    mysqlfabric threat report_failure <server>
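
To illustrate the behaviour just described, here is a rough sketch of a detector that reports a failure only after several consecutive failed checks. The class, the ping and report_failure callables, and the threshold of 3 are assumptions made for illustration, not Fabric's actual implementation:

```python
class FailureDetector:
    """Toy detector: report a server only after repeated failed checks."""

    def __init__(self, ping, report_failure, max_failures=3):
        self.ping = ping                      # callable(server) -> bool
        self.report_failure = report_failure  # callable(server)
        self.max_failures = max_failures      # assumed threshold
        self.failures = 0
        self.reported = False

    def check(self, server):
        if self.reported:
            return
        if self.ping(server):
            # A successful check means the earlier failures were transient.
            self.failures = 0
            return
        self.failures += 1
        if self.failures >= self.max_failures:
            self.reported = True
            self.report_failure(server)


# One transient failure, a recovery, then three failures in a row.
answers = iter([False, True, False, False, False])
reported = []
detector = FailureDetector(lambda server: next(answers), reported.append)
for _ in range(5):
    detector.check("master")
print(reported)  # ['master']
```

The single failed check at the start does not trigger anything; only the run of three consecutive failures does.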

The promote and demote operations are, on the other hand, expected to be done on fully functional servers, so before selecting a candidate to promote, the status of the slaves and the master is checked and an attempt is made to synchronize the slaves with the master. Since the master is faulty, but not yet marked faulty at this point, the command fails.

The workaround is to use the above command instead to mark the faulty master as faulty and trigger a fail-over, but I think the main issue here is that the commands are not very clear.

It should be possible to execute a promote even if the master is faulty, perhaps requiring a special option to force health checks and trigger a fail-over if the master is deemed dead. Note that using this option would then potentially make the promote take a long time.
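
One way such an option could behave is sketched below. This is purely hypothetical: the Group class, the promote function, its force and checks parameters, and the is_alive callable are invented for illustration and do not exist in Fabric.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    master: str
    slaves: list
    faulty: set = field(default_factory=set)  # servers marked faulty


def promote(group, is_alive, force=False, checks=3):
    """Toy promote: is_alive is a callable(server) -> bool."""
    if group.master not in group.faulty and not is_alive(group.master):
        if not force:
            raise RuntimeError("master is unreachable but not marked faulty")
        # Forced path: re-check a few times before declaring the master dead.
        if any(is_alive(group.master) for _ in range(checks)):
            raise RuntimeError("master answered a re-check; failure was transient")
        group.faulty.add(group.master)
    candidates = [s for s in group.slaves
                  if s not in group.faulty and is_alive(s)]
    if not candidates:
        raise RuntimeError("no valid candidate")
    new_master = candidates[0]
    group.slaves = [s for s in group.slaves if s != new_master]
    group.master = new_master
    return new_master


group = Group(master="m", slaves=["a", "b"])
alive = lambda server: server != "m"   # the master is down
print(promote(group, alive, force=True))  # a
```

Without force=True the same call raises an error, mirroring the situation in this report; the extra re-checks are also why a forced promote could take longer.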

Note also that the "group health" command just does a cursory check of the servers, so even if it reports a server as faulty, the failure could be transient, and it should not trigger a fail-over.

Thank you very much, this is the correct way.

Status updated to 'Won't fix' (Fabric is now covered under Oracle Lifetime Sustaining Support)