Bug #72016 Automatic Failover is not happening when global group master is failed
Submitted: 12 Mar 2014 4:15
Modified: 26 Mar 2014 4:12
Reporter: Rudra Patra
Status: Closed
Category: MySQL Fabric
Severity: S2 (Serious)
Version:
OS: Linux
Assigned to:
CPU Architecture: Any

[12 Mar 2014 4:15] Rudra Patra
Description:
Crash test using the killVM method, which simulates unplugging the power cable.

Steps:

I have 6 groups and 3 shards (the global group is my_group4).

I killed the VM that hosts the masters of both my_group1 and my_group4.

I restarted the VM and all servers.

Then I checked lookup_servers and did not see a new master.

my_group4 lookup_servers:

Command :
{ success     = True
  return      = [{'status': 'SECONDARY', 'server_uuid': '2830361c-a926-11e3-91da-0021f6fab222', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven07:18016'}, {'status': 'SECONDARY', 'server_uuid': '29664d9b-a926-11e3-91da-0021f6fab223', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven08:18017'}, {'status': 'FAULTY', 'server_uuid': '2af70228-a926-11e3-91da-0021f6fab221', 'mode': 'READ_WRITE', 'weight': 1.0, 'address': 'kven06:18015'}, {'status': 'SECONDARY', 'server_uuid': '2b0a43a0-a926-11e3-91da-0021f6fab224', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven09:18018'}]
  activities  =
}
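
A minimal sketch of how the output above can be checked for a master. The server list is copied from the report and trimmed to the fields used. The helper names (find_master, faulty_addresses) are hypothetical, and the sketch assumes a master is reported with status 'PRIMARY', a value that does not appear anywhere in this output.

```python
# Servers of my_group4 as returned by lookup_servers above,
# reduced to the 'status' and 'address' fields.
servers = [
    {'status': 'SECONDARY', 'address': 'kven07:18016'},
    {'status': 'SECONDARY', 'address': 'kven08:18017'},
    {'status': 'FAULTY', 'address': 'kven06:18015'},
    {'status': 'SECONDARY', 'address': 'kven09:18018'},
]


def find_master(servers):
    """Return the first server reported as PRIMARY, or None.

    Assumption: a group master is listed with status 'PRIMARY'.
    """
    for server in servers:
        if server['status'] == 'PRIMARY':
            return server
    return None


def faulty_addresses(servers):
    """Return the addresses of all servers reported as FAULTY."""
    return [s['address'] for s in servers if s['status'] == 'FAULTY']


master = find_master(servers)
faulty = faulty_addresses(servers)
print('master:', master)
print('faulty:', faulty)
```

Under that assumption, no server in my_group4 qualifies as master, and the only faulty entry is the old master on kven06:18015.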

The activate command for my_group4 ran fine:

Procedure :
{ uuid        = 0b742ac9-6dac-44d9-ae75-9659d2ff94e1,
  finished    = True,
  success     = True,
  return      = True,
  activities  =
}

I removed the faulty server from my_group1 and added it again. When I ran promote, I got the error below.

my_group1 lookup_servers

Command :
{ success     = True
  return      = [{'status': 'SECONDARY', 'server_uuid': '274da6ba-a926-11e3-91da-0021f6fab222', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven07:18004'}, {'status': 'SECONDARY', 'server_uuid': '2936cfb4-a926-11e3-91da-0021f6fab223', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven08:18005'}, {'status': 'SECONDARY', 'server_uuid': '2a6c28e8-a926-11e3-91da-0021f6fab221', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven06:18003'}, {'status': 'SECONDARY', 'server_uuid': '2a8270bf-a926-11e3-91da-0021f6fab224', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven09:18006'}]
  activities  =
}

promote:

Procedure :
{ uuid        = 0ca4727f-11a0-411c-948f-c3a8c8bf0248,
  finished    = True,
  success     = False,
  return      = GroupError: Group master not running my_group4,
  activities  =
} 

The failover solution does not handle the case where both the global
group and a shard group have a failed master.
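
The situation can be modelled with a small sketch. It is not Fabric code: has_master and promotion_order are hypothetical helpers, the status lists are abbreviated from the lookup_servers output in this report, and the ordering rule (global group first) is only an assumption inferred from the "Group master not running my_group4" error returned when promoting my_group1.

```python
def has_master(servers):
    """True if any server is reported as PRIMARY (assumed master status)."""
    return any(s['status'] == 'PRIMARY' for s in servers)


def promotion_order(groups, global_group):
    """List the groups without a master, with the global group first.

    Assumption: a shard group cannot be promoted while the global
    group has no running master, so the global group must go first.
    """
    missing = [name for name, servers in groups.items()
               if not has_master(servers)]
    # sorted() is stable: the global group (key False) moves to the front,
    # the remaining groups keep their original relative order.
    return sorted(missing, key=lambda name: name != global_group)


# Statuses taken from the two lookup_servers outputs in this report.
groups = {
    'my_group1': [{'status': 'SECONDARY'}] * 4,
    'my_group4': [{'status': 'SECONDARY'}] * 3 + [{'status': 'FAULTY'}],
}

order = promotion_order(groups, 'my_group4')
print(order)
```

With the data from this report, both groups lack a master and the sketch places my_group4 ahead of my_group1.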

How to repeat:
see above
[14 Mar 2014 13:11] Alfranio Tavares Correia Junior
Thank you for the bug report. Verified as described.
[26 Mar 2014 4:12] Philip Olson
This is fixed as of the upcoming 1.4.2 release, and the changelog entry reads as:

        Failover was not handled properly when both the global and shard
        groups had a failed master.

Thank you for the bug report.