MySQL Bugs: #72016: Automatic Failover is not happening when global group master is failed

Bug #72016	Automatic Failover is not happening when global group master is failed
Submitted:	12 Mar 2014 4:15	Modified:	26 Mar 2014 4:12
Reporter:	Rudra Patra
Status:	Closed
Category:	MySQL Fabric	Severity:	S2 (Serious)
Version:		OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
Crash test using the killVM method, which simulates unplugging the power cable.

Steps:

I have 6 groups and 3 shards (the global group is my_group4).

I killed the VM that hosts the masters of both my_group1 and my_group4.

I restarted the VM and all servers.

Then I ran lookup_servers and did not see a new master.

my_group4 lookup_servers:

Command :
{ success     = True
  return      = [{'status': 'SECONDARY', 'server_uuid': '2830361c-a926-11e3-91da-0021f6fab222', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven07:18016'}, {'status': 'SECONDARY', 'server_uuid': '29664d9b-a926-11e3-91da-0021f6fab223', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven08:18017'}, {'status': 'FAULTY', 'server_uuid': '2af70228-a926-11e3-91da-0021f6fab221', 'mode': 'READ_WRITE', 'weight': 1.0, 'address': 'kven06:18015'}, {'status': 'SECONDARY', 'server_uuid': '2b0a43a0-a926-11e3-91da-0021f6fab224', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven09:18018'}]
  activities  =
}
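
To make the symptom explicit, here is a minimal Python sketch that scans the my_group4 server list shown above for a master. The helper name `summarize_group` is hypothetical and not part of MySQL Fabric; the sketch also assumes that a group's master is reported with status 'PRIMARY'.

```python
# Server list copied from the my_group4 lookup_servers output above.
servers = [
    {'status': 'SECONDARY', 'server_uuid': '2830361c-a926-11e3-91da-0021f6fab222',
     'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven07:18016'},
    {'status': 'SECONDARY', 'server_uuid': '29664d9b-a926-11e3-91da-0021f6fab223',
     'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven08:18017'},
    {'status': 'FAULTY', 'server_uuid': '2af70228-a926-11e3-91da-0021f6fab221',
     'mode': 'READ_WRITE', 'weight': 1.0, 'address': 'kven06:18015'},
    {'status': 'SECONDARY', 'server_uuid': '2b0a43a0-a926-11e3-91da-0021f6fab224',
     'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven09:18018'},
]


def summarize_group(servers):
    """Hypothetical helper: return (master addresses, faulty addresses)."""
    # Assumption: the master of a group is listed with status 'PRIMARY'.
    primaries = [s['address'] for s in servers if s['status'] == 'PRIMARY']
    faulty = [s['address'] for s in servers if s['status'] == 'FAULTY']
    return primaries, faulty


primaries, faulty = summarize_group(servers)
print("masters:", primaries)  # empty list: no server was promoted
print("faulty:", faulty)      # the old master on kven06 is still marked FAULTY
```

The empty master list alongside one FAULTY entry in READ_WRITE mode is the state this report describes: the old master was detected as failed, but no secondary took over.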

The activate command for my_group4 ran fine:

Procedure :
{ uuid        = 0b742ac9-6dac-44d9-ae75-9659d2ff94e1,
  finished    = True,
  success     = True,
  return      = True,
  activities  =
}

I removed the faulty server from my_group1 and added it again. When I then ran promote, I got the error below.

my_group1 lookup_servers:

Command :
{ success     = True
  return      = [{'status': 'SECONDARY', 'server_uuid': '274da6ba-a926-11e3-91da-0021f6fab222', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven07:18004'}, {'status': 'SECONDARY', 'server_uuid': '2936cfb4-a926-11e3-91da-0021f6fab223', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven08:18005'}, {'status': 'SECONDARY', 'server_uuid': '2a6c28e8-a926-11e3-91da-0021f6fab221', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven06:18003'}, {'status': 'SECONDARY', 'server_uuid': '2a8270bf-a926-11e3-91da-0021f6fab224', 'mode': 'READ_ONLY', 'weight': 1.0, 'address': 'kven09:18006'}]
  activities  =
}

promote:

Procedure :
{ uuid        = 0ca4727f-11a0-411c-948f-c3a8c8bf0248,
  finished    = True,
  success     = False,
  return      = GroupError: Group master not running my_group4,
  activities  =
} 

The failover solution does not handle the case where both the global group
and a shard group have a failed master.
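
The dependency implied by the promote error can be illustrated with a toy model. This is only a sketch inferred from the error message, not MySQL Fabric's implementation: the `promote` function, the `GroupError` class, and the "first SECONDARY wins" election rule are all assumptions made for illustration, and the server lists are abbreviated from the outputs above.

```python
class GroupError(Exception):
    """Stand-in for the error type named in the promote output."""


def has_master(servers):
    # Assumption: a running master is reported with status 'PRIMARY'.
    return any(s['status'] == 'PRIMARY' for s in servers)


def promote(group_id, groups, global_group):
    """Toy promote: a shard group needs a running global group master."""
    if group_id != global_group and not has_master(groups[global_group]):
        raise GroupError("Group master not running %s" % global_group)
    # Hypothetical election rule: take the first SECONDARY in the list.
    for server in groups[group_id]:
        if server['status'] == 'SECONDARY':
            server['status'] = 'PRIMARY'
            server['mode'] = 'READ_WRITE'
            return server['address']
    raise GroupError("No candidate available in %s" % group_id)


# Abbreviated state taken from the two lookup_servers outputs above.
groups = {
    'my_group4': [
        {'status': 'SECONDARY', 'mode': 'READ_ONLY', 'address': 'kven07:18016'},
        {'status': 'FAULTY', 'mode': 'READ_WRITE', 'address': 'kven06:18015'},
    ],
    'my_group1': [
        {'status': 'SECONDARY', 'mode': 'READ_ONLY', 'address': 'kven07:18004'},
        {'status': 'SECONDARY', 'mode': 'READ_ONLY', 'address': 'kven06:18003'},
    ],
}

# Promoting the shard group first fails, as in the report.
try:
    promote('my_group1', groups, 'my_group4')
    first_error = None
except GroupError as err:
    first_error = str(err)
print(first_error)

# Restoring a master in the global group first lets the shard promote succeed.
global_master = promote('my_group4', groups, 'my_group4')
shard_master = promote('my_group1', groups, 'my_group4')
print(global_master, shard_master)
```

Under this model, the shard group can only recover once the global group has a master again, which is why the two failures landing together left my_group1 without one.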

How to repeat:
See above.

Thank you for the bug report. Verified as described.

This is fixed as of the upcoming 1.4.2 release, and the changelog entry reads as:

        Failover was not handled properly when both the global and shard
        groups had a failed master.

Thank you for the bug report.