Bug #73873 Locking error in the failure detector
Submitted: 10 Sep 2014 9:44
Modified: 30 Sep 2014 0:49
Reporter: Alfranio Junior
Status: Closed
Category: MySQL Fabric
Severity: S2 (Serious)
Version:
OS: Any
Assigned to:
CPU Architecture: Any

[10 Sep 2014 9:44] Alfranio Junior
Description:
The failure detector triggers a failover operation that must not
run concurrently with other operations that update the same group.
To guarantee this, Fabric relies on a locking mechanism that
prevents concurrent updates to a group.

Locks are acquired before an execution starts and are released at
its completion. Currently, Fabric serializes all update operations
due to problems in the mechanism that determines which groups should
be locked. One single lockable object is used: set(['lock']). See
BUG#72553: BUG#18712020 for further details.

However, the failure detector was specifying a set of groups as
lockable objects instead of this single object, so a failover
operation could run while a group was being updated by another
operation, leading to unpredictable results.
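
The conflict described above can be illustrated with a small sketch. It
is not Fabric's lock manager: the LockTable class, its methods, and the
group name "g1" are hypothetical, and the only detail taken from this
report is the single lockable object set(['lock']). The sketch assumes
that two operations conflict only when their sets of lockable objects
intersect.

```python
# Hypothetical sketch, not Fabric's actual lock manager.
class LockTable(object):
    """Tracks which lockable objects are currently held."""

    def __init__(self):
        self._held = set()

    def try_acquire(self, lockable_objects):
        """Acquire all requested objects or none; return True on success."""
        if self._held & lockable_objects:
            return False  # at least one requested object is already held
        self._held |= lockable_objects
        return True

    def release(self, lockable_objects):
        """Release the given objects."""
        self._held -= lockable_objects


GLOBAL_LOCK = set(['lock'])  # the single lockable object named in the report

table = LockTable()

# An ordinary update on group "g1" holds the global lock.
update_ok = table.try_acquire(GLOBAL_LOCK)

# Before the fix: the failure detector asks for the group itself.
# The sets are disjoint, so the failover is admitted while the update runs.
failover_by_group = table.try_acquire(set(['g1']))

# After the fix: the failover asks for the same global lock and must wait.
failover_by_global = table.try_acquire(GLOBAL_LOCK)
```

Under that assumption, a request for set(['g1']) never collides with a
holder of set(['lock']), which is the race this report describes.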

How to repeat:
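Check the code: in lib/mysql/fabric/failure_detector.py, the
REPORT_FAILURE trigger passes set([self.__group_id]) as the lockable
objects.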

Suggested fix:
=== modified file 'lib/mysql/fabric/failure_detector.py'
--- lib/mysql/fabric/failure_detector.py	revid:alfranio.correia@oracle.com-20140909162546-2titgr69irzhm7ia
+++ lib/mysql/fabric/failure_detector.py	2014-09-10 01:26:44 +0000
@@ -184,8 +184,8 @@
                             server, get_time()
                         )
                         if unstable and can_set_faulty:
-                            procedures = trigger("REPORT_FAILURE",
-                                set([self.__group_id]), str(server.uuid),
+                            procedures = trigger("REPORT_FAILURE", None,
+                                str(server.uuid),
                                 threading.current_thread().name,
                                 MySQLServer.FAULTY, False
                             )
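
The patch passes None where set([self.__group_id]) used to be. The
report does not show how trigger() treats None, so the following is
only an assumption: that a missing set of lockable objects falls back to
the single object set(['lock']). The function and constant names are
hypothetical and are not taken from the Fabric source.

```python
# Hypothetical sketch of a default for lockable objects; the names
# below are assumptions, not Fabric's API.
DEFAULT_LOCKABLE_OBJECTS = frozenset(['lock'])


def resolve_lockable_objects(lockable_objects=None):
    """Return the set of objects to lock, defaulting to the global lock."""
    if lockable_objects is None:
        return set(DEFAULT_LOCKABLE_OBJECTS)
    return set(lockable_objects)


# With None, the failover would compete for the same lock as every
# other update operation.
failover_locks = resolve_lockable_objects(None)
group_locks = resolve_lockable_objects(set(['g1']))
```

If the fallback works this way, the failover is serialized with all
other update operations, matching the behaviour described in the report.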
[30 Sep 2014 0:49] Philip Olson
Fixed as of the MySQL Utilities 1.5.2 release, and here's the changelog entry:

The failure detector was specifying a set of groups as lockable objects,
and a failover operation could potentially run while a group was being
updated by another operation which could lead to unpredictable results.

Thank you for the bug report.