MySQL Bugs: #73873: Locking error in the failure detector

Bug #73873	Locking error in the failure detector
Submitted:	10 Sep 2014 9:44	Modified:	30 Sep 2014 0:49
Reporter:	Alfranio Tavares Correia Junior
Status:	Closed
Category:	MySQL Fabric	Severity:	S2 (Serious)
Version:		OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
The failure detector triggers a failover operation that must not
run concurrently with other operations that update the same
group. To guarantee this, Fabric relies on a locking mechanism
that prevents concurrent updates to a group.

Locks are acquired before an execution starts and are released at
its completion. Currently, Fabric serializes all update operations
due to problems in the mechanism that determines which groups should
be locked. One single lockable object is used: set(['lock']). See
BUG#72553: BUG#18712020 for further details.

However, the failure detector was specifying a set of groups as
lockable objects, so a failover operation could potentially run
while a group was being updated by another operation, leading to
unpredictable results.
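
The conflict can be illustrated with a minimal sketch. This is a toy
model, not Fabric's actual locking code: the names GLOBAL_LOCK,
lockable_objects and can_run_concurrently are hypothetical, and it
assumes that passing None as the lockable objects falls back to the
single global lock set(['lock']), and that two procedures are
serialized only when their lock sets overlap.

```python
# Toy model only: every name here is hypothetical, not Fabric's real API.
GLOBAL_LOCK = frozenset(["lock"])


def lockable_objects(requested):
    """Return a procedure's lock set; None is assumed to mean the global lock."""
    return GLOBAL_LOCK if requested is None else frozenset(requested)


def can_run_concurrently(requested_a, requested_b):
    """Two procedures may overlap only if their lock sets share no object."""
    return lockable_objects(requested_a).isdisjoint(lockable_objects(requested_b))


# Before the fix: the failure detector locks only its own group, which does
# not overlap with an update that holds the global lock.
before_fix = can_run_concurrently(set(["group-1"]), None)

# After the fix: both procedures fall back to the global lock and serialize.
after_fix = can_run_concurrently(None, None)

print(before_fix, after_fix)
```

Under these assumptions, the first call reports that the two
procedures may overlap, and the second reports that they may not.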

How to repeat:
Check the code.

Suggested fix:
=== modified file 'lib/mysql/fabric/failure_detector.py'
--- lib/mysql/fabric/failure_detector.py	revid:alfranio.correia@oracle.com-20140909162546-2titgr69irzhm7ia
+++ lib/mysql/fabric/failure_detector.py	2014-09-10 01:26:44 +0000
@@ -184,8 +184,8 @@
                             server, get_time()
                         )
                         if unstable and can_set_faulty:
-                            procedures = trigger("REPORT_FAILURE",
-                                set([self.__group_id]), str(server.uuid),
+                            procedures = trigger("REPORT_FAILURE", None,
+                                str(server.uuid),
                                 threading.current_thread().name,
                                 MySQLServer.FAULTY, False
                             )

Fixed as of the MySQL Utilities 1.5.2 release, and here's the changelog entry:

The failure detector was specifying a set of groups as lockable objects,
and a failover operation could potentially run while a group was being
updated by another operation which could lead to unpredictable results.

Thank you for the bug report.