Bug #74011 Race condition between ::performReceive and ::update_connections()
Submitted: 22 Sep 2014 12:27 Modified: 17 Oct 2014 12:58
Reporter: Ole John Aske Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:7.3.6 OS:Any
Assigned to: CPU Architecture:Any

[22 Sep 2014 12:27] Ole John Aske
Description:
There is a basic requirement in the design that TransporterRegistry::performReceive()
and ::update_connection() should not be run concurrently on the same set of
Transporters. This is due to ::update_connections() is doing the final cleanup
and re-init of buffers being used by ::performReceive(). Changing these
while a concurrent ::performReceive() read/writes the same buffers may cause
garbage or inconsistent signals being read or written.

Prior to WL#3860 (ATC patches) TransporterFacade::theMutexPtr was used to
protect these methods from concurrently modifying the same Transporter.
That WL removed that mutex without providing other protection for
concurrent usage of these methods, and thus introduced a regression.

How to repeat:
Apply this patch which detects concurrent usage of the
two methods in 'subject', then run MTR test '--suite=ndb'
for a short while:

=== modified file 'storage/ndb/include/transporter/TransporterCallback.hpp'
--- storage/ndb/include/transporter/TransporterCallback.hpp	revid:ole.john.aske@oracle.com-20140911090125-000jtrm6dz3wj6e0
+++ storage/ndb/include/transporter/TransporterCallback.hpp	2014-09-22 06:49:37 +0000
@@ -110,7 +110,17 @@
   /**
    *
    */

+#ifndef NDEBUG
+  /**
+   * 'm_active' is used by 'class TransporterReceiveWatchdog' in 
+   * DEBUG to detect concurrent calls to ::update_connections and
+   * ::performReceive() which isn't allowed.
+   */
+  TransporterReceiveHandle() : m_active(false) {};
+  bool m_active;
+#endif
 };
 
 /**

=== modified file 'storage/ndb/src/common/transporter/TransporterRegistry.cpp'
--- storage/ndb/src/common/transporter/TransporterRegistry.cpp	revid:ole.john.aske@oracle.com-20140911090125-000jtrm6dz3wj6e0
+++ storage/ndb/src/common/transporter/TransporterRegistry.cpp	2014-09-22 07:02:56 +0000
@@ -50,6 +50,42 @@
 #include <EventLogger.hpp>
 extern EventLogger * g_eventLogger;
 
+/**
+ * There is a requirement in the Transporter design that
+ * ::performReceive() and ::update_connections()
+ * on the same 'TransporterReceiveHandle' should not be 
+ * run concurrently. class TransporterReceiveWatchdog provides a
+ * simple mechanism to assert that this rule is followed.
+ * Does nothing if NDEBUG is defined (in production code)
+ */
+class TransporterReceiveWatchdog
+{
+public:
+#if NDEBUG 
+  TransporterReceiveWatchdog(TransporterReceiveHandle& recvdata)
+  {}
+
+#else
+  TransporterReceiveWatchdog(TransporterReceiveHandle& recvdata)
+    : m_recvdata(recvdata)
+  {
+    assert(m_recvdata.m_active == false);
+    m_recvdata.m_active = true;
+  }
+
+  ~TransporterReceiveWatchdog()
+  {
+    assert(m_recvdata.m_active == true);
+    m_recvdata.m_active = false;
+  }
+
+private:
+  TransporterReceiveHandle& m_recvdata;
+#endif
+};
+
 struct in_addr
 TransporterRegistry::get_connect_address(NodeId node_id) const
 {
@@ -1365,6 +1401,7 @@
 Uint32
 TransporterRegistry::performReceive(TransporterReceiveHandle& recvdata)
 {
+  TransporterReceiveWatchdog guard(recvdata);
   assert((receiveHandle == &recvdata) || (receiveHandle == 0));
 
   if (recvdata.m_recv_transporters.get(0))
@@ -1905,6 +1942,7 @@
 void
 TransporterRegistry::update_connections(TransporterReceiveHandle& recvdata)
 {
+  TransporterReceiveWatchdog guard(recvdata);
   assert((receiveHandle == &recvdata) || (receiveHandle == 0));
 
   for (int i= 0, n= 0; n < nTransporters; i++){
[1 Oct 2014 13:07] Ole John Aske
Posted by developer:
 
Added Mikael R. as second reviewer as he reviewed pr maillist and 
replied:
----------------------

Hi,
Reviewed and looks ok for me.

Rgrds Mikael
[16 Oct 2014 11:10] Ole John Aske
Posted by developer:
 
Note to Doc-team:

Even if (parts of) this fix is pushed 7.1->, the real
fix is only present from 7.3-> as that is where the
regression also was introduced.
[17 Oct 2014 12:58] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

Documented fix as follows in the NDB 7.3.8 changelog:

    A basic requirement of the NDB storage engine's design is that
    the transporter registry not attempt to receive data
    (TransporterRegistry::performReceive()) from and update the
    connection status (TransporterRegistry::update_connections()) of
    the same set of transporters concurrently, due to the fact that
    the updates perform final cleanup and reinitialization of
    buffers used when receiving data. Changing the contents of these
    buffers while reading or writing to them could lead to "garbage"
    or inconsistent signals being read or written.

    During the course of work done previously to improve the
    implementation of the transporter facade, a mutex intended to
    protect against the concurrent use of the performReceive() and
    update_connections()) methods on the same transporter was
    inadvertently removed. This fix adds a watchdog check for
    concurrent usage. In addition, update_connections() and
    performReceive() calls are now serialized together while polling
    the transporters.

Closed.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html