| Bug #74011 | Race condition between ::performReceive and ::update_connections() | ||
|---|---|---|---|
| Submitted: | 22 Sep 2014 12:27 | Modified: | 17 Oct 2014 12:58 |
| Reporter: | Ole John Aske | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
| Version: | 7.3.6 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
[1 Oct 2014 13:07]
Ole John Aske
Posted by developer: Added Mikael R. as second reviewer as he reviewed pr maillist and replied: ---------------------- Hi, Reviewed and looks ok for me. Rgrds Mikael
[16 Oct 2014 11:10]
Ole John Aske
Posted by developer: Note to Doc-team: Even if (parts of) this fix is pushed 7.1->, the real fix is only present from 7.3-> as that is where the regression also was introduced.
[17 Oct 2014 12:58]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.
Documented fix as follows in the NDB 7.3.8 changelog:
A basic requirement of the NDB storage engine's design is that
the transporter registry not attempt to receive data
(TransporterRegistry::performReceive()) from and update the
connection status (TransporterRegistry::update_connections()) of
the same set of transporters concurrently, due to the fact that
the updates perform final cleanup and reinitialization of
buffers used when receiving data. Changing the contents of these
buffers while reading or writing to them could lead to "garbage"
or inconsistent signals being read or written.
During the course of work done previously to improve the
implementation of the transporter facade, a mutex intended to
protect against the concurrent use of the performReceive() and
update_connections()) methods on the same transporter was
inadvertently removed. This fix adds a watchdog check for
concurrent usage. In addition, update_connections() and
performReceive() calls are now serialized together while polling
the transporters.
Closed.
If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at
http://dev.mysql.com/doc/en/installing-source.html

Description: There is a basic requirement in the design that TransporterRegistry::performReceive() and ::update_connection() should not be run concurrently on the same set of Transporters. This is due to ::update_connections() is doing the final cleanup and re-init of buffers being used by ::performReceive(). Changing these while a concurrent ::performReceive() read/writes the same buffers may cause garbage or inconsistent signals being read or written. Prior to WL#3860 (ATC patches) TransporterFacade::theMutexPtr was used to protect these methods from concurrently modifying the same Transporter. That WL removed that mutex without providing other protection for concurrent usage of these methods, and thus introduced a regression. How to repeat: Apply this patch which detects concurrent usage of the two methods in 'subject', then run MTR test '--suite=ndb' for a short while: === modified file 'storage/ndb/include/transporter/TransporterCallback.hpp' --- storage/ndb/include/transporter/TransporterCallback.hpp revid:ole.john.aske@oracle.com-20140911090125-000jtrm6dz3wj6e0 +++ storage/ndb/include/transporter/TransporterCallback.hpp 2014-09-22 06:49:37 +0000 @@ -110,7 +110,17 @@ /** * */ +#ifndef NDEBUG + /** + * 'm_active' is used by 'class TransporterReceiveWatchdog' in + * DEBUG to detect concurrent calls to ::update_connections and + * ::performReceive() which isn't allowed. + */ + TransporterReceiveHandle() : m_active(false) {}; + bool m_active; +#endif }; /** === modified file 'storage/ndb/src/common/transporter/TransporterRegistry.cpp' --- storage/ndb/src/common/transporter/TransporterRegistry.cpp revid:ole.john.aske@oracle.com-20140911090125-000jtrm6dz3wj6e0 +++ storage/ndb/src/common/transporter/TransporterRegistry.cpp 2014-09-22 07:02:56 +0000 @@ -50,6 +50,42 @@ #include <EventLogger.hpp> extern EventLogger * g_eventLogger; +/** + * There is a requirement in the Transporter design that + * ::performReceive() and ::update_connections() + * on the same 'TransporterReceiveHandle' should not be + * run concurrently. class TransporterReceiveWatchdog provides a + * simple mechanism to assert that this rule is followed. + * Does nothing if NDEBUG is defined (in production code) + */ +class TransporterReceiveWatchdog +{ +public: +#if NDEBUG + TransporterReceiveWatchdog(TransporterReceiveHandle& recvdata) + {} + +#else + TransporterReceiveWatchdog(TransporterReceiveHandle& recvdata) + : m_recvdata(recvdata) + { + assert(m_recvdata.m_active == false); + m_recvdata.m_active = true; + } + + ~TransporterReceiveWatchdog() + { + assert(m_recvdata.m_active == true); + m_recvdata.m_active = false; + } + +private: + TransporterReceiveHandle& m_recvdata; +#endif +}; + struct in_addr TransporterRegistry::get_connect_address(NodeId node_id) const { @@ -1365,6 +1401,7 @@ Uint32 TransporterRegistry::performReceive(TransporterReceiveHandle& recvdata) { + TransporterReceiveWatchdog guard(recvdata); assert((receiveHandle == &recvdata) || (receiveHandle == 0)); if (recvdata.m_recv_transporters.get(0)) @@ -1905,6 +1942,7 @@ void TransporterRegistry::update_connections(TransporterReceiveHandle& recvdata) { + TransporterReceiveWatchdog guard(recvdata); assert((receiveHandle == &recvdata) || (receiveHandle == 0)); for (int i= 0, n= 0; n < nTransporters; i++){