Bug #74154 | Node restart blocked by DICT LOCK held by schema transaction | ||
---|---|---|---|
Submitted: | 30 Sep 2014 13:58 | Modified: | 4 Nov 2014 18:05 |
Reporter: | Ole John Aske | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 7.3.7 | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[30 Sep 2014 13:58]
Ole John Aske
[30 Sep 2014 14:36]
Ole John Aske
Posted by developer:

This DICT lock holdup is caused by a race condition in the ::dictSignal() retry logic and in how early failures of TRANS_BEGIN_REQ are handled.

Completion of TRANS_BEGIN_REQ is normally signaled by a TRANS_BEGIN_REF or a TRANS_BEGIN_CONF. However, if the master node of a schema transaction fails (NODE_FAILURE), API clients are informed of this by the new master sending a TRANS_END_REP (report). As the new master cannot know whether the failed master ever sent a CONF, a TRANS_END_REP signal may arrive without the API ever having seen the CONF for the transaction being started.

This causes a race in the ::dictSignal() retry logic, which repeats a TRANS_BEGIN_REQ when certain ignorable errors are seen, such as 'not a master', 'busy' and 'node failure', among others. The request is repeated until the operation is either REFed or CONFed, or the transaction is reported as failed by a TRANS_END_REP.

However, as TRANS_END_REP is an asynchronous signal sent by the new master, it may arrive after a new TRANS_BEGIN_REQ has been sent. The END_REP is then interpreted as a failure of that REQ, while the REQ may actually succeed and return a CONF later. As the dict API has already completed polling and incorrectly reported a failure, the later-arriving CONF is never seen. The client then holds a DICT lock that it is not aware of, and the lock is not released until the client terminates.
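The ordering problem described above can be illustrated with a small toy model. This is only a sketch for reasoning about the race, not NDB code: the `Master` and `Client` classes, the `simulate` function, and the `end_rep_before_retry` flag are all hypothetical names, and the real signal handling in ::dictSignal() is considerably more involved. The model assumes the first TRANS_BEGIN_REQ went to a master that failed, and compares the two possible arrival orders of the new master's TRANS_END_REP relative to the client's retry.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Master:
    """Hypothetical stand-in for the new master's DICT block."""
    lock_holder: Optional[str] = None

    def trans_begin_req(self, client_id: str) -> str:
        # Grant the DICT lock if it is free (CONF), otherwise refuse (REF).
        if self.lock_holder is None:
            self.lock_holder = client_id
            return "TRANS_BEGIN_CONF"
        return "TRANS_BEGIN_REF"


@dataclass
class Client:
    """Hypothetical API client polling for the outcome of TRANS_BEGIN_REQ."""
    client_id: str
    believes_lock_held: bool = False
    done_polling: bool = False

    def receive(self, signal: str) -> None:
        if self.done_polling:
            return  # polling has completed; late signals are never seen
        if signal == "TRANS_BEGIN_CONF":
            self.believes_lock_held = True
            self.done_polling = True
        elif signal in ("TRANS_BEGIN_REF", "TRANS_END_REP"):
            # END_REP is taken as failure of whichever REQ is outstanding.
            self.done_polling = True


def simulate(end_rep_before_retry: bool) -> bool:
    """Return True if the master holds a lock the client is unaware of."""
    master = Master()
    client = Client("api-1")
    if end_rep_before_retry:
        # END_REP for the first attempt arrives before any retry is sent.
        client.receive("TRANS_END_REP")
    else:
        # The client has already retried; the CONF is still in flight when
        # the stale END_REP for the first attempt arrives.
        reply = master.trans_begin_req(client.client_id)
        client.receive("TRANS_END_REP")
        client.receive(reply)
    return master.lock_holder is not None and not client.believes_lock_held


if __name__ == "__main__":
    print("END_REP before retry, lock leaked:", simulate(True))
    print("END_REP after retry, lock leaked:", simulate(False))
```

In this model the lock is only leaked in the second ordering, which matches the description above: the client stops polling on the stale END_REP, so the CONF that grants the lock is dropped.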
[4 Nov 2014 18:05]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

Fixed in MySQL Cluster NDB 7.1.34, 7.2.19, 7.3.8. Documented fix in these changelogs as follows:

When a client retried against a new master a schema transaction that failed previously against the previous master while the latter was restarting, the lock obtained by this transaction on the new master prevented the previous master from progressing past start phase 3 until the client was terminated, and resources held by it were cleaned up.

Closed.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html