| Bug #74521 | Incorrect schemaTrans outcome reported during DICT master takeover. | ||
|---|---|---|---|
| Submitted: | 23 Oct 2014 8:35 | Modified: | 9 Dec 2014 12:27 |
| Reporter: | Ole John Aske | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Cluster: Disk Data | Severity: | S3 (Non-critical) |
| Version: | 7.1.33 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[9 Dec 2014 12:27]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.
Documented fix in the NDB 7.1.34, 7.2.19, and 7.3.8 changelogs as follows:
When a node acting as DICT master fails, it is still possible to
request that any open schema transaction be either committed or
aborted by sending this request to the new DICT master. In this
event, the new master takes over the schema transaction and
reports back on whether the commit or abort request succeeded.
In certain cases, it was possible for the new master to be
misidentified--that is, the request was sent to the wrong
node, which responded with an error that was interpreted by the
client application as an aborted schema transaction, even in
cases where it could have been successfully committed, had the
correct node been contacted.
Closed.
If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at
http://dev.mysql.com/doc/en/installing-source.html

Description:
When a node acting as DICT master fails, it should still be possible to request commit or abort of any open schemaTrans. These requests are sent to the new DICT master, which takes over the schemaTrans and reports back whether the commit / abort request succeeded.

To determine which node has become the new master, the client simply tries a random node. If this is not the master node, a REF with the error code 'NotMaster' is returned. The client then polls another node for the master role the next time, until it gets either a CONF or a REF with a 'real error' (like TxnAbort).

However, by studying the AutoTest outcome from 'testDict -n schemaTrans', we observe scenarios where:
- The test case inserts an error which causes master node failure during TRANS_END_REQ(commit).
- NODE_FAILREP is returned to the client.
- The client resends TRANS_END_REQ(commit) to another node which is *not the new master*.
- As the commit processing has already completed on this node, it incorrectly replies with error '781: Invalid schema transaction key from NDB API' instead of 'NotMaster' as required by the 'protocol'.

This causes the client to incorrectly conclude that the schemaTrans failed (aborted). If it had instead contacted the new master, that node would have resurrected the transaction status from the remaining schemaTrans slaves and completed the commit.

Below is a dump from the 'v9' testcase in 'testDict -n schemaTrans' which encounters this problem (the problem is not limited to testing, though):

======================
testDict started [2014-10-23 09:47:40]
|- T1
- SchemaTrans started [2014-10-23 09:47:40]
....
CASE y9 st_test_mnf_end_partial+3 - master node fail in end phase, commit, partial rollforward
ERR: receiveResponse - theImpl->theWaiter.m_state = 1
retry sleep 80ms on error 4013
FAIL 5573 res == 0: 781: Invalid schema transaction key from NDB API
FAIL 6865 st_end_trans(c, ST_CommitFlag) == 0
FAIL 7405 (*test.func)(c, test.arg) == NDBT_OK
FAIL 7477 st_test(c, test) == NDBT_OK
nodes up:2,3,4 down:5 unknown:
|- runSchemaTrans FAILED [2014-10-23 09:51:30]
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent
- SchemaTrans FAILED [2014-10-23 09:51:30]
Completed testDict [2014-10-23 09:51:30]

= SUMMARY OF TEST EXECUTION ==============
SchemaTrans FAIL 230 secs (230734 ms)

How to repeat:
Testcase 'v9' in AutoTest 'testDict -n schemaTrans'
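
For illustration only, the master-discovery loop from the description, and the way a premature error 781 short-circuits it, can be modelled with a small simulation. This is a rough sketch, not NDB code: the `Node` class, `trans_end_req`, `end_schema_trans`, the `check_master_first` switch, the transaction key 42, and the choice of node 3 as new master are all hypothetical. Only the 'NotMaster' reply and error 781 are taken from the report. The real client picks nodes at random; here the caller supplies the order so that the outcome is reproducible.

```python
# Toy model of the client/DICT exchange described above (not NDB code).
NOT_MASTER = "NotMaster"   # REF that tells the client to try another node
INVALID_TRANS_KEY = 781    # "Invalid schema transaction key from NDB API"


class Node:
    """Hypothetical data node; known_keys holds schemaTrans keys it still tracks."""

    def __init__(self, node_id, is_master, known_keys, check_master_first=True):
        self.node_id = node_id
        self.is_master = is_master
        self.known_keys = set(known_keys)
        # Assumption: True models the behaviour the protocol requires,
        # False models the faulty ordering described in the report.
        self.check_master_first = check_master_first

    def trans_end_req(self, trans_key):
        """Answer a commit request with ("CONF", None) or ("REF", error)."""
        if self.check_master_first and not self.is_master:
            return ("REF", NOT_MASTER)
        if trans_key not in self.known_keys:
            # A non-master that has already finished its commit processing
            # no longer knows the key and ends up here in the faulty ordering.
            return ("REF", INVALID_TRANS_KEY)
        if not self.is_master:
            return ("REF", NOT_MASTER)
        return ("CONF", None)


def end_schema_trans(nodes, trans_key):
    """Try nodes in the given order until a CONF or a 'real' error arrives."""
    for node in nodes:
        kind, error = node.trans_end_req(trans_key)
        if kind == "CONF":
            return ("committed", node.node_id)
        if error != NOT_MASTER:
            return ("failed", error)   # client concludes the transaction aborted
    return ("failed", "no master found")


if __name__ == "__main__":
    for flag in (True, False):
        nodes = [
            Node(2, False, [], check_master_first=flag),    # already completed commit
            Node(3, True, [42], check_master_first=flag),   # new master, took over
            Node(4, False, [], check_master_first=flag),
        ]
        print(flag, end_schema_trans(nodes, 42))
```

With `check_master_first=True` the client skips node 2 and the commit completes on node 3. With `False`, node 2 answers with 781 and the client gives up before it ever reaches the new master, which matches the failure seen in the test log.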