Bug #74521 Incorrect schemaTrans outcome reported during DICT master takeover.
Submitted: 23 Oct 2014 8:35 Modified: 9 Dec 2014 12:27
Reporter: Ole John Aske
Status: Closed
Category: MySQL Cluster: Disk Data   Severity: S3 (Non-critical)
Version: 7.1.33   OS: Any
Assigned to:   CPU Architecture: Any

[23 Oct 2014 8:35] Ole John Aske
When a node acting as DICT master fails, it should still
be possible to request commit or abort of any open schemaTrans.
These requests will be sent to the new DICT master, which will
take over the schemaTrans, and report back whether the commit / abort
request succeeded.

In order to determine which node has become the new master, the
client simply tries a random node. If this is not the master node,
a REF with the error code 'NotMaster' is returned. The
client then polls another node for the master role, and so on,
until it gets either a CONF or a REF with a 'real error' (like TxnAbort).
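
The polling described above can be sketched roughly as follows. This is only an illustration, not the NDB API implementation: the `Reply` type, the `end_schema_trans` function, the `send_end_req` callback and the symbolic error names are all hypothetical and chosen for this example.

```python
import random
from dataclasses import dataclass
from typing import Callable, Optional, Sequence, Tuple

# Symbolic stand-ins for the error codes named in the report (hypothetical values).
NOT_MASTER = "NotMaster"
TXN_ABORT = "TxnAbort"


@dataclass
class Reply:
    kind: str                    # "CONF" or "REF"
    error: Optional[str] = None  # set only for a REF


def end_schema_trans(
    nodes: Sequence[int],
    send_end_req: Callable[[int], Reply],
    rng: random.Random,
) -> Tuple[int, Reply]:
    """Try nodes in random order until one answers as DICT master.

    A REF carrying NotMaster only means "ask someone else"; any other
    reply (CONF, or a REF with a real error) is the final outcome.
    """
    candidates = list(nodes)
    rng.shuffle(candidates)
    for node in candidates:
        reply = send_end_req(node)
        if reply.kind == "REF" and reply.error == NOT_MASTER:
            continue  # not the master, poll the next node
        return node, reply
    raise RuntimeError("no node answered as DICT master")
```

Note that the loop depends entirely on non-master nodes answering with 'NotMaster': any other REF is taken as the final result of the commit or abort request.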

However, by studying the AutoTest outcome from 'testDict -n schemaTrans',
we observe scenarios where:

- The test case inserts an error which causes a master node
  failure during TRANS_END_REQ(commit).
- NODE_FAILREP is returned to the client.
- The client resends TRANS_END_REQ(commit) to another node
  which is *not the new master*.
- As the commit processing has already completed on this
  node, it incorrectly replies with error
  '781: Invalid schema transaction key from NDB API'
  instead of 'NotMaster' as required by the 'protocol'.

This causes the client to incorrectly conclude that
the schemaTrans failed (aborted). If it had instead
contacted the new master, that node would have resurrected
the transaction status from the remaining schemaTrans slaves,
and completed the commit.
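
To illustrate why the reply from the non-master node matters, here is a minimal sketch of a reply order that would satisfy the 'protocol'. It is an assumption made for illustration only: the report does not describe how DICT actually orders these checks or how the fix was implemented, and `reply_to_trans_end_req` and its parameters are hypothetical names.

```python
NOT_MASTER = "NotMaster"   # symbolic stand-in for the 'NotMaster' REF
INVALID_TRANS_KEY = 781    # "Invalid schema transaction key from NDB API"


def reply_to_trans_end_req(is_master, open_trans_keys, trans_key):
    """Decide how a node answers TRANS_END_REQ (hypothetical sketch).

    The master-role check comes first, so a non-master node that has
    already finished (and forgotten) the transaction still answers
    NotMaster rather than error 781.
    """
    if not is_master:
        return ("REF", NOT_MASTER)
    if trans_key not in open_trans_keys:
        return ("REF", INVALID_TRANS_KEY)
    return ("CONF", None)
```

With this order, a node that is not the master and no longer knows the transaction key returns `("REF", "NotMaster")`, so the client keeps polling until it reaches the new master.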

Below is a dump from the 'v9' testcase in 'testDict -n schemaTrans',
which encounters this problem.
(The problem is not limited to testing, though.)

testDict started [2014-10-23 09:47:40]
|- T1
- SchemaTrans started [2014-10-23 09:47:40]
CASE y9 st_test_mnf_end_partial+3 - master node fail in end phase, commit, partial rollforward
ERR: receiveResponse - theImpl->theWaiter.m_state = 1
retry sleep 80ms on error 4013
FAIL 5573 res == 0: 781: Invalid schema transaction key from NDB API
FAIL 6865 st_end_trans(c, ST_CommitFlag) == 0
FAIL 7405 (*test.func)(c, test.arg) == NDBT_OK
FAIL 7477 st_test(c, test) == NDBT_OK
nodes up:2,3,4 down:5 unknown:
  |- runSchemaTrans FAILED [2014-10-23 09:51:30]
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent
Node failed when TCRELEASE sent
- SchemaTrans FAILED [2014-10-23 09:51:30]
Completed testDict [2014-10-23 09:51:30]
            FAIL  230 secs (230734 ms)

How to repeat:
Testcase 'v9' in AutoTest 'testDict -n schemaTrans'
[9 Dec 2014 12:27] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

  Documented fix in the NDB 7.1.34, 7.2.19, and 7.3.8 changelogs as follows:

        When a node acting as DICT master fails, it is still possible to
        request that any open schema transaction be either committed or
        aborted by sending this request to the new DICT master. In this
        event, the new master takes over the schema transaction and
        reports back on whether the commit or abort request succeeded.

        In certain cases, it was possible for the new master to be
        misidentified--that is, the request was sent to the wrong
        node, which responded with an error that was interpreted by the
        client application as an aborted schema transaction, even in
        cases where it could have been successfully committed, had the
        correct node been contacted.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at