Bug #36247 Incorrect handling of LQH_TRANSREQ with cascading master failure
Submitted: 22 Apr 11:30 Modified: 31 May 12:47
Reporter: Jonas Oreland
Status: Closed
Category: Server: Cluster Severity: S3 (Non-critical)
Version: * OS: Any
Assigned to: Jonas Oreland Target Version: 6.0
Triage: D1 (Critical) / R1 (None/Negligible) / E1 (None/Negligible)

[22 Apr 11:30] Jonas Oreland
Description:
When a node X dies,
the master Y starts to take over its transactions.

To do so, it sends an LQH_TRANSREQ to every LQH.

If Y dies during this takeover, while an LQH is still scanning markers,
then LQH_TRANSCONF is not sent to the new master Z, but to the old master Y.

As a result, "node-failure handling of node X not complete" is repeated endlessly.

How to repeat:
.

Suggested fix:
Check tcNodeFailptr.p->tcFailStatus == TcNodeFailRecord::TC_STATE_BREAK
in scanMarkers
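
The failure mode and the suggested check can be illustrated with a small stand-alone model. This is only a sketch under assumptions, not the NDB kernel source: the `Lqh` struct, `FailStatus`, `onTransReq`, `scanMarkersStep`, `lastConfDestination`, and the node ids are all hypothetical names invented for illustration. It assumes that a second LQH_TRANSREQ arriving mid-scan only records the new master and sets a "break" state, and that the reply destination is switched only when that state is checked.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified model of the take-over state kept by one LQH.
enum class FailStatus { Idle, Active, Break };

struct Lqh {
    FailStatus status = FailStatus::Idle;
    int replyTo = -1;            // master that the current scan replies to
    int newReplyTo = -1;         // master that asked while a scan was running
    std::size_t nextMarker = 0;  // scan position
    std::size_t markerCount = 0; // number of markers to scan
    std::vector<int> confSentTo; // destinations of the simulated LQH_TRANSCONF

    // Simulated arrival of LQH_TRANSREQ from `master`.
    void onTransReq(int master) {
        if (status != FailStatus::Idle) {
            // A take-over is already running: remember the new master and
            // ask the running scan to break off.
            newReplyTo = master;
            status = FailStatus::Break;
            return;
        }
        replyTo = master;
        nextMarker = 0;
        status = FailStatus::Active;
    }

    // One slice of the marker scan. Returns true while work remains.
    // `checkBreak` models the suggested fix.
    bool scanMarkersStep(bool checkBreak) {
        assert(status != FailStatus::Idle);
        if (checkBreak && status == FailStatus::Break) {
            // Restart the take-over on behalf of the new master.
            replyTo = newReplyTo;
            nextMarker = 0;
            status = FailStatus::Active;
            return true;
        }
        if (nextMarker < markerCount) {
            ++nextMarker;
            return true;
        }
        // Scan finished: reply to whoever replyTo currently names.
        confSentTo.push_back(replyTo);
        status = FailStatus::Idle;
        return false;
    }
};

// Runs the scenario from the description and returns the node that
// received the final simulated LQH_TRANSCONF.
int lastConfDestination(bool checkBreak) {
    const int oldMasterY = 2;
    const int newMasterZ = 3;
    Lqh lqh;
    lqh.markerCount = 4;
    lqh.onTransReq(oldMasterY);      // Y starts the take-over of X
    lqh.scanMarkersStep(checkBreak); // scan is under way
    lqh.onTransReq(newMasterZ);      // Y has died; Z restarts the take-over
    while (lqh.scanMarkersStep(checkBreak)) {
    }
    return lqh.confSentTo.back();
}
```

In this model, `lastConfDestination(false)` returns 2 (the dead master Y), so Z never sees a reply, while `lastConfDestination(true)` returns 3 (the new master Z).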
[25 Apr 9:55] Jonas Oreland
pushed to 51-ndb, telco* and drop6
(50-ndb was locked for unknown reason)
[20 May 11:34] Jon Stephens
Documented in the 5.1.24-ndb-6.3.14 changelog as follows:

        Under certain rare circumstances, the failure of the new master node
        while attempting a node takeover would cause takeover errors to repeat
        without being resolved.

Left Patch Queued status pending further merges.
[31 May 12:47] Jon Stephens
Closed per yesterday's discussion with Jonas.