MySQL Bugs: #36247: Incorrect handling of LQH_TRANSREQ with cascading master failure

Bug #36247	Incorrect handling of LQH_TRANSREQ with cascading master failure
Submitted:	22 Apr 2008 9:30	Modified:	31 May 2008 10:47
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	*	OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
When a node X dies, the master Y starts to take over its transactions.

To do so, Y sends an LQH_TRANSREQ to every LQH.

If Y dies during this take-over, while an LQH is still scanning markers, that LQH sends its LQH_TRANSCONF to the old master Y instead of the new master Z.

Z therefore never receives the confirmation, which leads to endless "node-failure handling of node X not complete" messages.

How to repeat:
.

Suggested fix:
Check tcNodeFailptr.p->tcFailStatus == TcNodeFailRecord::TC_STATE_BREAK
in scanMarkers
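
The effect of such a check can be illustrated with a toy model. This is only a sketch, not the NDB kernel code: the types and the functions onTransReq, scanStep and runTakeover are hypothetical, and only the idea of a "break" state tested during the marker scan comes from this report. The sketch also assumes that, once the break state is seen, the scan restarts and replies to the most recent requester.

```cpp
#include <cassert>

// Toy model only: hypothetical names, not the NDB kernel data structures.
enum class FailStatus { Idle, Scanning, Break };

struct TakeoverScan {
  FailStatus status = FailStatus::Idle;
  int replyTo = -1;         // node that will receive the final confirmation
  int pendingReplyTo = -1;  // requester recorded while a scan was running
  int nextMarker = 0;       // index of the next marker to scan
};

// A take-over request arrives from node `master`.
void onTransReq(TakeoverScan& s, int master) {
  assert(master >= 0);
  if (s.status == FailStatus::Idle) {
    s.status = FailStatus::Scanning;
    s.replyTo = master;
  } else {
    // A scan is already running: remember the new requester and flag a break.
    s.pendingReplyTo = master;
    s.status = FailStatus::Break;
  }
}

// Scan one marker. With checkBreak set, a pending break restarts the scan
// on behalf of the new requester (assumed behaviour of the fix).
void scanStep(TakeoverScan& s, bool checkBreak) {
  if (checkBreak && s.status == FailStatus::Break) {
    s.replyTo = s.pendingReplyTo;
    s.status = FailStatus::Scanning;
    s.nextMarker = 0;
    return;
  }
  ++s.nextMarker;
}

// Run a whole take-over in which the old master dies at marker
// `failAtMarker` and the new master re-sends the request.
// Returns the node that the confirmation would be sent to.
int runTakeover(int markerCount, int oldMaster, int newMaster,
                int failAtMarker, bool checkBreak) {
  TakeoverScan s;
  onTransReq(s, oldMaster);
  bool newRequestSent = false;
  while (s.nextMarker < markerCount) {
    if (!newRequestSent && s.nextMarker == failAtMarker) {
      onTransReq(s, newMaster);
      newRequestSent = true;
    }
    scanStep(s, checkBreak);
  }
  return s.replyTo;
}
```

In this model, runTakeover(4, 2, 3, 2, false) returns 2 (the confirmation goes to the dead old master), while runTakeover(4, 2, 3, 2, true) returns 3 (the new master).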

Pushed to 51-ndb, telco* and drop6
(50-ndb was locked for an unknown reason).

Documented in the 5.1.24-ndb-6.3.14 changelog as follows:

        Under certain rare circumstances, the failure of the new master node
        while attempting a node takeover would cause takeover errors to repeat
        without being resolved.

Left in Patch Queued status pending further merges.

Closed per yesterday's discussion with Jonas.

Pushed into 6.0.6-alpha (revid:sp1r-jonas@perch.ndb.mysql.com-20080423140838-48946) (version source revid:jonas@mysql.com-20080808094047-4e1yiarqa2t3opg3) (pib:5)