Bug #74320 Ensure master takeover isn't blocked for too long
Submitted: 10 Oct 2014 13:22 Modified: 22 Dec 2014 18:06
Reporter: Mikael Ronström
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: 7.4.0 OS: Any
Assigned to: CPU Architecture: Any

[10 Oct 2014 13:22] Mikael Ronström
Description:
Master takeover waits for the currently queued fragments to complete their task before node failure handling is completed. With the new 7.4 development, this means that we might have to wait for up to 64 fragments to complete before node failure handling can finish. This can take
a very long time, and there are very few log printouts that tell what the system is waiting for.

How to repeat:
Run an 8-node cluster, fail 4 nodes after inserting 200M records using flexAsynch, and study the handling of allocate nodeid, which will
sometimes take more than 30 seconds without any log printouts specifying why.

Suggested fix:
Ensure that DBLQH can report that it is ready with its part of master takeover even before LQH has completed the currently
ongoing fragment checkpoints. LQH will be able to continue processing its queue while the rest of the system completes the
master takeover. When those fragments are later ordered to be checkpointed, LQH will couple those requests to the currently ongoing
or already completed checkpoint tasks.
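The coupling described in the suggested fix can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the actual DBLQH code: the class `FragmentCheckpointQueue`, the `Outcome` values, the request ids, and the "newly queued" branch for a fragment that is neither pending nor completed are all hypothetical and do not come from this report.

```python
from collections import deque
from enum import Enum


class Outcome(Enum):
    ALREADY_COMPLETED = "already completed"
    COUPLED_TO_ONGOING = "coupled to ongoing"
    NEWLY_QUEUED = "newly queued"  # assumption: not described in the report


class FragmentCheckpointQueue:
    """Toy model of one LQH fragment checkpoint queue (hypothetical names)."""

    MAX_QUEUED = 64  # upper bound on queued fragments mentioned in the report

    def __init__(self, queued_fragments):
        fragments = list(queued_fragments)
        if len(fragments) > self.MAX_QUEUED:
            raise ValueError("too many queued fragments")
        self.pending = deque(fragments)  # queued or ongoing checkpoints
        self.completed = set()           # checkpoints that have finished
        self.waiters = {}                # fragment id -> coupled request ids

    def ready_for_takeover(self):
        # Report readiness at once; do not wait for `pending` to drain.
        return True

    def process_next(self):
        """Complete the checkpoint at the head of the queue, if any."""
        if not self.pending:
            return None
        frag = self.pending.popleft()
        self.completed.add(frag)
        # Hand back the requests that were coupled to this checkpoint.
        return frag, self.waiters.pop(frag, [])

    def order_checkpoint(self, frag, request_id):
        """Handle a checkpoint order that arrives after master takeover."""
        if frag in self.completed:
            return Outcome.ALREADY_COMPLETED
        if frag in self.pending:
            outcome = Outcome.COUPLED_TO_ONGOING
        else:
            self.pending.append(frag)
            outcome = Outcome.NEWLY_QUEUED
        self.waiters.setdefault(frag, []).append(request_id)
        return outcome
```

In this model, `ready_for_takeover()` returns immediately even with a full queue, `process_next()` keeps draining the queue in the background, and a later `order_checkpoint()` call for a fragment is answered from the existing task instead of starting a second checkpoint.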
[22 Dec 2014 18:06] Jon Stephens
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at

  http://www.mysql.com/downloads/

Documented fix in the NDB 7.4.3 changelog as follows:
    In NDB version 7.4, node failure handling can require completing
    checkpoints on up to 64 fragments. (This checkpointing is
    performed by the DBLQH kernel block.) The requirement for master
    takeover to wait for completion of all such checkpoints led in
    such cases to excessively long completion times.

    To address this issue, the DBLQH kernel block can now report
    that it is ready with its part of master takeover before it has
    completed any ongoing fragment checkpoints, and can continue
    processing its queue while the system completes the master
    takeover.

Closed.