Bug #57563 m_gcp_rep_counter ring buffer capacity exceeded
Submitted: 19 Oct 2010 11:17
Modified: 20 Oct 2010 14:25
Reporter: Hartmut Holzgraefe
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version:
OS: Linux
Assigned to: Jonas Oreland
CPU Architecture: Any

[19 Oct 2010 11:17] Hartmut Holzgraefe
SUMA has a 10 element ring buffer for storing out-of-order SUB_GCP_COMPLETE_REP signals received from LQH upon completion of GCPs.

We have now seen an incident in which both nodes of a node group failed with an ndb_require assertion because the ring buffer capacity was exceeded on both nodes at the same time.
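
To illustrate the failure mode, here is a minimal sketch of a fixed-capacity ring buffer of the kind described above. It is not the SUMA code: the class name PendingGcpBuffer, its methods, and the helper rejected_pushes() are hypothetical, and only the capacity of 10 comes from this report. Where this sketch returns false on overflow, the real block fails an ndb_require assertion.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for the SUMA buffer; all names are illustrative.
class PendingGcpBuffer {
public:
  static constexpr std::size_t kCapacity = 10;  // size stated in the report

  // Store a completion report that arrived out of order.
  // Returns false when the buffer is already full (the real code asserts).
  bool push(std::uint64_t gci) {
    if (count_ == kCapacity) return false;
    slots_[(head_ + count_) % kCapacity] = gci;
    ++count_;
    return true;
  }

  // Remove the oldest stored report; returns false if nothing is stored.
  bool pop(std::uint64_t& gci) {
    if (count_ == 0) return false;
    gci = slots_[head_];
    head_ = (head_ + 1) % kCapacity;
    --count_;
    return true;
  }

  std::size_t size() const { return count_; }

private:
  std::array<std::uint64_t, kCapacity> slots_{};
  std::size_t head_ = 0;   // index of the oldest entry
  std::size_t count_ = 0;  // number of stored entries
};

// Push n reports without draining any and count how many are rejected.
inline std::size_t rejected_pushes(std::size_t n) {
  PendingGcpBuffer buf;
  std::size_t rejected = 0;
  for (std::size_t i = 0; i < n; ++i) {
    if (!buf.push(100 + i)) ++rejected;
  }
  return rejected;
}
```

With ten pending reports the buffer is exactly full; an eleventh out-of-order report has nowhere to go, which is the condition that brought both nodes down.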

How to repeat:
This has only been seen once so far, so there is no clear pattern yet as to what may have led to this situation.

Suggested fix:
Enforce in-order processing of SUB_GCP_COMPLETE_REP signals so that a ring buffer for storing out-of-order arrivals is no longer needed. Such a change seems possible without any negative performance impact.
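
As a rough sketch of what the suggested fix buys on the receiving side: if delivery order is guaranteed upstream, the receiver only needs to remember the next expected GCI and has no buffer that can overflow. The class InOrderGcpReceiver and its methods are hypothetical and do not reflect the actual NDB implementation or the committed patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical receiver that relies on in-order delivery; illustrative only.
class InOrderGcpReceiver {
public:
  explicit InOrderGcpReceiver(std::uint64_t first_gci)
      : next_expected_(first_gci) {}

  // Handle one completion report. Returns false if the report is not the
  // next one in sequence, which in-order delivery is meant to rule out.
  bool on_complete_rep(std::uint64_t gci) {
    if (gci != next_expected_) return false;
    ++processed_;      // the report is handled immediately, nothing is stored
    ++next_expected_;
    return true;
  }

  std::uint64_t processed() const { return processed_; }
  std::uint64_t next_expected() const { return next_expected_; }

private:
  std::uint64_t next_expected_;
  std::uint64_t processed_ = 0;
};
```

The state is a single counter regardless of how many GCPs complete, so there is no capacity to exceed.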
[20 Oct 2010 7:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:


3879 Jonas Oreland	2010-10-20
      ndb - bug#57563 - add new tool to ndbmtd thread tool-box: synronize_path(), which waits until a message has traversed the path given as input. Use this in micro-gcp instead of syncronize_threads_for_blocks() to be sure to avoid starvation even in pathological cases
[20 Oct 2010 7:30] Bugs System
Pushed into mysql-5.1-telco-7.0 5.1.51-ndb-7.0.20 (revid:jonas@mysql.com-20101020071258-mamleyk2226czwe4) (version source revid:jonas@mysql.com-20101020071258-mamleyk2226czwe4) (merge vers: 5.1.51-ndb-7.0.20) (pib:21)
[20 Oct 2010 7:35] Jonas Oreland
Pushed to 7.0.20 and 7.1.9.
[20 Oct 2010 14:25] Jon Stephens
Documented bugfix in the NDB-7.0.20 and 7.1.9 changelogs as follows:

      The SUMA kernel block has a 10-element ring buffer for storing
      out-of-order SUB_GCP_COMPLETE_REP signals received from the local query
      handlers when global checkpoints are completed. In some cases, exceeding
      the ring buffer capacity on all nodes of a node group at the same time
      caused the node group to fail with an assertion.