Bug #35241 Out of REDO log due to incorrectly handled stopped node
Submitted: 12 Mar 2008 9:14   Modified: 31 May 2008 10:38
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine   Severity: S3 (Non-critical)
Version: < telco.6.3   OS: Any
Assigned to: Jonas Oreland   CPU Architecture: Any

[12 Mar 2008 9:14] Jonas Oreland
Description:
When starting an LCP, keepGci (the oldest part of the REDO log that
  must be kept) is calculated based on all tables/fragments/replicas
  in the system.

If a table/fragment/replica has been created and *does not* have an LCP at all,
  the createGci (i.e. the GCI at which the table was created) will be used as keepGci
  (for that table/fragment/replica).

  So if the node holding such a replica is dead, keepGci can stay put
  indefinitely, which leads to running out of REDO log (and subsequent disasters).
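
A minimal sketch of the calculation described above, to make the stuck keepGci concrete. This is not the NDB code: the `Replica` record, its field names, and the GCI numbers are hypothetical, and the real calculation considers more state than a single LCP value per replica.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Replica:
    # Hypothetical record, not an NDB structure.
    node_id: int
    create_gci: int               # GCI at which the replica was created
    last_lcp_gci: Optional[int]   # GCI covered by its newest LCP, None if it has none


def replica_keep_gci(r: Replica) -> int:
    # No LCP yet: fall back to createGci, as the description says.
    return r.create_gci if r.last_lcp_gci is None else r.last_lcp_gci


def keep_gci(replicas: List[Replica]) -> int:
    # The REDO log must be kept from the oldest per-replica value.
    return min(replica_keep_gci(r) for r in replicas)


# Node 2 was stopped right after the table was created at GCI 40,
# so its replica never gets an LCP, however far the others advance.
replicas = [
    Replica(node_id=1, create_gci=40, last_lcp_gci=500),
    Replica(node_id=2, create_gci=40, last_lcp_gci=None),
]
stuck = keep_gci(replicas)  # pinned at the createGci of the dead node's replica
```

As long as node 2 stays down, every later LCP on node 1 raises only its own value, and the minimum never moves past 40.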

This scenario can be reproduced by:
1) create table, stop node before an LCP has occurred
2) start cluster "initial/partial"

There may be other cases.

How to repeat:
see above

Suggested fix:
1) Short term: Don't use createGci for replicas residing on dead nodes.
2) Long term: implement local LCP, so nodes manage their own REDO log.
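
The short-term fix (1) might look roughly like the sketch below. It is an illustration under assumptions, not the committed patch: `Replica`, `keep_gci_skipping_dead`, `alive_nodes`, and the fallback to `current_gci` when nothing constrains the log are all hypothetical names and choices.

```python
from typing import Iterable, NamedTuple, Optional, Set


class Replica(NamedTuple):
    # Hypothetical record, not an NDB structure.
    node_id: int
    create_gci: int               # GCI at which the replica was created
    last_lcp_gci: Optional[int]   # GCI covered by its newest LCP, None if it has none


def keep_gci_skipping_dead(replicas: Iterable[Replica],
                           alive_nodes: Set[int],
                           current_gci: int) -> int:
    candidates = []
    for r in replicas:
        if r.last_lcp_gci is not None:
            # Replica has an LCP: it constrains the REDO log as before.
            candidates.append(r.last_lcp_gci)
        elif r.node_id in alive_nodes:
            # No LCP yet, but the node is alive and will produce one soon.
            candidates.append(r.create_gci)
        # No LCP and the node is dead: createGci is deliberately ignored.
    # Assumption: with no constraining replica, nothing older than now is needed.
    return min(candidates, default=current_gci)


replicas = [
    Replica(node_id=1, create_gci=40, last_lcp_gci=500),
    Replica(node_id=2, create_gci=40, last_lcp_gci=None),
]
with_node2_dead = keep_gci_skipping_dead(replicas, alive_nodes={1}, current_gci=600)
with_node2_alive = keep_gci_skipping_dead(replicas, alive_nodes={1, 2}, current_gci=600)
```

With node 2 dead its createGci no longer holds the minimum back, while a live node without an LCP still does, since its replica can be expected to get one.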
[12 Mar 2008 9:19] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/43811

ChangeSet@1.2196, 2008-03-12 10:19:20+01:00, jonas@perch.ndb.mysql.com +2 -0
  ndb - bug#35241 (drop6)
    Out of REDO log due to incorrectly handled stopped node
[12 Mar 2008 9:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/43812

ChangeSet@1.2533, 2008-03-12 10:28:14+01:00, jonas@perch.ndb.mysql.com +2 -0
  ndb - bug#35241
    Out of REDO log due to incorrectly handled stopped node
[12 Mar 2008 10:15] Jonas Oreland
Pushed to drop6, 51-ndb, telco-6.2.
Won't fix in 4.1/5.0.
[4 Apr 2008 20:17] Jon Stephens
Documented in the 5.1.23-ndb-6.3.11 changelog as follows:

        In some circumstances, a stopped data node was handled incorrectly,
        leading to redo log space being exhausted following an initial restart
        of the node, or an initial or partial restart of the cluster (the wrong
        GCI might be used in such cases). This could happen, for example, when a
        node was stopped following the creation of a new table, but before a new
        LCP could be executed.

Left in Patch Queued status pending additional merges.
[4 Apr 2008 22:41] Jon Stephens
Fix also noted in the 5.1.23-ndb-6.2.15 changelog.
[31 May 2008 10:38] Jon Stephens
Closed following yesterday's discussion with Jonas.
[12 Dec 2008 23:29] Bugs System
Pushed into 6.0.6-alpha  (revid:sp1r-jonas@perch.ndb.mysql.com-20080312092814-21354) (version source revid:sp1r-tomas@poseidon.ndb.mysql.com-20080516085603-30848) (pib:5)