Bug #41915 Rare crash during recovery due to incorrect checkpoint of MM-part of disktable
Submitted: 7 Jan 2009 10:05
Modified: 26 May 2009 7:57
Reporter: Pekka Nousiainen
Status: Closed
Category: MySQL Cluster: Disk Data
Severity: S2 (Serious)
Version: mysql-5.1-telco-6.x
OS: Any
Assigned to: Jonas Oreland
Tags: mysql-5.1.x-telco-6.x
Triage: Triaged: D1 (Critical) / R3 (Medium) / E4 (High)

[7 Jan 2009 10:05] Pekka Nousiainen
Description:
testSystemRestart -v -n SR_DD_1 D1
crash in DbtupDiskAlloc.cpp line 1102:
Dbtup::disk_page_free(...
  if (tabPtrP->m_attributes[DD].m_no_of_varsize == 0)...
    ndbassert(* (src + 1) != Tup_fixsize_page::FREE_RECORD);

How to repeat:
see description

Suggested fix:
may be related to bug#41398
[23 Jan 2009 14:04] Pekka Nousiainen
Same test case but 2 (instead of 4) LQH threads.
crash at line 1065
Dbtup::disk_page_alloc..
  ddassert(pagePtr.p->uncommitted_used_space > 0)
This 2-thread case took several hours.
[27 Apr 2009 10:25] Pekka Nousiainen
probably fixed by these:
http://lists.mysql.com/commits/72427
http://lists.mysql.com/commits/71907
[26 May 2009 3:32] Jonas Oreland
During checkpoint, DD and MM create a consistent point that they both restore to.
There was a bug in the MM part that (really rarely) could include/exclude one row that
should be in the snapshot. This would later cause a crash during/after recovery.
[26 May 2009 4:14] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/74930

2936 Jonas Oreland	2009-05-26
      ndb - bug#41915 - fix spurious crash in recovery of DD tables
[26 May 2009 4:52] Bugs System
Pushed into 5.1.34-ndb-7.0.6 (revid:jonas@mysql.com-20090526044928-bx3798wzc46ypnop) (version source revid:jonas@mysql.com-20090526044928-bx3798wzc46ypnop) (merge vers: 5.1.34-ndb-7.0.6) (pib:6)
[26 May 2009 4:53] Bugs System
Pushed into 5.1.34-ndb-6.2.18 (revid:jonas@mysql.com-20090526041403-0qtjtehbumdqqdgc) (version source revid:jonas@mysql.com-20090526041403-0qtjtehbumdqqdgc) (merge vers: 5.1.34-ndb-6.2.18) (pib:6)
[26 May 2009 4:54] Bugs System
Pushed into 5.1.34-ndb-6.3.26 (revid:jonas@mysql.com-20090526042602-qei3xzhbx53556k8) (version source revid:jonas@mysql.com-20090526042602-qei3xzhbx53556k8) (merge vers: 5.1.34-ndb-6.3.26) (pib:6)
[26 May 2009 5:02] Jonas Oreland
note to docs:
1) read my comment above on checkpoint
2) this seems to be somewhat more likely if using ndbmtd in 7.x
[26 May 2009 7:57] Jon Stephens
Documented bugfix in the NDB-6.2.18, 6.3.26, and 7.0.6 changelogs as follows:

      During a checkpoint, restore points are created for both the on-disk and
      in-memory parts of a Disk Data table. Under certain rare conditions, 
      the in-memory restore point could include or exclude a row that
      should have been in the snapshot. This would later lead to a crash
      during or following recovery.

      [7.0.6 version only:] 
      This issue was somewhat more likely to be encountered when using
      ndbmtd.
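
The inconsistency described in the changelog entry can be illustrated with a toy model. The sketch below is not NDB code: `RowId` and `mismatched_rows` are hypothetical names, and each snapshot part is modelled simply as a set of row ids. It shows only the invariant that the two restore points should satisfy, namely that the in-memory (MM) part and the on-disk (DD) part contain exactly the same rows. A single extra or missing row on the MM side is the kind of disagreement that recovery would later trip over.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <vector>

// Hypothetical row identifier; real NDB uses its own row/page addressing.
using RowId = int;

// Returns the row ids present in exactly one of the two snapshot parts.
// An empty result means the in-memory (MM) and on-disk (DD) parts agree.
std::vector<RowId> mismatched_rows(const std::set<RowId>& mm_part,
                                   const std::set<RowId>& dd_part) {
  std::vector<RowId> diff;
  std::set_symmetric_difference(mm_part.begin(), mm_part.end(),
                                dd_part.begin(), dd_part.end(),
                                std::back_inserter(diff));
  return diff;
}
```

For example, calling `mismatched_rows({1, 2, 3, 4}, {1, 2, 3})` reports row 4, which corresponds to the MM restore point including a row that the DD restore point does not know about.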