Bug #44952 Ndbd file system inconsistency error
Submitted: 19 May 2009 13:00 Modified: 19 Dec 2009 9:42
Reporter: Guido Ostkamp Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:mysql-5.1 OS:Solaris
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: 5.0, mysql-5.1-telco-*

[19 May 2009 13:00] Guido Ostkamp
Description:
Hello,

during stability tests MySQL Cluster failed with the following bug reports:

009-05-19 14:33:06 [ndbd] INFO     -- Error while reading REDO log. from 16375
D=10, F=2 Mb=24 FP=799 W1=8177 W2=7 : end of log wo/ having found last GCI gci: 215259
2009-05-19 14:33:06 [ndbd] INFO     -- DBLQH (Line: 16482) 0x0000000a
2009-05-19 14:33:06 [ndbd] INFO     -- Error handler startup shutting down system
2009-05-19 14:33:06 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-05-19 14:33:06 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2009-05-19 14:33:10 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2310: 'Error whi
le reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

Details for build:

Bazaar branch mysql-cluster-7.0 revision-id: jonas@mysql.com-20090513153236-s7n0vgk02qprth52 compiled with binary optimization
CC=cc CXX=CC CFLAGS="-xO5 -fast -mt -m64 -xbinopt=prepare" CXXFLAGS="-xO5 -fast -mt -m64 -instances=semiexplicit -xbinopt=prepare" LDFLAGS="-xbinopt=prepare" ./configure --prefix=/export/home/wsch/6.4_2009_01_29 --with-plugins=all --without-docs --without-man

OS: SunOS pelton1 5.10 Generic_137111-08 sun4v sparc SUNW,Netra-T2000

I will upload full logs to FTP server soon.

Regards

Guido Ostkamp

How to repeat:
1. Run MySQL cluster on two physical nodes with 1 data node on each physical node (here hostnames are pelton1/pelton2)
2. Run some test applications on pelton1 (first node)
3. Repeatedly kill -9 on ndbmtd (also during its recovery restart)
4. Error occurs
[19 May 2009 13:02] Guido Ostkamp
Full logs have been uploaded to bug-data-44952.tar.bz2 on FTP server.
[19 May 2009 13:06] Guido Ostkamp
I forgot some details:

The kill -9 was done on second node, i.e. pelton2 here.
[19 May 2009 13:14] Guido Ostkamp
The bug might be related or identical to http://bugs.mysql.com/bug.php?id=42564 reported some month ago. Suggest to close the old one and continue here if its the same issue.
[22 May 2009 9:33] Jonas Oreland
Hi Guido, 

I've now manage to produce a reproducible test case!
The bug occurs when a GCI marker is written in the first page of a mega-byte,
and that node is later killed during restart after having processed that page
(marker) but before completing it's LCP. roughly :-)

I've also written a fix, but would like to run our internal regression suite
before sending it to you. This takes ~24 hours, so if all goes well, i'll send you a patch on monday morning (or sunday evening)

Is this good enough for you?

/Jonas

ps. bug#42564 is a duplicate of this.
[22 May 2009 10:02] Guido Ostkamp
Hello Jonas,

> Is this good enough for you?

yes, this is fine with us, we shall wait until Monday morning for the fix.
Please post the patch either here or send direct email, whatever you prefer.

Regards

Guido Ostkamp
[24 May 2009 19:08] Jonas Oreland
This was pushed as http://lists.mysql.com/commits/74850
I however miss-typed the bug number in the commit message
so it's not automatically linked to this bug report :-(
pushed to 6.2,6.3 and 7.0
[24 May 2009 19:10] Jonas Oreland
for docs: 6.2.18, 6.3.26 and 7.0.6
[24 May 2009 19:12] Jonas Oreland
Minimal patch that fixes problem (wo/ test etc...)

Attachment: bug44952.patch (text/x-patch), 762 bytes.

[25 May 2009 13:24] Guido Ostkamp
Thank you for providing the patch.  e have updated to the latest bazaar version (including the patch). Tests will most likely be continued tomorrow. We will report the outcome here provided the bug report is not closed earlier (in this case, we will open a new bug report if the bug can still be reproduced).

Regards

Guido Ostkamp
[26 May 2009 9:06] Jon Stephens
Documented bugfix in the NDB-6.2.18, 6.3.26, and 7.0.6 changelogs as follows:

        In some cases, master takeover immediately following a sucessful
        node restart caused the restarted node to send back incorrect
        global checkpoint data. The resulting (nonexistent) GCI was
        saved to system files, so that if the node were restarted again,
        it would try to recover using this invalid GCI.
[26 May 2009 9:13] Jonas Oreland
Jon: your description is 0.3% correct.
The bug fixed is what is decribed in my comment from 22 May 11:33
[26 May 2009 9:22] Jon Stephens
Updated changelog entry as suggested by Jonas:

        When a data node had written its GCI marker to the first page of
        a megabyte, and that node was later killed during restart after
        having processed that page (marker) but before completing a LCP,
        the data node could fail with filesystem errors.
[2 Nov 2009 20:10] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/88992

2835 Davi Arnaut	2009-11-02
      Bug#44952: Ndbd file system inconsistency error
      
      Backport a ndb patch: fix bug with crash during restart, where
      a mbyte incorrectly could be skipped, leading to "end of log
      wo/ finding gci".
[2 Nov 2009 21:10] Davi Arnaut
Queued to 5.0-bugteam and up.
[4 Nov 2009 9:16] Bugs System
Pushed into 5.0.88 (revid:joro@sun.com-20091104091355-hpz6dwgkrfmokj3k) (version source revid:davi.arnaut@sun.com-20091102201021-1brn7cjb1kvqg9gr) (merge vers: 5.0.88) (pib:13)
[4 Nov 2009 9:24] Bugs System
Pushed into 5.1.41 (revid:joro@sun.com-20091104092152-qz96bzlf2o1japwc) (version source revid:kristofer.pettersson@sun.com-20091103162305-08l4gkeuif2ozsoj) (merge vers: 5.1.41) (pib:13)
[4 Nov 2009 14:45] Jon Stephens
Also documented this fix in the 5.0.88 changelog.

(Did not add to 5.1.41 changelog since Cluster isn't supported in that version.)

Closed BUG#42564 as a duplicate of this bug.
[11 Nov 2009 6:49] Bugs System
Pushed into 6.0.14-alpha (revid:alik@sun.com-20091110093407-rw5g8dys2baqkt67) (version source revid:alik@sun.com-20091109080109-7dxapd5y5pxlu08w) (merge vers: 6.0.14-alpha) (pib:13)
[11 Nov 2009 6:56] Bugs System
Pushed into 5.5.0-beta (revid:alik@sun.com-20091109115615-nuohp02h8mdrz8m2) (version source revid:alik@sun.com-20091105110316-pn162zoddxj89lbh) (merge vers: 5.5.0-beta) (pib:13)
[11 Nov 2009 15:49] Jon Stephens
Did not add to 5.5 or 6.0 changelogs since Cluster isn't supported in that versions.

Re-closed.
[18 Dec 2009 10:37] Bugs System
Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091218102229-64tk47xonu3dv6r6) (version source revid:jonas@mysql.com-20091218095730-26gwjidfsdw45dto) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)
[18 Dec 2009 10:53] Bugs System
Pushed into 5.1.41-ndb-6.2.19 (revid:jonas@mysql.com-20091218100224-vtzr0fahhsuhjsmt) (version source revid:jonas@mysql.com-20091217101452-qwzyaig50w74xmye) (merge vers: 5.1.41-ndb-6.2.19) (pib:15)
[18 Dec 2009 11:08] Bugs System
Pushed into 5.1.41-ndb-6.3.31 (revid:jonas@mysql.com-20091218100616-75d9tek96o6ob6k0) (version source revid:jonas@mysql.com-20091217154335-290no45qdins5bwo) (merge vers: 5.1.41-ndb-6.3.31) (pib:15)
[18 Dec 2009 11:22] Bugs System
Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20091218101303-ga32mrnr15jsa606) (version source revid:jonas@mysql.com-20091218064304-ezreonykd9f4kelk) (merge vers: 5.1.41-ndb-7.0.11) (pib:15)
[19 Dec 2009 9:42] Jon Stephens
No additional changelog entries needed; re-closing.