MySQL Bugs: #44952: Ndbd file system inconsistency error

Bug #44952	Ndbd file system inconsistency error
Submitted:	19 May 2009 13:00	Modified:	19 Dec 2009 9:42
Reporter:	Guido Ostkamp	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1	OS:	Solaris
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	5.0, mysql-5.1-telco-*

Description:
Hello,

during stability tests MySQL Cluster failed with the following bug reports:

009-05-19 14:33:06 [ndbd] INFO     -- Error while reading REDO log. from 16375
D=10, F=2 Mb=24 FP=799 W1=8177 W2=7 : end of log wo/ having found last GCI gci: 215259
2009-05-19 14:33:06 [ndbd] INFO     -- DBLQH (Line: 16482) 0x0000000a
2009-05-19 14:33:06 [ndbd] INFO     -- Error handler startup shutting down system
2009-05-19 14:33:06 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-05-19 14:33:06 [ndbd] INFO     -- Angel received ndbd startup failure count 1.
2009-05-19 14:33:10 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2310: 'Error whi
le reading the REDO log(Ndbd file system inconsistency error, please report a bug). Ndbd file system error, restart node initial'.

Details for build:

Bazaar branch mysql-cluster-7.0 revision-id: jonas@mysql.com-20090513153236-s7n0vgk02qprth52 compiled with binary optimization
CC=cc CXX=CC CFLAGS="-xO5 -fast -mt -m64 -xbinopt=prepare" CXXFLAGS="-xO5 -fast -mt -m64 -instances=semiexplicit -xbinopt=prepare" LDFLAGS="-xbinopt=prepare" ./configure --prefix=/export/home/wsch/6.4_2009_01_29 --with-plugins=all --without-docs --without-man

OS: SunOS pelton1 5.10 Generic_137111-08 sun4v sparc SUNW,Netra-T2000

I will upload full logs to FTP server soon.

Regards

Guido Ostkamp

How to repeat:
1. Run MySQL cluster on two physical nodes with 1 data node on each physical node (here hostnames are pelton1/pelton2)
2. Run some test applications on pelton1 (first node)
3. Repeatedly kill -9 on ndbmtd (also during its recovery restart)
4. Error occurs

Full logs have been uploaded to bug-data-44952.tar.bz2 on FTP server.

I forgot some details:

The kill -9 was done on second node, i.e. pelton2 here.

The bug might be related or identical to http://bugs.mysql.com/bug.php?id=42564 reported some month ago. Suggest to close the old one and continue here if its the same issue.

Hi Guido, 

I've now manage to produce a reproducible test case!
The bug occurs when a GCI marker is written in the first page of a mega-byte,
and that node is later killed during restart after having processed that page
(marker) but before completing it's LCP. roughly :-)

I've also written a fix, but would like to run our internal regression suite
before sending it to you. This takes ~24 hours, so if all goes well, i'll send you a patch on monday morning (or sunday evening)

Is this good enough for you?

/Jonas

ps. bug#42564 is a duplicate of this.

Hello Jonas,

> Is this good enough for you?

yes, this is fine with us, we shall wait until Monday morning for the fix.
Please post the patch either here or send direct email, whatever you prefer.

Regards

Guido Ostkamp

This was pushed as http://lists.mysql.com/commits/74850
I however miss-typed the bug number in the commit message
so it's not automatically linked to this bug report :-(
pushed to 6.2,6.3 and 7.0

for docs: 6.2.18, 6.3.26 and 7.0.6

Minimal patch that fixes problem (wo/ test etc...)

Attachment: bug44952.patch (text/x-patch), 762 bytes.

Thank you for providing the patch.  e have updated to the latest bazaar version (including the patch). Tests will most likely be continued tomorrow. We will report the outcome here provided the bug report is not closed earlier (in this case, we will open a new bug report if the bug can still be reproduced).

Regards

Guido Ostkamp

Documented bugfix in the NDB-6.2.18, 6.3.26, and 7.0.6 changelogs as follows:

        In some cases, master takeover immediately following a sucessful
        node restart caused the restarted node to send back incorrect
        global checkpoint data. The resulting (nonexistent) GCI was
        saved to system files, so that if the node were restarted again,
        it would try to recover using this invalid GCI.

Jon: your description is 0.3% correct.
The bug fixed is what is decribed in my comment from 22 May 11:33

Updated changelog entry as suggested by Jonas:

        When a data node had written its GCI marker to the first page of
        a megabyte, and that node was later killed during restart after
        having processed that page (marker) but before completing a LCP,
        the data node could fail with filesystem errors.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/88992

2835 Davi Arnaut	2009-11-02
      Bug#44952: Ndbd file system inconsistency error
      
      Backport a ndb patch: fix bug with crash during restart, where
      a mbyte incorrectly could be skipped, leading to "end of log
      wo/ finding gci".

Queued to 5.0-bugteam and up.

Pushed into 5.0.88 (revid:joro@sun.com-20091104091355-hpz6dwgkrfmokj3k) (version source revid:davi.arnaut@sun.com-20091102201021-1brn7cjb1kvqg9gr) (merge vers: 5.0.88) (pib:13)

Pushed into 5.1.41 (revid:joro@sun.com-20091104092152-qz96bzlf2o1japwc) (version source revid:kristofer.pettersson@sun.com-20091103162305-08l4gkeuif2ozsoj) (merge vers: 5.1.41) (pib:13)

Also documented this fix in the 5.0.88 changelog.

(Did not add to 5.1.41 changelog since Cluster isn't supported in that version.)

Closed BUG#42564 as a duplicate of this bug.

Pushed into 6.0.14-alpha (revid:alik@sun.com-20091110093407-rw5g8dys2baqkt67) (version source revid:alik@sun.com-20091109080109-7dxapd5y5pxlu08w) (merge vers: 6.0.14-alpha) (pib:13)

Pushed into 5.5.0-beta (revid:alik@sun.com-20091109115615-nuohp02h8mdrz8m2) (version source revid:alik@sun.com-20091105110316-pn162zoddxj89lbh) (merge vers: 5.5.0-beta) (pib:13)

Did not add to 5.5 or 6.0 changelogs since Cluster isn't supported in that versions.

Re-closed.

Pushed into 5.1.41-ndb-7.1.0 (revid:jonas@mysql.com-20091218102229-64tk47xonu3dv6r6) (version source revid:jonas@mysql.com-20091218095730-26gwjidfsdw45dto) (merge vers: 5.1.41-ndb-7.1.0) (pib:15)

Pushed into 5.1.41-ndb-6.2.19 (revid:jonas@mysql.com-20091218100224-vtzr0fahhsuhjsmt) (version source revid:jonas@mysql.com-20091217101452-qwzyaig50w74xmye) (merge vers: 5.1.41-ndb-6.2.19) (pib:15)

Pushed into 5.1.41-ndb-6.3.31 (revid:jonas@mysql.com-20091218100616-75d9tek96o6ob6k0) (version source revid:jonas@mysql.com-20091217154335-290no45qdins5bwo) (merge vers: 5.1.41-ndb-6.3.31) (pib:15)

Pushed into 5.1.41-ndb-7.0.11 (revid:jonas@mysql.com-20091218101303-ga32mrnr15jsa606) (version source revid:jonas@mysql.com-20091218064304-ezreonykd9f4kelk) (merge vers: 5.1.41-ndb-7.0.11) (pib:15)

No additional changelog entries needed; re-closing.