Bug #21948 | Repeated create/drop/truncate together with restarts can cause SR failure | ||
---|---|---|---|
Submitted: | 31 Aug 2006 15:26 | Modified: | 28 Dec 2006 14:04 |
Reporter: | Jonathan Miller | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Disk Data | Severity: | S1 (Critical) |
Version: | 5.1.12 | OS: | Linux (Linux 64 bit OS) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[31 Aug 2006 15:26]
Jonathan Miller
[31 Aug 2006 15:34]
Jonathan Miller
mgm out log
Attachment: ndb_1_out.log (text/x-log), 451 bytes.
[31 Aug 2006 18:55]
Jonathan Miller
Still trying to reproduce the other bug, I reproduced this one, but got a little different ndb error this time: Time: Thursday 31 August 2006 - 20:49:30 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: tsman.cpp Error object: TSMAN (Line: 1822) 0x0000000a Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 14252 Trace: /space/run/ndb_3_trace.log.7 Version: Version 5.1.12 (beta)
[11 Sep 2006 16:14]
Jonas Oreland
Jeb, Did you test this on 5.1-main or 5.1-ndb Pekkas fix for this was pushed to main tree "1 Sep 9:58" Can you retest ? /Jonas
[21 Sep 2006 22:36]
Jonathan Miller
If first tried to reproduce this using memory tables and had no issues. I then used mixed of memory tables and disk tables. During recovery I recieved: Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error Time: Friday 22 September 2006 - 00:11:09 Status: Temporary error, restart node Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug) Error: 2341 Error data: tsman.cpp Error object: TSMAN (Line: 1822) 0x0000000a Program: /home/ndbdev/jmiller/builds/libexec/ndbd Pid: 23529 Trace: /space/run/ndb_2_trace.log.2 Version: Version 5.1.12 (beta) ***EOM*** JAM CONTENTS up->down left->right ?=not block entry BLOCK ADDR ADDR ADDR ADDR ADDR ADDR ADDR ADDR ?000915 NDBFS 000513 000173 PGMAN 000209 000218 DBLQH 002484 DBTC 004047 DBTUP 002032 NDBFS 001040 001042 000853 000855 000773 000809 001054 QMGR 000116 000140 002090 PGMAN 001487 NDBFS 001040 000853 001059 PGMAN 000209 000218 000954 000981 DBTUP 001308 020243 020245 TSMAN 001757 001032 LGMAN 000162 000208 001605 000650 000653 DBTUP 001308 020243 020243 020245 --------------- Signal ---------------- r.bn: 260 "LGMAN", r.proc: 2, r.sigId: 4534222 gsn: 164 "CONTINUEB" prio: 1 s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 4534221 length: 1 trace: 4 #sec: 0 fragInf: 0 H'00000005 --------------- Signal ---------------- r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 4534221 gsn: 164 "CONTINUEB" prio: 1 s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 4534219 length: 1 trace: 4 #sec: 0 fragInf: 0 H'00000001 --------------- Signal ---------------- r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 4534220 gsn: 164 "CONTINUEB" prio: 1 s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 4534217 length: 1 trace: 0 #sec: 0 fragInf: 0 Scanning the memory channel again with no delay --------------- Signal ---------------- r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 4534219 gsn: 262 "FSREADCONF" prio: 1 s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 4534217 length: 1 trace: 4 #sec: 0 fragInf: 0 UserPointer: 1417 --------------- Signal ---------------- r.bn: 252 "QMGR", r.proc: 2, r.sigId: 4534218 gsn: 164 "CONTINUEB" prio: 0 s.bn: 252 "QMGR", s.proc: 2, s.sigId: 4534216 length: 1 trace: 0 #sec: 0 fragInf: 0 H'00000004 --------------- Signal ---------------- r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 4534217 gsn: 164 "CONTINUEB" prio: 0 s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 4534215 length: 1 trace: 0 #sec: 0 fragInf: 0 Scanning the memory channel every 10ms --------------- Signal ----------------
[25 Sep 2006 8:00]
Jonas Oreland
Hi jeb, This looks like a DD bug...(that I also seen every now and then) But I have not been able to create a short reproducable testcase. I would be most happy if you helped me with that... This i my guess on what to do... 1) create DD objects (tablespace/logfile group) 2) create table, insert rows 3) perform SR 4) drop table 5) create new table (with new structure) (will get same id as just dropped), insert rows 6) perform SR or similar...(maybe change SR to NR instead) there is something not working corectly in undo log, when (re)creating table So you need to add rows to table (or elso UNDO will be empty) Also, this is very dependant on how/when LCP is run... So I suggest setting timebetweenlocalcheckpoint to high value and starting it manually by "all dump 7099" Does this sound possible? /Jonas
[26 Sep 2006 17:50]
Jonathan Miller
I was not able to reproduce with the steps that you gave me. I tried several different mix of the steps without issue. I don't have any more time today to spend on this. I will try and look at it again later.
[6 Dec 2006 14:10]
Jonas Oreland
reproducable testcase
Attachment: bug.sh (application/x-shellscript, text), 259 bytes.
[7 Dec 2006 14:27]
Jonas Oreland
fixed by same cheset as bug#17605
[8 Dec 2006 9:37]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/16635 ChangeSet@1.2349, 2006-12-08 10:24:48+01:00, jonas@perch.ndb.mysql.com +2 -0 ndb - bug#21948 Fix bug in previous changeset...causing SR problems with disk
[28 Dec 2006 14:04]
Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release. If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at http://dev.mysql.com/doc/en/installing-source.html Documented fix in 5.1.15 changelog.