Bug #21948 Repeated create/drop/truncate together with restarts can cause SR failure
Submitted: 31 Aug 2006 15:26 Modified: 28 Dec 2006 14:04
Reporter: Jonathan Miller Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Disk Data Severity:S1 (Critical)
Version:5.1.12 OS:Linux (Linux 64 bit OS)
Assigned to: Jonas Oreland CPU Architecture:Any

[31 Aug 2006 15:26] Jonathan Miller
Description:
Trying to reproduce Bug 21015 for Martin, I recieved a cluster crash.

I had DBT2 (Mixed Data) test running and I had cid_dd_ndb.pl running. I killall -11 ndbd on the second DN then restarted it.

ndb_mgm showed:

ndb_mgm> Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Second data node error log shows:

Time: Thursday 31 August 2006 - 17:09:05
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupDiskAlloc.cpp
Error object: DBTUP (Line: 277) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 11893
Trace: /space/run/ndb_3_trace.log.2
Version: Version 5.1.12 (beta)
***EOM***

Data node #1 error log: (Which I think is maybe the same as http://bugs.mysql.com/bug.php?id=21017)

Time: Thursday 31 August 2006 - 17:09:05
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 13643) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 16528
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.12 (beta)
***EOM***

Cluster Log:
 Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
2006-08-31 17:09:05 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error
2006-08-31 17:09:05 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-08-31 17:09:05 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-08-31 17:09:06 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2006-08-31 17:09:06 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

MySQLD Log:
060831 17:05:19 [ERROR] Got error 4028 when reading table './dbt2/stock'
060831 17:05:19 [ERROR] Got error 286 when reading table './dbt2/stock'
060831 17:05:19 [ERROR] Got error 4025 when reading table './dbt2/item'
060831 17:05:19 [ERROR] Got error 4025 when reading table './dbt2/warehouse'
060831 17:09:05 [ERROR] NDB Binlog: Unable to drop event in database. Event: REPL$TESTER/t1 Error Code: 4009 Message: Cluster Failure
060831 17:09:05 [ERROR] Got error 4025 when reading table './dbt2/district'
060831 17:09:05 [ERROR] Got error 4025 when reading table './dbt2/district'
060831 17:09:05 [ERROR] Got error 4025 when reading table './dbt2/customer'
060831 17:11:06 [ERROR] NDB delete table: ./TESTER/t1 timed out. Ignoring...

How to repeat:
http://bugs.mysql.com/bug.php?id=21015
[31 Aug 2006 15:34] Jonathan Miller
mgm out log

Attachment: ndb_1_out.log (text/x-log), 451 bytes.

[31 Aug 2006 18:55] Jonathan Miller
Still trying to reproduce the other bug, I reproduced this one, but got a little different ndb error this time:

Time: Thursday 31 August 2006 - 20:49:30
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: tsman.cpp
Error object: TSMAN (Line: 1822) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 14252
Trace: /space/run/ndb_3_trace.log.7
Version: Version 5.1.12 (beta)
[11 Sep 2006 16:14] Jonas Oreland
Jeb,

Did you test this on 5.1-main or 5.1-ndb
Pekkas fix for this was pushed to main tree "1 Sep 9:58"

Can you retest ?

/Jonas
[21 Sep 2006 22:36] Jonathan Miller
If first tried to reproduce this using memory tables and had no issues. I then used mixed of memory tables and disk tables. During recovery I recieved:
Node 2: Forced node shutdown completed. Occured during startphase 5. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error

Time: Friday 22 September 2006 - 00:11:09
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: tsman.cpp
Error object: TSMAN (Line: 1822) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 23529
Trace: /space/run/ndb_2_trace.log.2
Version: Version 5.1.12 (beta)
***EOM***

JAM CONTENTS up->down left->right ?=not block entry
BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?000915
NDBFS   000513 000173
PGMAN   000209 000218
DBLQH   002484
DBTC    004047
DBTUP   002032
NDBFS   001040 001042 000853 000855 000773 000809 001054
QMGR    000116 000140 002090
PGMAN   001487
NDBFS   001040 000853 001059
PGMAN   000209 000218 000954 000981
DBTUP   001308 020243 020245
TSMAN   001757 001032
LGMAN   000162 000208 001605 000650 000653
DBTUP   001308 020243 020243 020245

--------------- Signal ----------------
r.bn: 260 "LGMAN", r.proc: 2, r.sigId: 4534222 gsn: 164 "CONTINUEB" prio: 1
s.bn: 249 "DBTUP", s.proc: 2, s.sigId: 4534221 length: 1 trace: 4 #sec: 0 fragInf: 0
 H'00000005
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 4534221 gsn: 164 "CONTINUEB" prio: 1
s.bn: 261 "PGMAN", s.proc: 2, s.sigId: 4534219 length: 1 trace: 4 #sec: 0 fragInf: 0
 H'00000001
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 4534220 gsn: 164 "CONTINUEB" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 4534217 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel again with no delay
--------------- Signal ----------------
r.bn: 261 "PGMAN", r.proc: 2, r.sigId: 4534219 gsn: 262 "FSREADCONF" prio: 1
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 4534217 length: 1 trace: 4 #sec: 0 fragInf: 0
 UserPointer: 1417
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 2, r.sigId: 4534218 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 4534216 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 4534217 gsn: 164 "CONTINUEB" prio: 0
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 4534215 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel every 10ms
--------------- Signal ----------------
[25 Sep 2006 8:00] Jonas Oreland
Hi jeb,

This looks like a DD bug...(that I also seen every now and then)
But I have not been able to create a short reproducable testcase.
I would be most happy if you helped me with that...

This i my guess on what to do...
1) create DD objects (tablespace/logfile group)
2) create table, insert rows
3) perform SR
4) drop table
5) create new table (with new structure) (will get same id as just dropped), insert rows
6) perform SR

or similar...(maybe change SR to NR instead)
there is something not working corectly in undo log, when (re)creating table
So you need to add rows to table (or elso UNDO will be empty)

Also, this is very dependant on how/when LCP is run...
So I suggest setting timebetweenlocalcheckpoint to high value
  and starting it manually by "all dump 7099"

Does this sound possible?
/Jonas
[26 Sep 2006 17:50] Jonathan Miller
I was not able to reproduce with the steps that you gave me. I tried several different mix of the steps without issue. I don't have any more time today to spend on this. I will try and look at it again later.
[6 Dec 2006 14:10] Jonas Oreland
reproducable testcase

Attachment: bug.sh (application/x-shellscript, text), 259 bytes.

[7 Dec 2006 14:27] Jonas Oreland
fixed by same cheset as bug#17605
[8 Dec 2006 9:37] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/16635

ChangeSet@1.2349, 2006-12-08 10:24:48+01:00, jonas@perch.ndb.mysql.com +2 -0
  ndb - bug#21948
    Fix bug in previous changeset...causing SR problems with disk
[28 Dec 2006 14:04] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented fix in 5.1.15 changelog.