Bug #22704 Cluster Crashes during NR while DD schema operations are being handled
Submitted: 26 Sep 2006 17:06
Modified: 5 Jul 2007 21:10
Reporter: Jonathan Miller
Status: Can't repeat
Category: MySQL Cluster: Disk Data
Severity: S1 (Critical)
Version: 5.1.12-main
OS: Linux (Linux 32 Bit OS)
Assigned to: Tomas Ulin
CPU Architecture: Any

[26 Sep 2006 17:06] Jonathan Miller
Description:
I was trying Jonas's suggestions from bug#21948. I was not able to reproduce 21948 with Jonas's instructions, so I tried to reproduce it with cid_ndb_dd.pl using disk data.

This testing was a little different from the last testing that produced 21948, as that time I was using DBT2 with mixed tables (memory and disk data) and cid_ndb_dd.pl was using straight memory tables.

The test script creates a log file group, a tablespace, a database, and a table, inserts data, deletes data, drops the table, drops the database, drops the tablespace, drops the log file group, and then repeats.
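
For readers without the attached script, the cycle described above can be sketched roughly as follows. This is not taken from cid_ndb_dd.pl; the function name, object names, file names, sizes, and table definition are all made up for illustration. The ALTER TABLESPACE ... DROP DATAFILE step is an assumption added because NDB generally expects a tablespace to have no data files left before it is dropped.

```python
def dd_cycle_statements(run_id):
    """Return the SQL for one create/load/drop cycle, in execution order.

    Hypothetical sketch: every object name, file name, and size here is
    an illustrative assumption, not what cid_ndb_dd.pl actually uses.
    """
    lfg = f"lg_{run_id}"
    ts = f"ts_{run_id}"
    db = f"db_{run_id}"
    undo = f"undo_{run_id}.dat"
    data = f"data_{run_id}.dat"
    return [
        # Create phase: log file group, tablespace, database, disk data table.
        f"CREATE LOGFILE GROUP {lfg} ADD UNDOFILE '{undo}' "
        f"INITIAL_SIZE 16M ENGINE NDB",
        f"CREATE TABLESPACE {ts} ADD DATAFILE '{data}' "
        f"USE LOGFILE GROUP {lfg} INITIAL_SIZE 12M ENGINE NDB",
        f"CREATE DATABASE {db}",
        f"CREATE TABLE {db}.t1 (a INT NOT NULL PRIMARY KEY, b VARCHAR(64)) "
        f"TABLESPACE {ts} STORAGE DISK ENGINE NDB",
        # Load phase: insert rows, then delete them again.
        f"INSERT INTO {db}.t1 VALUES (1, 'x'), (2, 'y')",
        f"DELETE FROM {db}.t1",
        # Drop phase: reverse order of creation.
        f"DROP TABLE {db}.t1",
        f"DROP DATABASE {db}",
        # Assumed extra step: remove the data file before dropping the tablespace.
        f"ALTER TABLESPACE {ts} DROP DATAFILE '{data}' ENGINE NDB",
        f"DROP TABLESPACE {ts} ENGINE NDB",
        f"DROP LOGFILE GROUP {lfg} ENGINE NDB",
    ]


if __name__ == "__main__":
    for statement in dd_cycle_statements(1):
        print(statement + ";")
```

The statements are only generated as strings here, so they can be piped into a mysql client or executed through any connector in a loop.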

I killed one of the data nodes just as the delete of the data was about to take place. I then restarted it as the test was dropping and creating the log file group (LFG) and tablespace (TS). I am not sure whether it was the drop or the create that caused this, but the node that had been left running then crashed with the following error:

Time: Tuesday 26 September 2006 - 15:44:34
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 13735) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 16669
Trace: /space/run/ndb_2_trace.log.1
Version: Version 5.1.12 (beta)

JAM CONTENTS up->down left->right ?=not block entry
BLOCK   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR   ADDR
       ?006179 006179 006179 006179 006179 006179 006179 006179
        006179 006179 006179 006179 006179 006179 006179 006179
        006179 006179 006179 006179 006179 006179 006179 006179
        006179 006179 006179 006179 006179 006179 006179 006179
        006179 006179 006179 006179 006179 006179 006179 006179

--------------- Signal ----------------
r.bn: 250 "DBDICT", r.proc: 2, r.sigId: 489468 gsn: 410 "DICT_LOCK_REQ" prio: 1
s.bn: 246 "DBDIH", s.proc: 3, s.sigId: 180875 length: 3 trace: 0 #sec: 0 fragInf: 0
 H'00000000 H'00000001 H'00f60003
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 2, r.sigId: 489467 gsn: 409 "TIME_SIGNAL" prio: 1
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 489466 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 252 "QMGR", r.proc: 2, r.sigId: 489466 gsn: 164 "CONTINUEB" prio: 0
s.bn: 252 "QMGR", s.proc: 2, s.sigId: 489464 length: 1 trace: 0 #sec: 0 fragInf: 0
 H'00000004
--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 2, r.sigId: 489465 gsn: 164 "CONTINUEB" prio: 0
s.bn: 253 "NDBFS", s.proc: 2, s.sigId: 489463 length: 1 trace: 0 #sec: 0 fragInf: 0
 Scanning the memory channel every 10ms
--------------- Signal ----------------

How to repeat:
Create a cluster with 2 data nodes.
Start running cid_ndb_dd.pl:
./cid_ndb_dd.pl -w
Let it run for a couple of iterations.

Once it inserts data, wait 4 to 5 seconds, then run killall -11 against one of the data nodes. Restart that data node right away.
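
The kill-and-restart timing above could be scripted along these lines. This is only a sketch under assumptions: the function name and parameters are hypothetical, it targets a single process by pid rather than by name as killall does, and the restart command is left entirely to the caller because the exact ndbd invocation depends on the local setup.

```python
import os
import signal
import subprocess
import time


def crash_and_restart(pid, restart_cmd, delay=4.5,
                      kill=os.kill, spawn=subprocess.Popen, sleep=time.sleep):
    """Wait `delay` seconds, send SIGSEGV to `pid`, then launch `restart_cmd`.

    Hypothetical helper: `kill`, `spawn`, and `sleep` are injectable so the
    sequence can be dry-run without touching a real data node.
    """
    # Give the test script time to get from the insert to the delete phase.
    sleep(delay)
    # SIGSEGV is signal 11 on Linux, matching `killall -11`.
    kill(pid, signal.SIGSEGV)
    # Restart the data node immediately; the command is supplied by the caller.
    return spawn(restart_cmd)
```

Call it with the pid of the data node to kill and the argument list used to start that node, right after the script reports that the insert has finished.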
[26 Sep 2006 17:12] Jonathan Miller
Error log

Attachment: ndb_2_error.log (text/x-log), 568 bytes.

[26 Sep 2006 17:12] Jonathan Miller
test script

Attachment: cid_ndb_dd.pl (application/x-perl, text), 19.70 KiB.

[28 Sep 2006 11:03] Jonathan Miller
Yep, the updated title is better. Thanks
[16 Dec 2006 10:12] Jonas Oreland
Hi,

I think this has been fixed by other bug fixes.

Can you please retest?

/Jonas
[26 Mar 2007 12:50] Jonathan Miller
-so [--socket]          :Connect using socket (default false)

         -sp [--spath=string]    :socket path and file name
                                  (default /tmp/mysql.sock)
[5 Jul 2007 21:10] Jonathan Miller
Since the patch for bug#29501 was pushed, I have not been able to cause any other data node failures on recovery. Therefore, I am closing this as can't repeat.
/Jeb