Bug #14935 Data nodes crash when ordered index full
Submitted: 15 Nov 2005 5:16 Modified: 28 Jul 2006 11:29
Reporter: Mark Kirkwood Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.1 OS:Linux (linux)
Assigned to: Pekka Nousiainen CPU Architecture:Any

[15 Nov 2005 5:16] Mark Kirkwood
Description:
Inserting data into a table using LOAD DATA results in:

ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of trans
action' from NDBCLUSTER

if the datafile is bigger than datamemory. The cluster cannot be successfully restarted theerafter.

One of the NDB nodes creates an error file:

Current byte-offset of file-pointer is: 568

Time: Tuesday 15 November 2005 - 14:01:41
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming
 error or missing error message, please report a bug)
Error: 2341
Error data: DbtupIndex.cpp
Error object: DBTUP (Line: 67) 0x0000000a
Program: /usr/local/mysql/5.1/libexec/ndbd
Pid: 20644
Trace: /data0/myndb/5.1/ndb_3_trace.log.1
Version: Version 5.1.3 (alpha)

How to repeat:
$ head /data0/dump/bug.dat
0,xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxx
1,xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxx
2,xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxx
.
.
.

(1000000 rows - approx 100M.)

mysql > CREATE TABLE bug (
        id INTEGER NOT NULL,
        val text NOT NULL)
ENGINE=ndb;

mysql > CREATE UNIQUE INDEX bug_id ON bug(id);

mysql > LOAD DATA LOCAL INFILE '/data0/dump/bug.dat'
INTO TABLE bug FIELDS TERMINATED BY ',';

ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of trans
action' from NDBCLUSTER

$ cat config.ini
# Example Ndbcluster storage engine config file.
#
[ndbd default]
NoOfReplicas= 1
MaxNoOfConcurrentOperations= 10000
DataMemory= 50M
IndexMemory= 256M
TimeBetweenWatchDogCheck= 30000
DataDir= /data0/myndb/5.1
MaxNoOfOrderedIndexes= 512

[ndb_mgmd default]
DataDir= /data0/myndb/5.1

[ndb_mgmd]
Id=1
HostName= localhost

[ndbd]
Id= 2
HostName= localhost

[ndbd]
Id= 3
HostName= localhost

[mysqld]
Id= 4

[mysqld]
Id= 5

# choose an unused port number
# in this configuration 63132, 63133, and 63134
# will be used
[tcp default]
PortNumber= 63132

Note that there are 2 fragments, so presumably 2x50M of datamemory. This is still not quite enough!

Suggested fix:
Hmm - I guess just tell me that I am out of memory and rollback the LOAD.
[15 Nov 2005 5:17] Mark Kirkwood
Fix a couple of horrible typos....sorry.
[15 Nov 2005 5:19] Mark Kirkwood
ndb error log

Attachment: ndb_2_error.log (application/octet-stream, text), 568 bytes.

[15 Nov 2005 5:19] Mark Kirkwood
other ndb error log

Attachment: ndb_3_error.log (application/octet-stream, text), 568 bytes.

[15 Nov 2005 5:20] Mark Kirkwood
ndb trace

Attachment: ndb_2_trace.log.1.gz (application/gzip, text), 30.25 KiB.

[15 Nov 2005 5:20] Mark Kirkwood
other ndb trace

Attachment: ndb_3_trace.log.1.gz (application/gzip, text), 42.11 KiB.

[6 Feb 2006 11:16] Jonas Oreland
I retested this on 5.1.6 wo/ luck :-)
I.e i could not reproduce it.

Can you try on a newer version (5.1.6)
[7 Feb 2006 7:25] Mark Kirkwood
Retested with 5.1.7 :

Get error as before:

ERROR 1297 (HY000): Got temporary error 4010 'Node failure caused abort of transaction' from NDBCLUSTER

However, cluster can be stopped and restarted ok.
So looks like it is fixed.
[14 Mar 2006 15:02] Jonas Oreland
Hi again,

I get "table full", which is correct.
4010 is not correct.

I've tried and tried and tried...(2 node 2 replica 50M datamem)
Does the following dump.dat look similar to yours?
i=1000000; while [ $i -gt 999999 ] ; do echo "$i, jonas" >> dump.dat; i=`expr $i - 1`; done

Do you have any other hint on how to reproduce this?
/Jonas
[14 Mar 2006 15:08] Jonas Oreland
(also tested with 1 replica)
[14 Apr 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[14 Apr 2006 23:13] Mark Kirkwood
Sorry Jonas - I seem to have missed that email.

I'll see if I can send you exactly what I did (I'm away on business at the moment, so it will be in the next few days!)
[14 Apr 2006 23:26] Mark Kirkwood
Quickly looking at your script - the essential differences would seem to be:

1/ Mine does the id ascending (shouldn't matter, but who knows).
2/ Mine uses a considerably longer text value (may matter).

You are creating the index too? If so, then (provided you are not sick of this bug by now :-) ), try generating a dump file that is exactly like I indicated in the first post (i.e ascending 0->999999 id, and val with length of 100 and composed entirely of 'x' chars.

Mark
[16 Apr 2006 4:17] Mark Kirkwood
Data generator

Attachment: data.pl (application/octet-stream, text), 209 bytes.

[16 Apr 2006 4:19] Mark Kirkwood
Just retested and confirmed error with 5.1.7. I've added the data generator file 'data.pl' so that you can use the same data as I did!
[16 Apr 2006 8:35] Mark Kirkwood
I notice that if I *don't create the unique index*, then I get:

ERROR 1114 (HY000): The table 'bug' is full

like you guys are getting.
[17 Apr 2006 11:31] Jonathan Miller
Time: Monday 17 April 2006 - 13:27:46
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error
or missing error message, please report a bug)
Error: 2341
Error data: dbtup/DbtupIndex.cpp
Error object: DBTUP (Line: 66) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 20300
Trace: /space/run//ndb_2_trace.log.1
Version: Version 5.1.10 (beta)
***EOM***

--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066899 gsn: 4 "ATTRINFO" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: -1 length: 23 trace: 1 #sec: 0 fragInf: 0
 H'00002f19 H'00049aae H'00500400 H'00000000 H'00000000 H'00000000 H'00000000
 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000
 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000
 H'00000000 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066898 gsn: 4 "ATTRINFO" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: -1 length: 25 trace: 1 #sec: 0 fragInf: 0
 H'00002f19 H'00049aae H'00500400 H'78787878 H'78787878 H'00787878 H'00000000
 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000
 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000 H'00000000
 H'00000000 H'00000000 H'00000000 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066897 gsn: 4 "ATTRINFO" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: -1 length: 25 trace: 1 #sec: 0 fragInf: 0
 H'00002f19 H'00049aae H'00500400 H'78787878 H'78787878 H'78787878 H'78787878
 H'78787878 H'78787878 H'78787878 H'78787878 H'78787878 H'78787878 H'78787878
 H'78787878 H'78787878 H'78787878 H'78787878 H'78787878 H'78787878 H'78787878
 H'78787878 H'78787878 H'78787878 H'78787878
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066896 gsn: 316 "LQHKEYREQ" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: -1 length: 19 trace: 1 #sec: 0 fragInf: 0
 ClientPtr = H'00002f19 hashValue = H'bbc10e8b tcBlockRef = H'00f50003
 transId1 = H'00049aae transId2 = H'00500400 savePointId = H'00000000
 Op: 2 Lock: 0 Flags: CommitAckMarker NoDisk ScanInfo/noFiredTriggers: H'24d4f
 AttrLen: 69 (5 in this) KeyLen: 1 TableId: 7 SchemaVer: 1
 FragId: 0 ReplicaNo: 0 LastReplica: 0 NextNodeId: 0
 ApiRef: H'80050004 ApiOpRef: H'00000014
 KeyInfo: H'00049a9e
 AttrInfo: H'00000004 H'00049a9e H'00010108 H'00000063 H'00000000
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066895 gsn: 342 "PACKED_SIGNAL" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: 5450203 length: 3 trace: 1 #sec: 0 fragInf: 0
Signal data: H'50000000 H'00049aac H'00500400
--------- Begin Packed Signals --------
--------------- Signal ----------------
r.bn: 247 "DBLQH", length: 2 "REMOVE_MARKER"
Signal data: H'00049aac H'00500400
--------- End Packed Signals ----------
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066894 gsn: 342 "PACKED_SIGNAL" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: -1 length: 3 trace: 1 #sec: 0 fragInf: 0
Signal data: H'10000003 H'00049aac H'00500400
--------- Begin Packed Signals --------
--------------- Signal ----------------
r.bn: 247 "DBLQH", length: 3 "COMPLETE"
Signal data: H'10000003 H'00049aac H'00500400
--------- End Packed Signals ----------
--------------- Signal ----------------
r.bn: 247 "DBLQH", r.proc: 2, r.sigId: 3066893 gsn: 342 "PACKED_SIGNAL" prio: 1
s.bn: 245 "DBTC", s.proc: 3, s.sigId: -1 length: 4 trace: 1 #sec: 0 fragInf: 0
Signal data: H'00000003 H'0000009c H'00049aac H'00500400
--------- Begin Packed Signals --------
--------------- Signal ----------------
r.bn: 247 "DBLQH", length: 4 "COMMIT"
Signal data: H'00000003 H'0000009c H'00049aac H'00500400
--------- End Packed Signals ----------
--------------- Signal ----------------
[19 Jun 2006 8:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/7827
[19 Jun 2006 9:08] Pekka Nousiainen
Just trivial mismatch of error code TUP -> TUX.
This fixes the DbtupIndex.cpp line 96/97 crash.

On my second test I got this crash on db full:

Error data: dbtup/DbtupCommit.cpp
Error object: DBTUP (Line: 55) 0x0000000e

Keep this bug# open and change title.
[19 Jun 2006 9:11] Pekka Nousiainen
trace

Attachment: ndb_21_trace.log.1.gz (application/gzip, text), 67.80 KiB.

[19 Jul 2006 12:50] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9334
[21 Jul 2006 15:13] Mikael Ronström
Sent review comments via email
[26 Jul 2006 18:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/9617
[28 Jul 2006 11:29] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://www.mysql.com/doc/en/Installing_source_tree.html

Bugfix documented in 5.1.13 changelog.
[28 Jul 2006 19:52] Paul DuBois
There is no 5.1.13 yet. Moving changelog entry to 5.1.12.