Bug #34072 Multi-threaded update of on-disk BLOB data creates inconsistencies/crash
Submitted: 25 Jan 2008 22:51
Modified: 19 Dec 2008 18:22
Reporter: Jeff Wang
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1.22
OS: Any
Assigned to: Jonas Oreland
CPU Architecture: Any

[25 Jan 2008 22:51] Jeff Wang
Description:
I have multiple threads updating on-disk TEXT data.  Occasionally the data becomes inconsistent: I can SELECT a row by primary key, but a subsequent UPDATE or DELETE of the same row fails with an error like:

"Can't find record in 'dt_1' key= 2070"

How to repeat:
Set up a 3-node architecture with 2 data nodes.  The master config is as follows:

------
# Options affecting ndbd processes on all data nodes:
[ndbd default]
NoOfReplicas=2      # Number of replicas
DataMemory=1500M    # How much memory to allocate for data storage
IndexMemory=150M    # How much memory to allocate for index storage
StringMemory=99     #expressed as a percentage, 100%=5 MB, values > 99 interpreted as bytes
NoOfFragmentLogFiles=64 #increase this number if the number of inserts/updates is large, this should be ok for 200k/hr
RedoBuffer=32M
MaxNoOfAttributes=40000
MaxNoOfTables=1600
MaxNoOfOrderedIndexes=3000

#amount of time to elapse before aborting the transaction
#and assuming deadlock
TransactionDeadlockDetectionTimeout=10000 #in ms

#amount of time between operation in the same transaction
#0 indicates no timeout
#units in ms
#TransactionInactiveTimeout=0

MaxNoOfConcurrentOperations=500000

# TCP/IP options:
[tcp default]
#portnumber=2202               # This is the default; however, you can use any
                               # port that is free for all the hosts in the cluster
                               # Note: It is recommended that you do not specify the
                               # portnumber at all and allow the default value to be
                               # used instead
SendBufferMemory=2M
ReceiveBufferMemory=1M
Checksum=1                        #detect corrupted messages

*cut rest of config for brevity, as there is nothing special in the other sections*
--------

Steps:

1) Create an on-disk data table:

   CREATE TABLE dt_1 (
       row_key  VARCHAR(75),
       data     TEXT NOT NULL,
       PRIMARY KEY (row_key)
   )
   TABLESPACE ts_1 STORAGE DISK
   ENGINE NDB;

2) Insert data into the table.

3) Create a script that has multiple threads updating data in the table.
4) Turn autocommit off and commit manually after each update.
5) Run the script and stop it at a random point after 10-30 seconds.

You should see the "Can't find record" error after several runs (up to 15).  Once the
error has appeared, keep re-running the script that produces it and the data nodes will crash.
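
For readers without the attachment, steps 3-5 can be sketched roughly as follows. This is not the attached CrashTest.java, only a minimal illustration under stated assumptions: the JDBC URL, credentials, thread count, row count, payload size, and the decimal row_key scheme are all hypothetical, and MySQL Connector/J is assumed to be on the classpath. Halting the JVM is one guess at what "stopping the script" looked like in practice.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

public class UpdateStress {
    // All constants below are hypothetical; adjust them to your cluster.
    static final String URL = "jdbc:mysql://localhost:3306/test";
    static final String USER = "root";
    static final String PASSWORD = "";
    static final int THREADS = 8;
    static final int ROWS = 5000;

    // Assumed key scheme: row_key is the decimal row number, e.g. "2070".
    static String keyFor(int row) {
        return Integer.toString(row);
    }

    // Builds a TEXT payload of the requested length.
    static String payload(int length) {
        char[] buf = new char[length];
        Arrays.fill(buf, 'x');
        return new String(buf);
    }

    // One updater thread: its own connection, autocommit off, commit per update.
    static void updateLoop() {
        try (Connection conn = DriverManager.getConnection(URL, USER, PASSWORD);
             PreparedStatement ps = conn.prepareStatement(
                     "UPDATE dt_1 SET data = ? WHERE row_key = ?")) {
            conn.setAutoCommit(false);
            ThreadLocalRandom rnd = ThreadLocalRandom.current();
            while (!Thread.currentThread().isInterrupted()) {
                String key = keyFor(rnd.nextInt(ROWS));
                try {
                    ps.setString(1, payload(1 + rnd.nextInt(20000)));
                    ps.setString(2, key);
                    ps.executeUpdate();
                    conn.commit();
                } catch (SQLException e) {
                    // This is where "Can't find record in 'dt_1'" would surface.
                    System.err.println("update failed for key " + key + ": " + e.getMessage());
                    conn.rollback();
                }
            }
        } catch (SQLException e) {
            System.err.println("worker stopped: " + e.getMessage());
        }
    }

    public static void main(String[] args) throws InterruptedException {
        for (int i = 0; i < THREADS; i++) {
            Thread t = new Thread(UpdateStress::updateLoop);
            t.setDaemon(true);
            t.start();
        }
        // Step 5: stop at a random point between 10 and 30 seconds.
        Thread.sleep(ThreadLocalRandom.current().nextLong(10_000, 30_001));
        // halt() skips cleanup, roughly like killing the script mid-transaction.
        Runtime.getRuntime().halt(0);
    }
}
```

Each thread uses its own connection because a JDBC connection carries a single transaction; sharing one across threads would blur the per-update commit described in step 4.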
[25 Jan 2008 22:53] Jeff Wang
Java class that crashes nodes

Attachment: CrashTest.java (text), 4.66 KiB.

[25 Jan 2008 22:55] Jeff Wang
I've attached a Java class that inserts and updates content.  Change the connection string in the Java file to point at your MySQL server.

Then, run "java CrashTest insert" to insert data.

Then run "java CrashTest update" to start updating.  Stop the script at 10-30 second intervals and re-run it.  After several runs, errors should occur.
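
Once errors start to occur, a check along the following lines could help identify which keys are affected, i.e. rows that a SELECT by primary key finds but that cannot be modified. This is a hedged sketch, not part of the attachment: the class and method names are hypothetical, and using a DELETE that is always rolled back as the write probe is an assumption about what exercises the failing path without changing data.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class ConsistencyCheck {
    enum Verdict { OK, ABSENT, INCONSISTENT }

    // A row that can be read by PK but not written is the reported symptom.
    static Verdict classify(boolean readable, boolean writable) {
        if (!readable) return Verdict.ABSENT;
        return writable ? Verdict.OK : Verdict.INCONSISTENT;
    }

    // Probes one key: SELECT by PK, then a DELETE that is always rolled back.
    static Verdict check(Connection conn, String key) throws SQLException {
        conn.setAutoCommit(false);
        boolean readable;
        try (PreparedStatement sel = conn.prepareStatement(
                "SELECT 1 FROM dt_1 WHERE row_key = ?")) {
            sel.setString(1, key);
            try (ResultSet rs = sel.executeQuery()) {
                readable = rs.next();
            }
        }
        boolean writable;
        try (PreparedStatement del = conn.prepareStatement(
                "DELETE FROM dt_1 WHERE row_key = ?")) {
            del.setString(1, key);
            writable = del.executeUpdate() == 1;
        } catch (SQLException e) {
            // e.g. "Can't find record in 'dt_1'"
            writable = false;
        } finally {
            conn.rollback(); // never keep the probe's delete
        }
        return classify(readable, writable);
    }
}
```

Calling check(conn, "2070") on an open connection would return INCONSISTENT for a key in the state described in the report, assuming the rolled-back DELETE fails the same way a real one does.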
[26 Jan 2008 8:02] Jonas Oreland
Hi,

We fixed a blob inconsistency matching your description (without a bug report :( ),
which will be in 5.1.23.

But we never (as far as I know) got this to crash.

Can you 
1) upload tracefiles (if it's ndbd crashing)
2) test your program on 5.1.23 (when it's released if not already)

/jonas
[28 Jan 2008 18:01] Jeff Wang
trace log for node crash

Attachment: trace.log (text), 126.50 KiB.

[28 Jan 2008 18:02] Jeff Wang
Hi,

I've submitted the end of my trace log for the data node crash.  The whole log was too large to submit (1.9 MB).  The error log said:

                                                                               
Time: Monday 28 January 2008 - 09:54:15
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dblqh/DblqhMain.cpp
Error object: DBLQH (Line: 6950) 0x0000000a
Program: ndbd
Pid: 7641
Trace: /Users/dbsp/work/cluster/data/ndb_2_trace.log.19
Version: Version 5.1.22 (rc)
***EOM***
                                                                      

Version 5.1.23 is not available yet, but I'll test it when it comes out.

thanks
[13 Aug 2008 21:00] Jonas Oreland
Did you retest this on a newer version (maybe 6.2 or 6.3),
or did you give up?

/jonas
[13 Aug 2008 23:48] Jeff Wang
We gave up on this for the moment, as it was a little too unstable for full production use. As I recall, I did try it on 5.1.24 and the problem didn't appear... but that was so many months ago that I can't be sure.