MySQL Bugs: #38543: NDB Java Bindings stability issues.

Bug #38543	NDB Java Bindings stability issues.
Submitted:	4 Aug 2008 15:37	Modified:	21 Sep 2009 9:37
Reporter:	Anton Bobrov
Status:	Won't fix
Category:	Connectors: NDB/Bindings	Severity:	S1 (Critical)
Version:	0.7.0	OS:	Linux (RHEL 5.1 (Tikanga) / 2.6.18.53.el5)
Assigned to:	Monty Taylor	CPU Architecture:	Any

Description:
The NDB Java Bindings are extremely unstable under heavy usage, e.g. multiple
read operations across several tables within a single transaction. The SWIG
layer appears to be directly responsible, through memory corruption caused
by the native code. Running with a debug memory allocator reveals repeatable
errors such as this:
java(15216,0x8eb800) malloc: *** error for object 0x16c9a0: incorrect checksum for freed object - object was probably modified after being freed.
This eventually results in a SEGV somewhere in the NDB native library, although
the cause is very likely invalid references passed to it by the SWIG layer.

Some of the most common backtraces:

*** glibc detected *** /usr/java/jdk1.5.0_16/bin/java: free(): invalid next size (fast): 0x0000000060c35220 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3a5da6f444]
/lib64/libc.so.6(cfree+0x8c)[0x3a5da72a6c]
/usr/local/mysql-5.1.24-ndb-6.2.14-linux-x86_64/lib/mysql/libndbj.so.0.0.0(Java_com_mysql_cluster_ndbj_ndbjJNI_NdbRecAttrImpl_1getString+0xb1)[0x2aaac088b4ef]
[0x2aaaaec98fd4]

*** glibc detected *** /usr/java/jdk1.5.0_16/bin/java: corrupted double-linked list: 0x00000000522e7430 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3a5da6da07]
/lib64/libc.so.6[0x3a5da6fa42]
/lib64/libc.so.6(__libc_malloc+0x7d)[0x3a5da70c9d]
/usr/java/jdk1.5.0_16/jre/lib/amd64/server/libjvm.so[0x2aaaab02bc78]
/usr/java/jdk1.5.0_16/jre/lib/amd64/server/libjvm.so[0x2aaaab1135fe]
[0x2aaaaea58522]

*** glibc detected *** /usr/java/jdk1.5.0_16/bin/java: double free or corruption (!prev): 0x000000004bbdc530 ***
======= Backtrace: =========
/lib64/libc.so.6[0x3a5da6f444]
/lib64/libc.so.6(cfree+0x8c)[0x3a5da72a6c]
/usr/local/mysql/lib/mysql/libndbclient.so.4(_ZN7NdbImplD1Ev+0xa4)[0x2aaac0b499c4]
/usr/local/mysql/lib/mysql/libndbclient.so.4(_ZN3NdbD1Ev+0x127)[0x2aaac0b49d97]
/usr/local/mysql-5.1.24-ndb-6.2.14-linux-x86_64/lib/mysql/libndbj.so.0.0.0[0x2aaac0882a54]
/usr/local/mysql-5.1.24-ndb-6.2.14-linux-x86_64/lib/mysql/libndbj.so.0.0.0(Java_com_mysql_cluster_ndbj_ndbjJNI_NdbImpl_1close+0x56)[0x2aaac08a9330]
[0x2aaaaea58522]

How to repeat:
It's very difficult to isolate this into a separate test case, but
I can set up a test system in our lab and provide access to it if
needed.

We're looking into it right now. I'm guessing we're just not incrementing the refcount of a JNI-created object properly somewhere... those get really fun to find. :)
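
To make the guess above concrete, here is a minimal pure-Java sketch of that class of bug. It is a toy model only: `FakeNativeHeap`, `RefCountedOwner`, `Borrowed` and `LifetimeDemo` are hypothetical names invented for illustration, they are not part of NDB/J, SWIG or JNI, and nothing here claims to be the actual defect. The fake heap reports a stale handle with an exception, whereas real native code would silently corrupt memory instead.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for native memory: maps a handle to a value.
// Unlike real native memory, it detects use-after-free and double free.
final class FakeNativeHeap {
    private final Map<Long, String> live = new HashMap<>();
    private long next = 1;

    long alloc(String value) {
        long handle = next++;
        live.put(handle, value);
        return handle;
    }

    void free(long handle) {
        if (live.remove(handle) == null) {
            throw new IllegalStateException("double free of handle " + handle);
        }
    }

    String read(long handle) {
        String value = live.get(handle);
        if (value == null) {
            throw new IllegalStateException("use after free of handle " + handle);
        }
        return value;
    }
}

// Owner whose native memory is released only when the last reference goes away.
final class RefCountedOwner {
    private final FakeNativeHeap heap;
    private final long handle;
    private int refs = 1; // the owner's own reference

    RefCountedOwner(FakeNativeHeap heap, String value) {
        this.heap = heap;
        this.handle = heap.alloc(value);
    }

    synchronized Borrowed borrow() {
        if (refs == 0) throw new IllegalStateException("owner already released");
        refs++; // the increment that the comment above suspects is missing somewhere
        return new Borrowed(this);
    }

    synchronized void release() {
        if (refs == 0) throw new IllegalStateException("released too many times");
        if (--refs == 0) heap.free(handle);
    }

    String read() {
        return heap.read(handle);
    }
}

// Object handed out to callers; it keeps its owner's memory alive until closed.
final class Borrowed {
    private final RefCountedOwner owner;

    Borrowed(RefCountedOwner owner) { this.owner = owner; }

    String get() { return owner.read(); }

    void close() { owner.release(); }
}

final class LifetimeDemo {
    // No reference counting: memory is freed, then read through a stale handle.
    static boolean staleReadDetected() {
        FakeNativeHeap heap = new FakeNativeHeap();
        long handle = heap.alloc("row");
        heap.free(handle);
        try {
            heap.read(handle);
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    // With reference counting: the owner is released first, the borrowed value stays valid.
    static String safeRead() {
        FakeNativeHeap heap = new FakeNativeHeap();
        RefCountedOwner owner = new RefCountedOwner(heap, "row");
        Borrowed borrowed = owner.borrow();
        owner.release();               // memory survives, borrowed still holds a reference
        String value = borrowed.get();
        borrowed.close();              // last reference dropped, memory freed here
        return value;
    }

    public static void main(String[] args) {
        System.out.println("stale read detected: " + staleReadDetected());
        System.out.println("safe read: " + safeRead());
    }
}
```

If a missed increment like the one in `borrow()` is indeed the problem, the symptoms would be of the same kind as the free() errors in the backtraces: memory released while something still points at it.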

Just to give a better idea of what the application is doing, which
could perhaps help with finding the root cause(s), I will describe
one particular use case that produced the abort() backtraces mentioned
above [ excuse my lame pseudo code here ]:

start txn;

txn select index scan;

txn execute no commit;

for each result from select index scan {
  get single index scan result;
  // number of operations depending
  // on result values returned above.
  txn get select operation on table1;
  txn get select operation on table2;
  txn get select operation on table3;
  txn execute no commit;
  get select/s result/s;
}

So all of that is done under a single txn, and the index scan results
are obtained one by one [ this is basically a cursor-like thing
that can walk the entire database ], so we are constantly
creating new select operations and executing them without commit,
as well as getting and processing their results, while the
results of the very first operation [ index scan ] are still
open and being referenced/processed.
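
For anyone trying to build a standalone reproducer, the shape of that loop can be sketched in Java as follows. This is only a model of the access pattern: `FakeTxn` and `ScanWorkload` are hypothetical in-memory stand-ins written for this sketch, not NDB/J classes, and the table names are taken from the pseudo code above. A real test would replace them with the corresponding bindings calls.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical in-memory stand-in for a transaction; not the NDB/J API.
final class FakeTxn {
    int pendingOps = 0;    // operations defined but not yet executed
    int executedOps = 0;   // operations sent in some execute call
    int executeCalls = 0;  // number of "execute no commit" round trips
    boolean closed = false;

    void defineRead(String table, int key) {
        checkOpen();
        pendingOps++;
    }

    void executeNoCommit() {
        checkOpen();
        executedOps += pendingOps;
        pendingOps = 0;
        executeCalls++;
    }

    void close() {
        closed = true;
    }

    private void checkOpen() {
        if (closed) throw new IllegalStateException("transaction closed");
    }
}

final class ScanWorkload {
    // scanKeys stands in for the rows returned by the index scan.
    static FakeTxn run(List<Integer> scanKeys) {
        FakeTxn txn = new FakeTxn();

        // txn select index scan; txn execute no commit;
        txn.defineRead("index_scan", 0);
        txn.executeNoCommit();

        // The scan cursor stays open while the per-row reads run in the same txn.
        for (int key : scanKeys) {
            txn.defineRead("table1", key);
            txn.defineRead("table2", key);
            txn.defineRead("table3", key);
            txn.executeNoCommit();
            // the results of the three reads would be fetched and processed here
        }

        txn.close();
        return txn;
    }

    public static void main(String[] args) {
        FakeTxn txn = run(Arrays.asList(10, 20, 30, 40));
        System.out.println("execute calls: " + txn.executeCalls
                + ", executed operations: " + txn.executedOps);
    }
}
```

With four scan rows the single transaction sees five execute calls and thirteen operations, all before anything is committed or closed, which is the property the use case above depends on.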

hs_err_pid1753 shows another crash, although it's not related to the
previous ones. This time no memory allocator errors were seen before the
crash, and the application activity consisted of bulk inserts into a
bunch of tables: multiple inserts under multiple transactions, where one
txn does multiple inserts, then executes and closes, then another
txn starts, and so on. It crashed after roughly 4k txn execs.