MySQL Bugs: #49696: Error 6000 with multithreaded ndbmtd

Bug #49696	Error 6000 with multithreaded ndbmtd
Submitted:	14 Dec 2009 21:14	Modified:	8 Jan 2010 21:07
Reporter:	Troy A	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.0.9	OS:	Linux
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	6000, Abort, ndbmtd, SIBABRT, threaded

Description:
We have 4 physical servers, 2 nodegroups, 2 replicas in each nodegroup.  On a specific usage pattern, one of the data nodes will abort(), and cause the cluster to shut down.  This only happens with the multithreaded daemon.

How to repeat:
With the first query potentially matching tens of thousands of rows without the LIMIT, The suspected breakage pattern is:

BEGIN
SELECT * FROM table1 WHERE foo IS NOT NULL LIMIT 1000 FOR UPDATE
SELECT COUNT(*) FROM table1 where foo IS NOT NULL
(loop in code)
DELETE FROM table1 WHERE id=?
(end loop)
COMMIT

Suggested fix:
Workaround: use ndbd instead of ndbmtd.

relevant mysql traces from all four data nodes

Attachment: mysql-traces.tar.lzma (application/x-lzma, text), 341.77 KiB.

Hi Troy,

I couldnt find any cluster log,
and error logs, nor any ndb_X_out.log files

Can you attach them too ?

Relevant error log from the four nodes

Time: Friday 11 December 2009 - 17:14:39
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing 
error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: ndbd.cpp
Program: /usr/libexec/ndbmtd
Pid: 31689 thr: 0
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_1_trace.log.17 /var/lib/mysql-cluster/ndb_1_trace.log.17_t1 /var/lib/mysql-cluster/ndb_1_trace.log.17_t2 /var/lib/m

Time: Friday 11 December 2009 - 17:14:40
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming
 error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: CMVMI (Line: 512) 0x0000000e
Program: /usr/libexec/ndbmtd
Pid: 4837 thr: 3
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.15 /var/lib/mysql-cluster/ndb_2_trace.log.15_t1 /var/lib/mysql-cluster/ndb

Time: Friday 11 December 2009 - 17:14:44
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cl
uster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000e
Program: /usr/libexec/ndbmtd
Pid: 20143 thr: 0
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.18 /var/lib/mysql-cl

Time: Friday 11 December 2009 - 17:14:42
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000e
Program: /usr/libexec/ndbmtd
Pid: 9792 thr: 0
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.20 /var/lib/mysql-clu

out.log from each data node

Attachment: out.log.tar.gz (application/x-gzip, text), 220.36 KiB.

Apologies in advance if I still missed any files that are required.  Thanks for taking a look.

Hi,

It looks like a duplicate of http://bugs.mysql.com/bug.php?id=49589 and http://bugs.mysql.com/bug.php?id=48852

But...more files are need to be sure
ndb_X_error.log,
ndb_Y_cluster.log*
config.ini

/Jonas

Remainder of logs

Attachment: error_cluster_logs.tar.gz (application/x-gzip, text), 478.84 KiB.

config.ini

Attachment: config.ini (application/octet-stream, text), 1.78 KiB.

Well, it seems http://bugs.mysql.com/bug.php?id=48852 is just related to how it reports the error, so it wouldn't be causal.  I believe earlier Error 6000s on one node even caused different messages in our error.logs on the corresponding node, such as:

Error: 2303
Error data: Node 2 killed this node because GCP stop was detected

The other one, http://bugs.mysql.com/bug.php?id=49589 seems plausible to me.

Thanks again!

Hi again,

1) yes, 48852 is only how the error is reported...but if it was reported
correctly you probably would have increased the size of your send buffers.

2) the out files are filled with "0 - sleeploop 10!!"
  which is very very bad...So what kind of load are you running ??
  would it be possible to somehow try to mimic it?

3) you still havent uploaded the config.ini, right ?

/Jonas

1) Makes sense.  We did see that bug before and increased sendbuffers to see if it fixed the problem.  We didn't see any change so we went back to defaults.

2) We have mogilefs running against the cluster, as well as our own app.  It's not in full production quite yet so we're at liberty to switch back to mt to test.  The usage from our app shouldn't be anything special.  I was only using two connections (to two databases) from the app, and looping through the statements given in the original bug description seemed to be the only trigger for the crash, as mogilefs should have been relatively idle.  I did have some large transactions that failed previously, but I've rewritten the app to use smaller batches when possible.

3) config.ini was uploaded separately above.

Thanks.

This could also maybe be http://bugs.mysql.com/bug.php?id=49589,
which now has a patch attached...

Can you test that (together with increased buffers) and see it it helps
your case

/Jonas

Sorry for the delay.

I wanted to report that with the patch applied, we have not been able to reproduce the crashes.

Thanks for your time, Jonas.

-Troy