Bug #49696 Error 6000 with multithreaded ndbmtd
Submitted: 14 Dec 2009 21:14 Modified: 8 Jan 2010 21:07
Reporter: Troy A Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:7.0.9 OS:Linux
Assigned to: Jonas Oreland CPU Architecture:Any
Tags: 6000, Abort, ndbmtd, SIBABRT, threaded

[14 Dec 2009 21:14] Troy A
Description:
We have 4 physical servers, 2 nodegroups, 2 replicas in each nodegroup.  On a specific usage pattern, one of the data nodes will abort(), and cause the cluster to shut down.  This only happens with the multithreaded daemon.

How to repeat:
With the first query potentially matching tens of thousands of rows without the LIMIT, The suspected breakage pattern is:

BEGIN
SELECT * FROM table1 WHERE foo IS NOT NULL LIMIT 1000 FOR UPDATE
SELECT COUNT(*) FROM table1 where foo IS NOT NULL
(loop in code)
DELETE FROM table1 WHERE id=?
(end loop)
COMMIT

Suggested fix:
Workaround: use ndbd instead of ndbmtd.
[14 Dec 2009 21:23] Troy A
relevant mysql traces from all four data nodes

Attachment: mysql-traces.tar.lzma (application/x-lzma, text), 341.77 KiB.

[14 Dec 2009 22:08] Jonas Oreland
Hi Troy,

I couldnt find any cluster log,
and error logs, nor any ndb_X_out.log files

Can you attach them too ?
[14 Dec 2009 22:25] Troy A
Relevant error log from the four nodes

Time: Friday 11 December 2009 - 17:14:39
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing 
error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: ndbd.cpp
Program: /usr/libexec/ndbmtd
Pid: 31689 thr: 0
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_1_trace.log.17 /var/lib/mysql-cluster/ndb_1_trace.log.17_t1 /var/lib/mysql-cluster/ndb_1_trace.log.17_t2 /var/lib/m

Time: Friday 11 December 2009 - 17:14:40
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming
 error or missing error message, please report a bug)
Error: 2341
Error data: SimulatedBlock.cpp
Error object: CMVMI (Line: 512) 0x0000000e
Program: /usr/libexec/ndbmtd
Pid: 4837 thr: 3
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_2_trace.log.15 /var/lib/mysql-cluster/ndb_2_trace.log.15_t1 /var/lib/mysql-cluster/ndb

Time: Friday 11 December 2009 - 17:14:44
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cl
uster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000e
Program: /usr/libexec/ndbmtd
Pid: 20143 thr: 0
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.18 /var/lib/mysql-cl

Time: Friday 11 December 2009 - 17:14:42
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000e
Program: /usr/libexec/ndbmtd
Pid: 9792 thr: 0
Version: mysql-5.1.39 ndb-7.0.9b
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.20 /var/lib/mysql-clu
[14 Dec 2009 22:29] Troy A
out.log from each data node

Attachment: out.log.tar.gz (application/x-gzip, text), 220.36 KiB.

[14 Dec 2009 22:30] Troy A
Apologies in advance if I still missed any files that are required.  Thanks for taking a look.
[14 Dec 2009 22:36] Jonas Oreland
Hi,

It looks like a duplicate of http://bugs.mysql.com/bug.php?id=49589 and http://bugs.mysql.com/bug.php?id=48852

But...more files are need to be sure
ndb_X_error.log,
ndb_Y_cluster.log*
config.ini

/Jonas
[14 Dec 2009 22:49] Troy A
Remainder of logs

Attachment: error_cluster_logs.tar.gz (application/x-gzip, text), 478.84 KiB.

[14 Dec 2009 22:51] Troy A
config.ini

Attachment: config.ini (application/octet-stream, text), 1.78 KiB.

[14 Dec 2009 22:57] Troy A
Well, it seems http://bugs.mysql.com/bug.php?id=48852 is just related to how it reports the error, so it wouldn't be causal.  I believe earlier Error 6000s on one node even caused different messages in our error.logs on the corresponding node, such as:

Error: 2303
Error data: Node 2 killed this node because GCP stop was detected

The other one, http://bugs.mysql.com/bug.php?id=49589 seems plausible to me.

Thanks again!
[14 Dec 2009 23:02] Jonas Oreland
Hi again,

1) yes, 48852 is only how the error is reported...but if it was reported
correctly you probably would have increased the size of your send buffers.

2) the out files are filled with "0 - sleeploop 10!!"
  which is very very bad...So what kind of load are you running ??
  would it be possible to somehow try to mimic it?

3) you still havent uploaded the config.ini, right ?

/Jonas
[14 Dec 2009 23:14] Troy A
1) Makes sense.  We did see that bug before and increased sendbuffers to see if it fixed the problem.  We didn't see any change so we went back to defaults.

2) We have mogilefs running against the cluster, as well as our own app.  It's not in full production quite yet so we're at liberty to switch back to mt to test.  The usage from our app shouldn't be anything special.  I was only using two connections (to two databases) from the app, and looping through the statements given in the original bug description seemed to be the only trigger for the crash, as mogilefs should have been relatively idle.  I did have some large transactions that failed previously, but I've rewritten the app to use smaller batches when possible.

3) config.ini was uploaded separately above.

Thanks.
[21 Dec 2009 12:25] Jonas Oreland
This could also maybe be http://bugs.mysql.com/bug.php?id=49589,
which now has a patch attached...

Can you test that (together with increased buffers) and see it it helps
your case

/Jonas
[8 Jan 2010 21:07] Troy A
Sorry for the delay.

I wanted to report that with the patch applied, we have not been able to reproduce the crashes.

Thanks for your time, Jonas.

-Troy