MySQL Bugs: #65084: 7.2.4 and 7.2.5 crash in DBLQH under mild load

Bug #65084	7.2.4 and 7.2.5 crash in DBLQH under mild load
Submitted:	24 Apr 2012 6:07	Modified:	22 Jul 2012 4:28
Reporter:	Chris Miller
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.2.4, 7.2.5, 7.2.6	OS:	Linux (CentOS 6.2)
Assigned to:	Santo Leto	CPU Architecture:	Any

Description:
We recently implemented two MySQL clusters (dev and prod), and both suffer from crashes under mild load. All nodes are installed under CentOS 6.2 on VMware virtual servers. The application nodes run Drupal 6 and sit behind Cisco load balancers that each issue a "ping" HTTP request every 5 seconds. This results in approximately 200 queries per second from each application node.

Dev Cluster :

3 application nodes with 4GB memory each, 2 virtual cores
2 data nodes with 12GB memory each, 4 virtual cores
1 management node with 1GB memory

Production Cluster :

4 application nodes with 4GB memory each, 2 virtual cores
2 data nodes with 12GB memory each, 4 virtual cores
1 management node with 1GB memory

Both clusters ran fine under 7.2.4 and 7.2.5 with no load. Once the application nodes were put behind the load balancers and subjected to HTTP pings, the data nodes would crash in less than one hour.

The error is consistently as follows :

2012-04-23 18:05:35 [ndbd] INFO -- /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2012-04-23 18:05:35 [ndbd] INFO -- DBLQH (Line: 9764) 0x00000002
2012-04-23 18:05:35 [ndbd] INFO -- Error handler shutting down system
2012-04-23 18:05:35 [ndbd] INFO -- Error handler shutdown completed - exiting
2012-04-23 18:05:35 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
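
For anyone comparing crash signatures across nodes or against the related reports, the fields that matter in this output are the kernel block, the source line, the node ID, and the error code. The following is a minimal sketch that pulls them out of ndbd log text. The regular expressions are assumptions based only on the two line shapes shown above, and the name summarize_crash is hypothetical, not part of any MySQL tool.

```python
import re

# Assumed pattern for the "DBLQH (Line: 9764) 0x00000002" line shown above.
BLOCK_RE = re.compile(r"(?P<block>[A-Z]+) \(Line: (?P<line>\d+)\) 0x[0-9a-fA-F]+")
# Assumed pattern for the "Forced node shutdown completed" ALERT line shown above.
SHUTDOWN_RE = re.compile(
    r"Node (?P<node>\d+): Forced node shutdown completed\. "
    r"Caused by error (?P<error>\d+)"
)

SAMPLE = """\
2012-04-23 18:05:35 [ndbd] INFO -- DBLQH (Line: 9764) 0x00000002
2012-04-23 18:05:35 [ndbd] INFO -- Error handler shutting down system
2012-04-23 18:05:35 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
"""


def summarize_crash(log_text):
    """Return the block, source line, node ID, and error code found in log_text."""
    summary = {}
    for line in log_text.splitlines():
        block_match = BLOCK_RE.search(line)
        if block_match:
            summary["block"] = block_match.group("block")
            summary["line"] = int(block_match.group("line"))
        shutdown_match = SHUTDOWN_RE.search(line)
        if shutdown_match:
            summary["node"] = int(shutdown_match.group("node"))
            summary["error"] = int(shutdown_match.group("error"))
    return summary


if __name__ == "__main__":
    print(summarize_crash(SAMPLE))
```

The source line number shifts between 7.2.x releases, so the block name and error code are the more stable parts of the signature to compare.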

All nodes were originally installed via RPM, and config.ini was hand-tuned. As a test, one cluster was completely reinstalled using the Severalnines configurator and installed with its defaults. The same issue persists.

We have analyzed the data node logs to look for any signs of resource constraints, but have found nothing conclusive.

How to repeat:
1. Start cluster
2. Subject node to 400-600 queries per second
3. Wait 20-60 minutes
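
The steps above do not name a load tool; in our setup the load came from the load balancer pings hitting Drupal. As a rough sketch only, a paced driver like the one below could approximate step 2. The function drive_load and its issue_query callback are hypothetical: the callback stands in for whatever actually sends one query or HTTP request to the cluster, and nothing here is confirmed to reproduce the crash.

```python
import time


def drive_load(issue_query, rate_per_sec, duration_sec,
               clock=time.monotonic, sleep=time.sleep):
    """Call issue_query() at roughly rate_per_sec for duration_sec seconds.

    issue_query is a caller-supplied function (hypothetical here) that
    performs one query. clock and sleep are injectable so the pacing can
    be exercised without waiting in real time.
    """
    total = int(rate_per_sec * duration_sec)
    start = clock()
    for i in range(total):
        # Each call is scheduled at a fixed offset from the start time,
        # so a slow call does not push every later call back.
        due = start + i / rate_per_sec
        delay = due - clock()
        if delay > 0:
            sleep(delay)
        issue_query()
    return total
```

For example, drive_load(send_one_query, 500, 3600) would attempt about 500 calls per second for one hour, which falls inside the 400-600 queries per second and 20-60 minute ranges given above. send_one_query is a placeholder name.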

Suggested fix:
None

Error Report files provided to MySQL Team under private comment.

We ran ndbd with --core-file. 

#0  0x00007fb3fabdf885 in raise () from /lib64/libc.so.6
#1  0x00007fb3fabe1065 in abort () from /lib64/libc.so.6
#2  0x0000000000431330 in childAbort (error_code=<value optimized out>,
    exit_code=<value optimized out>, currentStartPhase=255)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/ndbd.cpp:391
#3  0x00000000004316d7 in NdbShutdown (error_code=2341, type=NST_ErrorHandler,
    restartType=NRT_Default)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/ndbd.cpp:861
#4  0x0000000000664fc7 in ErrorReporter::handleError (messageID=2341,
    problemData=0x70a8d0 "/pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp", objRef=0x7fff7948e170 "DBLQH (Line: 9764) 0x00000006",
    nst=NST_ErrorHandler)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/error/ErrorReporter.cpp:256
#5  0x00000000006de725 in SimulatedBlock::progError (
    this=<value optimized out>, line=9764, err_code=2341,
    extra=0x70a8d0 "/pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp")
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/vm/SimulatedBlock.cpp:1817
#6  0x000000000053c406 in Dblqh::execSCAN_NEXTREQ (this=0x2231e70,
    signal=0xc42180)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp:9764
#7  0x00000000006dcb91 in executeFunction (this=0xc4bbc0)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/vm/SimulatedBlock.hpp:1037
#8  FastScheduler::doJob (this=0xc4bbc0)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/vm/FastScheduler.cpp:136
#9  0x00000000006db76c in ThreadConfig::ipControlLoop (
    this=<value optimized out>, thread_index=3)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/vm/ThreadConfig.cpp:249
#10 0x0000000000431f54 in ndbd_run (foreground=true,
    report_fd=<value optimized out>,
    connect_str=0x680 <Address 0x680 out of bounds>, force_nodeid=2034820416,
    bind_address=0x7fff7948e57c "\t", no_start=<value optimized out>,
    initial=true, initialstart=false, allocated_nodeid=0)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/ndbd.cpp:708
#11 0x0000000000430dfd in real_main (argc=0, argv=0x216aa78)
    at /pb2/build/sb_0-4838533-1329327230.71/rpm/BUILD/mysql-cluster-gpl-7.2.4/mysql-cluster-gpl-7.2.4/storage/ndb/src/kernel/main.cpp:190
#12 0x00007fb3fabcbcdd in __libc_start_main () from /lib64/libc.so.6
#13 0x0000000000430889 in _start ()

Also see the duplicate bug#65141: Data node crash in error 2341 in MySQL Cluster 7.2.5

Also check bug#64278 where there is a crash in DBLQH, line 9738, (Cluster 7.2.2)
Assume this to be a duplicate as well. (The line numbers are a bit off because of the different 7.2.x versions.)

See also Bug #65667 (version 7.2.6 crash in DBLQH, Line: 9785)

Hello,

This bug has been fixed in MySQL Cluster versions 7.0.34, 7.1.23, and 7.2.7.

The fix has been documented as follows:

"An error handling routine in the local query handler used the wrong code path, which could corrupt the transaction ID hash, causing the data node process to fail. This could in some cases possibly lead to failures of other data nodes in the same node group when the failed node attempted to restart. (Bug #14083116)"

http://dev.mysql.com/doc/refman/5.5/en/mysql-cluster-news-5-5-25a-ndb-7-2-7.html
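
To check whether a given installation already carries the fix, the three version numbers named above can be compared against the running version. This is a minimal sketch: the helper name has_fix is hypothetical, and it assumes that later releases in the same series keep the fix. Series not listed above return None rather than a guess.

```python
# First fixed release per series, taken from the versions listed above.
FIXED_IN = {(7, 0): 34, (7, 1): 23, (7, 2): 7}


def has_fix(version):
    """Return True/False for a listed series, or None if the series is not listed.

    Assumes a plain "major.minor.patch" version string such as "7.2.5".
    """
    major, minor, patch = (int(part) for part in version.split("."))
    first_fixed = FIXED_IN.get((major, minor))
    if first_fixed is None:
        return None  # series not covered by the fix note
    return patch >= first_fixed


if __name__ == "__main__":
    for v in ("7.2.4", "7.2.5", "7.2.6", "7.2.7"):
        print(v, has_fix(v))
```

The three versions named in this report (7.2.4, 7.2.5, 7.2.6) all come out as not fixed, while 7.2.7 is the first 7.2 release that does.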

I am closing this bug now. Please reopen it if needed.

Best Regards,

Santo Leto
MySQL Support, EMEA Team