Bug #20843 test fails randomly with assertion in completeClusterFailed
Submitted: 4 Jul 2006 8:40
Modified: 6 Jul 2006 9:02
Reporter: Tomas Ulin
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.1
Assigned to: Tomas Ulin
CPU Architecture: Any

[4 Jul 2006 8:40] Tomas Ulin
Description:
fails randomly in push build
assertion completeClusterFailed

How to repeat:
fails randomly in push build
assertion completeClusterFailed
[4 Jul 2006 9:23] Kristian Nielsen
This one is difficult to repeat, but not impossible.

It occurs in Pushbuild, quite often in the Valgrind build, but also occasionally on other hosts.

I was able to repeat by running the test in a loop in Valgrind:

(cd mysql-test && for i in `seq 1 100`; do
    echo XXX $i XXX
    MTR_BUILD_THREAD=4 perl mysql-test-run.pl --tmpdir=/dev/shm/t4 --vardir=/dev/shm/v4 \
        --timer --ps-protocol --mysqld=--binlog-format=row --valgrind-all \
        ndb_autodiscover3 | tee /tmp/1
    fgrep -q '[ fail ]' /tmp/1 && exit 1
done)

(It failed on the 9th run.)

I do not think Valgrind causes the problem; it just happens more often under Valgrind, perhaps due to different thread scheduling. The same crash is seen on most/all hosts in pushbuild, just much less frequently.
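
For anyone who wants to reuse the loop-until-failure approach on other intermittent tests, here is a minimal sketch as a shell function. The name repeat_until_fail and its argument order are my own invention, not part of mysql-test-run.pl. It assumes a failing run prints the '[ fail ]' marker, as in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of the MySQL test suite): run a command
# up to MAX_RUNS times, stopping at the first run whose output contains
# the '[ fail ]' marker.
# usage: repeat_until_fail MAX_RUNS LOGFILE command [args...]
repeat_until_fail() {
    max_runs=$1
    log=$2
    shift 2
    i=1
    while [ "$i" -le "$max_runs" ]; do
        echo "XXX $i XXX"
        # Show the output and keep a copy; the log is overwritten each run.
        "$@" 2>&1 | tee "$log"
        # grep -F matches the marker as a fixed string (same as fgrep).
        if grep -F -q '[ fail ]' "$log"; then
            echo "failed on run $i"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max_runs runs"
    return 0
}

# Example call, reusing the options from the command above
# (run from the mysql-test directory):
#   export MTR_BUILD_THREAD=4
#   repeat_until_fail 100 /tmp/1 perl mysql-test-run.pl \
#       --tmpdir=/dev/shm/t4 --vardir=/dev/shm/v4 --timer --ps-protocol \
#       --mysqld=--binlog-format=row --valgrind-all ndb_autodiscover3
```

The function returns 1 on the first failing run and leaves that run's output in the log file, so the failing log can be inspected afterwards.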

From the master1.err log:

060703 17:33:15 [ERROR] /usr/local/mysql/mysql-5.1-pristine/sql/mysqld: Incorrect information in file: './test/t2.frm'
060703 17:33:16 [Note] NDB Binlog: CREATE TABLE Event: REPL$test/t2
060703 17:33:16 [Note] NDB Binlog: logging ./test/t2
out of order bucket detected at cluster disconnect, data.gci: 27.  tmp->m_gci: 6
mysqld: NdbEventOperationImpl.cpp:1634: void NdbEventBuffer::completeClusterFailed(): Assertion `false' failed.

A stack trace from Valgrind:

==10880== Thread 2:
==10880== Conditional jump or move depends on uninitialised value(s)
==10880==    at 0x410264A: vfprintf (in /lib/tls/libc-2.3.6.so)
==10880==    by 0x4100C99: buffered_vfprintf (in /lib/tls/libc-2.3.6.so)
==10880==    by 0x4100F5D: vfprintf (in /lib/tls/libc-2.3.6.so)
==10880==    by 0x4109D61: fprintf (in /lib/tls/libc-2.3.6.so)
==10880==    by 0x840FD9E: print_stacktrace (stacktrace.c:158)
==10880==    by 0x824D3BD: handle_segfault (mysqld.cc:2145)
==10880==    by 0x4052657: (within /lib/tls/libpthread-2.3.6.so)
==10880==    by 0x40EF06A: abort (in /lib/tls/libc-2.3.6.so)
==10880==    by 0x40E6734: __assert_fail (in /lib/tls/libc-2.3.6.so)
==10880==    by 0x86958F4: NdbEventBuffer::completeClusterFailed() (NdbEventOperationImpl.cpp:1634)
==10880==    by 0x867844D: Ndb::report_node_failure_completed(unsigned) (Ndbif.cpp:264)
==10880==    by 0x8678523: Ndb::statusMessage(void*, unsigned, bool, bool) (Ndbif.cpp:224)
==10880==    by 0x868497D: TransporterFacade::ReportNodeFailureComplete(unsigned short) (TransporterFacade.cpp:834)
==10880==    by 0x86CB55B: ClusterMgr::execNF_COMPLETEREP(unsigned const*) (ClusterMgr.cpp:393)
==10880==    by 0x86CB700: ClusterMgr::reportNodeFailed(unsigned short) (ClusterMgr.cpp:474)
==10880==    by 0x86CCAD7: ClusterMgr::reportDisconnected(unsigned short) (ClusterMgr.cpp:436)
[5 Jul 2006 13:46] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8766
[5 Jul 2006 17:53] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8793
[5 Jul 2006 21:43] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/8802