MySQL Bugs: #46914: cluster crash, SimulatedBlock.cpp DBTUP (Line: 662), failed ndbrequire

Bug #46914	cluster crash, SimulatedBlock.cpp DBTUP (Line: 662), failed ndbrequire
Submitted:	25 Aug 2009 10:11	Modified:	18 Feb 2010 8:45
Reporter:	Bogdan Kecman	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	mysql-5.1-telco-7.0	OS:	Any
Assigned to:	Jonas Oreland	CPU Architecture:	Any
Tags:	mysql-5.1.34 ndb-7.0.6

Description:
Without any known reason half of the data nodes crash with:

2009-08-24 16:55:52 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed
. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal err
or, programming error or missing error message, please report a bug). Temporary
error, restart node'.

and the other half dies with:
2009-08-24 16:55:55 [MgmSrvr] ALERT -- Node 6: Forced node shutdown completed
. Caused by error 2305: 'Node lost connection to other nodes and can not form a
unpartitioned cluster, please investigate if there are error(s) on other node(s)
(Arbitration error). Temporary error, restart node'.

The datanode log show:
...
...
2009-08-24 16:55:52 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=500
2009-08-24 16:55:52 [ndbd] INFO     -- Watchdog: User time: 2285505  System time: 714551
2009-08-24 16:55:52 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=500
2009-08-24 16:55:52 [ndbd] INFO     -- Watchdog: User time: 2285505  System time: 714551
2009-08-24 16:55:52 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=600
2009-08-24 16:55:52 [ndbd] INFO     -- Watchdog: User time: 2285510  System time: 714559
...
...
Warning: 1 thread(s) did not stop before starting crash dump.
Warning: 1 thread(s) did not stop before starting crash dump.
2009-08-24 16:55:53 [ndbd] INFO     -- SimulatedBlock.cpp
2009-08-24 16:55:53 [ndbd] INFO     -- DBTUP (Line: 662) 0x0000000a
2009-08-24 16:55:53 [ndbd] INFO     -- Watchdog shutting down system
2009-08-24 16:55:53 [ndbd] INFO     -- Watchdog shutdown completed - exiting
2009-08-24 16:55:54 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
2 management nodes
6 data nodes (multithreaded)
20G data memory
2G index memory
mysql-5.1.34 ndb-7.0.6

Suggested fix:
.

Looks like ndbmtd reaches the LongMessageBuffer limit faster then ndbd, so incresing LongMessageBuffer from default 4M to 8M or more should solve the problem.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Test case:
Need minimum 2xndbmtd, 1xmysqld (with 4 [mysqld] sections in config.ini)

config.ini:
LongMessageBuffer=512K
MaxNoOfExecutionThreads=4

my.cnf:
ndb-cluster-connection-pool=4
log-bin

shell> mysqlslap -uroot --auto-generate-sql -endb -c4 -x4 --number-of-queries=10000 --commit=10

So the problem is *with* replication and ndbtmd
In ndbd, the commit triggers fire and puts data directly into SUMA buffer
But with ndbmtd, LQH(s) and SUMA runs in different threads, so this is not possible, therefor the LongMessageBuffer is used to pass the data.

But that resource can be exhausted, causing the crash.
In ndbd 
1) it's has it's own memory manager (using DataMemory)
2) if that is exhauseted, it's handled "gracefully"
   (datanodes stay alive, but replication gets GAP event)

---

So a solution, will have to
1) use a different memory pool to pass data between LQH/SUMA (based on DM)
2) handle out of memory gracefully

Yes, without log-bin I hit bug#48441 instead (until that was fixed).  It was very easy to hit this bug once log-bin was turned on.

Note that the error message for this is now changing to:

Status: Permanent error, external action needed
Message: Signal lost, out of long signal memory, please increase LongMessageBuffer (Resource configuration error)
Error: 2351
Error data: Out of LongMessageBuffer in sendSignal

tentative patch

Attachment: tot.patch (application/octet-stream, text), 14.54 KiB.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/100698

3412 Jonas Oreland	2010-02-18
      ndb - bug#46914 - pass replication data from TUP to SUMA in datamemory rather than in longmessagebuffer

pushed to 7.0.13 and 7.1.2

Documented bugfix in the NDB-7.0.13 and 7.1.2 changelogs as follows:

      Replication of a MySQL Cluster using multi-threaded data nodes 
      could fail with forced shutdowns of some data nodes due to the 
      fact that ndbmtd exhausted the LongMessageBuffer much more quickly 
      than ndbd. After this fix, passing of replication data between the 
      DBTUP and SUMA NDB kernel blocks is done using DataMemory rather 
      than LongMessageBuffer.

      Until you can upgrade, you may be able to work around this issue 
      by increasing the LongMessageBuffer setting; doubling the default 
      value should be sufficient in most cases.

Also noted this issue and workaround in the LongMessageBuffer description in the Manual.

Closed.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/101329

3407 Martin Skold	2010-02-24 [merge]
      Merge
      added:
        storage/ndb/cmake/
        storage/ndb/cmake/cmake_parse_arguments.cmake
        storage/ndb/cmake/libutils.cmake
      modified:
        .bzrignore
        mysql-test/suite/ndb/r/ndb_dd_alter.result
        mysql-test/suite/ndb/r/ndb_dd_restore_compat.result
        mysql-test/suite/ndb/t/show_attributes.inc
        storage/ndb/CMakeLists.txt
        storage/ndb/Makefile.am
        storage/ndb/config/type_ndbapitools.cmake
        storage/ndb/include/kernel/GlobalSignalNumbers.h
        storage/ndb/include/util/Parser.hpp
        storage/ndb/src/CMakeLists.txt
        storage/ndb/src/kernel/blocks/CMakeLists.txt
        storage/ndb/src/kernel/blocks/dbacc/Dbacc.hpp
        storage/ndb/src/kernel/blocks/dbacc/DbaccInit.cpp
        storage/ndb/src/kernel/blocks/dbacc/DbaccMain.cpp
        storage/ndb/src/kernel/blocks/dblqh/Dblqh.hpp
        storage/ndb/src/kernel/blocks/dblqh/DblqhInit.cpp
        storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
        storage/ndb/src/kernel/blocks/dbtup/Dbtup.hpp
        storage/ndb/src/kernel/blocks/dbtup/DbtupGen.cpp
        storage/ndb/src/kernel/blocks/dbtup/DbtupTrigger.cpp
        storage/ndb/src/kernel/blocks/suma/Suma.cpp
        storage/ndb/src/kernel/blocks/suma/Suma.hpp
        storage/ndb/src/kernel/blocks/suma/SumaInit.cpp
        storage/ndb/src/kernel/vm/ArrayPool.hpp
        storage/ndb/src/kernel/vm/ndbd_malloc_impl.cpp
        storage/ndb/src/mgmsrv/CMakeLists.txt
        storage/ndb/src/mgmsrv/MgmtSrvr.cpp
        storage/ndb/src/mgmsrv/MgmtSrvr.hpp
        storage/ndb/src/mgmsrv/Services.cpp
        storage/ndb/src/ndbapi/NdbDictionary.cpp
        storage/ndb/test/include/HugoTransactions.hpp
        storage/ndb/test/include/NdbRestarter.hpp
        storage/ndb/test/ndbapi/testMgm.cpp
        storage/ndb/test/ndbapi/testUpgrade.cpp
        storage/ndb/test/run-test/command.cpp
        storage/ndb/test/run-test/db.cpp
        storage/ndb/test/run-test/upgrade-tests.txt
        storage/ndb/test/src/CpcClient.cpp
        storage/ndb/test/src/HugoTransactions.cpp
        storage/ndb/test/src/NdbRestarter.cpp
        storage/ndb/tools/CMakeLists.txt
        storage/ndb/tools/desc.cpp