Bug #46914 cluster crash, SimulatedBlock.cpp DBTUP (Line: 662), failed ndbrequire
Submitted: 25 Aug 12:11 Modified: 9 Nov 20:09
Reporter: Bogdan Kecman
Status: Verified
Category:Server: Cluster Severity:S2 (Serious)
Version:mysql-5.1-telco-7.0 OS:Any
Assigned to: Jonas Oreland Target Version:
Tags: mysql-5.1.34 ndb-7.0.6
Triage: Triaged: D2 (Serious) / R3 (Medium) / E4 (High)

[25 Aug 12:11] Bogdan Kecman
Description:
For no known reason, half of the data nodes crash with:

2009-08-24 16:55:52 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

and the other half dies with:
2009-08-24 16:55:55 [MgmSrvr] ALERT -- Node 6: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error). Temporary error, restart node'.

The data node log shows:
...
...
2009-08-24 16:55:52 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=500
2009-08-24 16:55:52 [ndbd] INFO     -- Watchdog: User time: 2285505  System time: 714551
2009-08-24 16:55:52 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=500
2009-08-24 16:55:52 [ndbd] INFO     -- Watchdog: User time: 2285505  System time: 714551
2009-08-24 16:55:52 [ndbd] WARNING  -- Ndb kernel thread 3 is stuck in: Job Handling elapsed=600
2009-08-24 16:55:52 [ndbd] INFO     -- Watchdog: User time: 2285510  System time: 714559
...
...
Warning: 1 thread(s) did not stop before starting crash dump.
Warning: 1 thread(s) did not stop before starting crash dump.
2009-08-24 16:55:53 [ndbd] INFO     -- SimulatedBlock.cpp
2009-08-24 16:55:53 [ndbd] INFO     -- DBTUP (Line: 662) 0x0000000a
2009-08-24 16:55:53 [ndbd] INFO     -- Watchdog shutting down system
2009-08-24 16:55:53 [ndbd] INFO     -- Watchdog shutdown completed - exiting
2009-08-24 16:55:54 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
2 management nodes
6 data nodes (multithreaded)
20G data memory
2G index memory
mysql-5.1.34 ndb-7.0.6

Suggested fix:
.
[27 Aug 14:27] Bogdan Kecman
Looks like ndbmtd reaches the LongMessageBuffer limit faster than ndbd, so increasing
LongMessageBuffer from the default 4M to 8M or more should solve the problem.
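
As a sketch of the workaround described above, the parameter would normally be set in the [ndbd default] section of config.ini so that it applies to every data node. The value 8M is simply the lower bound suggested in the comment, not a tuned figure, and it is assumed here that the data nodes need a restart to pick up the change.

```ini
[ndbd default]
# Workaround sketch: raise the buffer from its 4M default.
# 8M is the minimum suggested above; size it for the actual workload.
LongMessageBuffer=8M
```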
[28 Sep 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[28 Oct 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[9 Nov 20:09] Andrew Hutchings
Test case:
Need minimum 2xndbmtd, 1xmysqld (with 4 [mysqld] sections in config.ini)

config.ini:
LongMessageBuffer=512K
MaxNoOfExecutionThreads=4

my.cnf:
ndb-cluster-connection-pool=4
log-bin

shell> mysqlslap -uroot --auto-generate-sql -endb -c4 -x4 --number-of-queries=10000 --commit=10
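
For anyone reproducing this, the test case above could be laid out as a complete config.ini along the following lines. Only LongMessageBuffer and MaxNoOfExecutionThreads come from the test case; NoOfReplicas=2 and the localhost host names are assumptions for a single-machine setup, and the data nodes are assumed to be started with the ndbmtd binary.

```ini
[ndbd default]
# Assumption: one node group with two replicas (2 x ndbmtd).
NoOfReplicas=2
# From the test case: a deliberately small buffer so it is exhausted quickly.
LongMessageBuffer=512K
MaxNoOfExecutionThreads=4

[ndb_mgmd]
# Assumption: everything runs on one host.
HostName=localhost

[ndbd]
HostName=localhost

[ndbd]
HostName=localhost

# Four API slots, matching ndb-cluster-connection-pool=4 in my.cnf.
[mysqld]
[mysqld]
[mysqld]
[mysqld]
```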
[10 Nov 13:11] Jonas Oreland
So the problem is *with* replication and ndbmtd.
In ndbd, the commit triggers fire and put data directly into the SUMA buffer.
But with ndbmtd, LQH(s) and SUMA run in different threads, so this is not possible;
therefore the LongMessageBuffer is used to pass the data.

But that resource can be exhausted, causing the crash.
In ndbd, the SUMA buffer:
1) has its own memory manager (using DataMemory)
2) if that is exhausted, it's handled "gracefully"
   (data nodes stay alive, but replication gets a GAP event)

---

So a solution will have to
1) use a different memory pool to pass data between LQH/SUMA (based on DM)
2) handle out of memory gracefully
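
The difference between the two behaviours discussed above can be illustrated with a small model. This is only a sketch in Python, not NDB code: BoundedPool, EventChannel, BufferExhausted and GAP are hypothetical names invented for the illustration. The pool stands in for the memory used to pass trigger data from LQH to SUMA; with graceful=False an exhausted pool raises (the analogue of the failed ndbrequire), and with graceful=True the event is dropped and the consumer later sees a single GAP marker.

```python
from dataclasses import dataclass, field

# Hypothetical marker delivered to the consumer when events were dropped.
GAP = "GAP"


class BufferExhausted(Exception):
    """Stands in for the failed ndbrequire that shuts the node down."""


@dataclass
class BoundedPool:
    """A fixed byte budget shared by the producer and consumer sides."""
    capacity: int
    used: int = 0

    def try_reserve(self, size: int) -> bool:
        # Refuse the reservation instead of overcommitting.
        if self.used + size > self.capacity:
            return False
        self.used += size
        return True

    def release(self, size: int) -> None:
        self.used -= size


@dataclass
class EventChannel:
    """Passes event payloads from a producer to a consumer via the pool."""
    pool: BoundedPool
    graceful: bool
    queue: list = field(default_factory=list)
    gap_pending: bool = False

    def send(self, payload: bytes) -> bool:
        if self.pool.try_reserve(len(payload)):
            self.queue.append(payload)
            return True
        if not self.graceful:
            # Current behaviour in the report: exhaustion is fatal.
            raise BufferExhausted("pool exhausted")
        # Graceful behaviour: drop the event and remember that a gap occurred.
        self.gap_pending = True
        return False

    def drain(self) -> list:
        """Consumer side: take all queued events and free their memory."""
        out = []
        for payload in self.queue:
            self.pool.release(len(payload))
            out.append(payload)
        self.queue.clear()
        if self.gap_pending:
            # Report the loss once, after the events that did arrive.
            out.append(GAP)
            self.gap_pending = False
        return out
```

In this model the producer keeps running after a refused send, which corresponds to point 2) above: the data nodes stay alive and only the event stream records a gap.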
[10 Nov 13:24] Andrew Hutchings
Yes, without log-bin I hit bug#48441 instead (until that was fixed).  It was very easy to
hit this bug once log-bin was turned on.