MySQL Bugs: #110846: NDBD node shutdown forced by error 2334

Bug #110846	NDBD node shutdown forced by error 2334 - Job Buffer Full
Submitted:	27 Apr 2023 13:48	Modified:	11 May 2023 17:14
Reporter:	Tomasz Cios
Status:	Verified
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	8.0.33	OS:	Red Hat (8.2)
Assigned to:		CPU Architecture:	x86

Description:

For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/sbin/ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x8da451]
/sbin/ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x33) [0x83cc63]
/sbin/ndbd() [0x8bffc5]
/sbin/ndbd(SimulatedBlock::sendSignal(NodeReceiverGroup, unsigned short, SignalT<25u>*, unsigned int, JobBufferLevel, SectionHandle*) const+0x348) [0x8cd848]
/sbin/ndbd(SimulatedBlock::sendFirstFragment(SimulatedBlock::FragmentSendInfo&, NodeReceiverGroup, unsigned short, Signal*, unsigned int, JobBufferLevel, SectionHandle*, bool, unsigned int)+0x1b5) [0x8cf065]
/sbin/ndbd(SimulatedBlock::sendBatchedFragmentedSignal(unsigned int, unsigned short, Signal*, unsigned int, JobBufferLevel, SectionHandle*, bool, SimulatedBlock::Callback&, unsigned int)+0x153) [0x8cf323]
/sbin/ndbd(Dbtup::sendBatchedFIRE_TRIG_ORD(Signal*, unsigned int, unsigned int, SectionHandle*)+0xa1) [0x72aa31]
/sbin/ndbd(Dbtup::executeTrigger(Dbtup::KeyReqStruct*, Dbtup::TupTriggerData*, Dbtup::Operationrec*, bool)+0x7d1) [0x72cca1]
/sbin/ndbd(Dbtup::fireDeferredTriggers(Dbtup::KeyReqStruct*, DLFifoList<ArrayPool<Dbtup::TupTriggerData>, (IntrusiveTags)0, TaggedDoubleLinkMethods<Dbtup::TupTriggerData, (IntrusiveTags)0> >&, Dbtup::Operationrec*, bool)+0x142) [0x72e4e2]
/sbin/ndbd(Dbtup::checkDeferredTriggers(Dbtup::KeyReqStruct*, Dbtup::Operationrec*, Dbtup::Tablerec*, bool)+0x121) [0x72e671]
/sbin/ndbd(Dbtup::execFIRE_TRIG_REQ(Signal*)+0x2d1) [0x72ebc1]
/sbin/ndbd(Dblqh::execFIRE_TRIG_REQ(Signal*)+0x14a) [0x620f4a]
/sbin/ndbd(Dblqh::execPACKED_SIGNAL(Signal*)+0x176) [0x658f36]
/sbin/ndbd(SimulatedBlock::executeFunction_async(unsigned short, Signal*)+0x61) [0x8bf931]
/sbin/ndbd(FastScheduler::doJob(unsigned int)+0x128) [0x8c0698]
/sbin/ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x556) [0x8d11f6]
/sbin/ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x7f5) [0x4fc7f5]
/sbin/ndbd(real_main(int, char**)+0x403) [0x4fa903]
/sbin/ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x1242) [0x4fa032]
/sbin/ndbd(real_main(int, char**)+0x35a) [0x4fa85a]
/sbin/ndbd(main+0x3b) [0x4f500b]
/lib64/libc.so.6(__libc_start_main+0xf3) [0x7f84bccff6a3]
/sbin/ndbd(_start+0x2e) [0x4f7aae]
2023-04-27 15:33:36 [ndbd] INFO     -- Job Buffer Full
2023-04-27 15:33:36 [ndbd] INFO     -- APZJobBuffer.C
2023-04-27 15:33:36 [ndbd] INFO     -- Error handler shutting down system
2023-04-27 15:33:36 [ndbd] INFO     -- Error handler shutdown completed - exiting
2023-04-27 15:33:36 [ndbd] ALERT    -- Node 1: Forced node shutdown completed. Caused by error 2334: 'Job buffer congestion(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
I'm importing quite a large number of rows from CSV into NDB.
For this reason I split the file into 10k-row chunks.
I use a "for" loop in bash to start ndb_import runs one after the other to import all portions.
After several chunks I get an error:

2023-04-27 15:33:36 [NdbApi] INFO     -- Node 1 disconnected in recv with errnum: 104 in state: 0

Repeated 2-3 times on 8.0.29; upgraded to 8.0.33 and got the same error.

One remark: with --opbatch it seems to work fine.
I tried with --opbatch=10 and --opbatch=20 - performance is still pretty good on my test environment and ndbd does not break.
I have not tried to find the value of this parameter at which it starts to fail.
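
For readers who want to try the same procedure, here is a minimal sketch of the chunked import with the --opbatch workaround applied. It is not the reporter's actual script. The database, table and file names, the comma separator, the default of 20 for --opbatch, and the NDB_IMPORT override variable are all assumptions made for illustration. The --table option is used so that chunk files do not have to be named after the target table; check `ndb_import --help` on your version to confirm it is available.

```bash
#!/usr/bin/env bash
# Sketch: import a large CSV into NDB in fixed-size chunks, capping the
# number of operations per execution batch with --opbatch.
# Assumptions (not from the report): all names, the ',' separator and the
# NDB_IMPORT variable. Rows must not contain embedded newlines, because
# the file is split purely by line count.

# Usage: import_in_chunks DB TABLE CSV_FILE [CHUNK_ROWS] [OPBATCH]
import_in_chunks() {
  local db="$1" table="$2" csv="$3"
  local chunk_rows="${4:-10000}" opbatch="${5:-20}"
  # Hypothetical override hook, e.g. NDB_IMPORT=echo for a dry run.
  local importer="${NDB_IMPORT:-ndb_import}"
  local workdir chunk status=0

  workdir=$(mktemp -d) || return 1

  # Produce chunk_aa, chunk_ab, ... with at most chunk_rows lines each.
  if ! split -l "$chunk_rows" "$csv" "$workdir/chunk_"; then
    rm -rf "$workdir"
    return 1
  fi

  # Run the imports one after the other and stop at the first failure.
  for chunk in "$workdir"/chunk_*; do
    [ -e "$chunk" ] || continue   # empty input: nothing to import
    if ! "$importer" "$db" "$chunk" \
         --table="$table" \
         --fields-terminated-by=',' \
         --opbatch="$opbatch"; then
      echo "import failed on $chunk" >&2
      status=1
      break
    fi
  done

  rm -rf "$workdir"
  return "$status"
}

# Example call (hypothetical names):
#   import_in_chunks mydb mytable /data/rows.csv 10000 20
```

Stopping at the first failed chunk keeps the data node log easier to correlate with the chunk that triggered the shutdown.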

Hi,

I cannot reproduce this, and it does not look like a bug but rather like an improper configuration of the ndbcluster.

Can you provide a reproducible test case?

Thanks

Hi,

> Is there a downloadable image of a VM with "properly configured ndbcluster"?

No, and it would not work, as MySQL Cluster needs to be sized properly for your application and the way you are using it. There is no "fits all" configuration for MySQL Cluster, especially as it is designed to crash when it cannot keep up with its tasks rather than slow down (which is what InnoDB would do). That is also why support and consulting for ndbcluster is on a whole other level compared to enterprise MySQL (InnoDB).

> If that is not possible - which improper setting could lead to this error?

As you already noted, limiting the number of operations per batch with opbatch solved the problem. Check the data node configuration parameters for maximum operations at https://dev.mysql.com/doc/refman/8.0/en/mysql-cluster-params-ndbd.html and increase them to match your needs.

Hi,

I discussed this with colleagues. I do not believe this is a bug, but in theory it could be.

You are using ndbd (the single-threaded data node); you would get better results with ndbmtd (the multithreaded data node). You also have triggers: the crash shows that deferred triggers are involved, the ones used with FKs.

If you can share your schema I can try to reproduce this behavior and detect if we actually do have a bug or it is just about sizing the cluster.

The workaround, as you already noticed, is to use smaller transactions.

Also, changing the foreign key triggers from NO ACTION to RESTRICT could help too.
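
As an illustration of that last suggestion, the sketch below prints the SQL needed to recreate one foreign key with RESTRICT. The reporter's schema is not part of this thread, so every table, column and constraint name passed to the helper is hypothetical, and the helper itself (fk_restrict_sql) is just a convenience written for this example. Review the output before running it.

```bash
#!/usr/bin/env bash
# Sketch: print the statements that drop a foreign key and re-add it with
# ON DELETE / ON UPDATE RESTRICT instead of NO ACTION.
# All identifiers are supplied by the caller and are hypothetical here.

# Usage: fk_restrict_sql CHILD_TABLE FK_NAME CHILD_COL PARENT_TABLE PARENT_COL
fk_restrict_sql() {
  local child="$1" fk="$2" col="$3" parent="$4" pcol="$5"

  # Two separate statements: first remove the existing constraint ...
  printf 'ALTER TABLE `%s` DROP FOREIGN KEY `%s`;\n' "$child" "$fk"

  # ... then add it back under the same name with RESTRICT actions.
  printf 'ALTER TABLE `%s` ADD CONSTRAINT `%s` FOREIGN KEY (`%s`) REFERENCES `%s` (`%s`) ON DELETE RESTRICT ON UPDATE RESTRICT;\n' \
    "$child" "$fk" "$col" "$parent" "$pcol"
}

# Example (hypothetical schema), printing the SQL for inspection:
#   fk_restrict_sql child_tbl fk_child_parent parent_id parent_tbl id
```

The output can be piped to the mysql client once it has been checked against the real schema.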

Hi,

Thanks for the data; I managed to reproduce the problem. The NDB team will take it from here to see if they can pinpoint the reason for the crash and fix it.

all best