MySQL Bugs: #100639: Forced node shutdown Caused by error 2334: 'Job buffer congestion'

Bug #100639	Forced node shutdown Caused by error 2334: 'Job buffer congestion'
Submitted:	26 Aug 2020 5:11	Modified:	15 Sep 2020 5:46
Reporter:	John Wu	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	8.0.20	OS:	Red Hat
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
Normally at midnight or early morning when non peak usage, the db will shutdown with the errors:

2020-08-24 23:19:12 [MgmtSrvr] INFO     -- Node 4 disconnected in recv with errnum: 104 in state: 0
2020-08-24 23:19:12 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2020-08-24 23:19:12 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2334: 'Job buffer congestion(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2020-08-24 23:19:12 [MgmtSrvr] ALERT    -- Node 3: Node 4 Disconnected
2020-08-24 23:19:12 [MgmtSrvr] INFO     -- Node 3: Communication to Node 4 closed

The second node will also shutdown after awhile (could be within the hour or a couple of hours later)

Message from ndb_3_out.log
2020-08-24 23:21:39 [ndbd] INFO     -- timerHandlingLab, expected 10ms sleep, not scheduled for: 150 (ms), exec_time 158243 us, sys_time 0 us
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x889d7e]
ndbd(ndb_print_stacktrace()+0x45) [0x843145]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x20) [0x802980]
ndbd() [0x8573f1]
ndbd(SimulatedBlock::sendSignal(unsigned int, unsigned short, Signal*, unsigned int, JobBufferLevel, LinearSectionPtr*, unsigned int) const+0x2b5) [0x8604b5]
ndbd(Dbtup::sendReadAttrinfo(Signal*, Dbtup::KeyReqStruct*, unsigned int)+0x2a1) [0x7e3171]
ndbd(Dbtup::interpreterStartLab(Signal*, Dbtup::KeyReqStruct*)+0x1f9) [0x6cf0b9]
ndbd(Dbtup::execTUPKEYREQ(Signal*)+0x2db) [0x6d390b]
ndbd(Dblqh::next_scanconf_tupkeyreq(Signal*, Dblqh::ScanRecord*, Dblqh::TcConnectionrec*, Dblqh::Fragrecord*, unsigned int)+0xec) [0x646eac]
ndbd(Dblqh::continue_next_scan_conf(Signal*, Dblqh::ScanRecord::ScanState, Dblqh::ScanRecord*)+0x37) [0x64cdb7]
ndbd(Dbtux::continue_scan(Signal*, Ptr<Dbtux::ScanOp>, Dbtux::Frag&, bool)+0x19f) [0x71491f]
ndbd(Dbtux::execNEXT_SCANREQ(Signal*)+0x68) [0x715ab8]
ndbd(Dblqh::send_next_NEXT_SCANREQ(Signal*, SimulatedBlock*, void (SimulatedBlock::*)(Signal*), Dblqh::ScanRecord*, unsigned int)+0xd2) [0x60f5d2]
ndbd(Dblqh::scanNextLoopLab(Signal*, unsigned int, unsigned int, Dblqh::ScanRecord*, Dblqh::Fragrecord*)+0x109) [0x60f8d9]
ndbd(Dblqh::continue_next_scan_conf(Signal*, Dblqh::ScanRecord::ScanState, Dblqh::ScanRecord*)+0x37) [0x64cdb7]
ndbd(Dbtux::continue_scan(Signal*, Ptr<Dbtux::ScanOp>, Dbtux::Frag&, bool)+0x19f) [0x71491f]
ndbd(Dbtux::execNEXT_SCANREQ(Signal*)+0x68) [0x715ab8]
ndbd(Dblqh::execACC_CHECK_SCAN(Signal*)+0xb9) [0x60c179]
ndbd(FastScheduler::doJob(unsigned int)+0x1d5) [0x857af5]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x67a) [0x866c1a]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x756) [0x4f0e76]
ndbd(real_main(int, char**)+0x2d8) [0x4ef028]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0xf20) [0x4eeb90]
ndbd(real_main(int, char**)+0x346) [0x4ef096]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x7ff5f7b65b35]
ndbd() [0x4ecc00]
2020-08-24 23:21:39 [ndbd] INFO     -- Job Buffer Full
2020-08-24 23:21:39 [ndbd] INFO     -- APZJobBuffer.C
2020-08-24 23:21:39 [ndbd] INFO     -- Error handler shutting down system
2020-08-24 23:21:39 [ndbd] INFO     -- Error handler shutdown completed - exiting
2020-08-24 23:21:40 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2334: 'Job buffer congestion(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Config file:
[ndb_mgmd default]
DataDir=/usr/local/mysql/mysql-cluster

[ndb_mgmd]
HostName=10.19.188.50

[ndbd default]
NoOfReplicas=2      # Number of replicas

DataMemory=100G
MaxNoOfTables=1024
MaxNoOfAttributes=10000
MaxNoOfConcurrentOperations=500000
MaxNoOfOrderedIndexes=5000
MaxNoOfTriggers=1000
NoOfFragmentLogFiles=300

[ndbd]
HostName=10.19.188.53
NodeId=3

[ndbd]
HostName=10.19.188.54
NodeId=4

[mysqld]
HostName=10.19.188.53

[mysqld]
HostName=10.19.188.54

How to repeat:
Not able to repeat it but it happens regularly every couple of days

Hi,

For start, we really need FULL set of logs (ndb_error_reporter will create that for you), compress it and use "files" tab info to upload to us. Make sure you specify exact filename in the reply here (I cannot dl the file without exact name).

Now, with regards the "at midnight or early morning when non peak usage" this in 99.9% cases means that you have some cron job or similar that's set to run at "non peak hours" that is killing your cluster. Please investigate what is it that you are running at midnight/morning, cron.daily jobs, cron.weekly, cron.monthly ... usually some run at midnight, 2am, 3am, 4am .. something from there is the culprit. Cluster, of course, should not crash, but finding what's causing the crash will help us solve the problem

kind regards
Bogdan

Hi
I have uploaded the file. File name is mysql-bug-data-100639.zip
Hope you can help out what's the issue on my case.
Thanks very much!

Hi,

Thanks for the logs. This looks like your server was overloaded.

Have you checked your cron schedule? Do you have server monitoring, can you check CPU usage during period when the crash happened?

Thanks

htop

Attachment: htop.png (image/png, text), 522.54 KiB.

Hi thanks for your prompt reply. If you can advise, I have a 32 core server and 252 GB of RAM and I thought it is already too much, But now I see my ndbd process is running at 90 over percent, when practically no one is accessing the system now. Can you advise what I can do on this? Thanks.

I have no crontab -e job except for a mysql backup job running at 4am every morning, and only on one of the server.

Hi,

I cannot reproduce the problem but looking at the logs it looks like when the crash happens your system is overloaded. What is loading the system I cannot tell from the logs.

You might want to setup some monitoring tools like MySQL Enterprise Monitor or at least Zabbix or Nagios or if nothing else sysstat so you can look at sar reports. But with existing data there's nothing I can do to detect why your crash is happening.

All best
Bogdan

Thank you very much for your attention on this matter.

Best regards.