Bug #93045: DBLQH (Line: 17946) 0x00000002 Check lcpPtr.p->lcpPrepareState == LcpRecord::LCP

| Submitted: | 1 Nov 2018 15:49 | Modified: | 5 Nov 2018 18:58 |
|---|---|---|---|
| Reporter: | Andrea Bravetti | Email Updates: | |
| Status: | Not a Bug | Impact on me: | |
| Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
| Version: | 7.6.8 | OS: | Linux (CentOS Linux 7) |
| Assigned to: | MySQL Verification Team | CPU Architecture: | x86 |
| Tags: | DBLQH 17946 lcpPrepareState LcpRecord LCP Forced shutdown | | |
[1 Nov 2018 15:49]
Andrea Bravetti
[1 Nov 2018 21:25]
Andrea Bravetti
If I use cpubind instead of cpuset (ldm={count=4,cpubind=2-5}), I get a different error:

```
2018-11-01 21:56:46 [ndbd] INFO -- /export/home2/pb2/build/sb_1-30900976-1539000784.68/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2018-11-01 21:56:46 [ndbd] INFO -- DBLQH (Line: 18014) 0x00000002 Check !isNdbMtLqh() failed
2018-11-01 21:56:46 [ndbd] INFO -- Error handler shutting down system
2018-11-01 21:56:46 [ndbd] INFO -- Error handler shutdown completed - exiting
2018-11-01 21:56:46 [ndbd] ALERT -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
```

With 2 ldm threads, even with cpubind, it works. It may be my fault, but I can't understand what's wrong.
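For clarity, the two variants differ only in how the LDM threads are tied to CPUs; a minimal sketch of the relevant ThreadConfig fragment (assuming the same CPU range 2-5 in both cases):

```
# Pins each of the 4 LDM threads to one CPU from the range -- the variant
# that fails with "Check !isNdbMtLqh() failed":
ThreadConfig=ldm={count=4,cpubind=2-5}

# Lets the 4 LDM threads float within the CPU set -- the variant from the
# original report:
ThreadConfig=ldm={count=4,cpuset=2-5}
```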
[2 Nov 2018 13:02]
MySQL Verification Team
Hi Andrea,

Please send us the full log files with trace files (ndb_error_reporter) so we can investigate.

Thanks,
Bogdan
[3 Nov 2018 1:00]
Andrea Bravetti
ndb error report
Attachment: ndb_error_report_20181103015351.tar.bz2 (application/x-bzip, text), 1.72 MiB.
[3 Nov 2018 1:01]
Andrea Bravetti
Hi Bogdan, ndb_error_report_20181103015351.tar.bz2 attached.
[3 Nov 2018 1:06]
Andrea Bravetti
Bogdan, I cleaned the log and replicated the error on node 3 just before creating the log, so the node 3 log is useful and clean. The node 2 log is bigger and may contain a lot of useless info...
[3 Nov 2018 1:16]
Andrea Bravetti
On another cluster, with smaller data and less memory but identical configuration, when I try with ldm count=4 it remains stuck after printing a lot of lines like this:

```
2018-11-03 00:39:33 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: JobHandling in block: 247, gsn: 365 elapsed=18029
2018-11-03 00:39:33 [ndbd] INFO -- Watchdog: User time: 2217 System time: 826
```

One ldm thread is at 100% CPU and stuck forever (well, not forever obviously, but for some hours, while it should start in a few minutes). Here too the problem is related to ldm count 4; with ldm count 2 it starts normally. I tried this server because it starts faster than the other one, and I meant to collect the logs here, but I encountered this (different) error, so I replicated it on the original, bigger cluster and took the logs there.
[3 Nov 2018 2:05]
Andrea Bravetti
Don't know if it is important, but on both clusters I'm using isolcpus=2-7 and IRQBALANCE_BANNED_CPUS=fc.
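In other words, CPUs 2-7 are kept away from the general scheduler and from interrupt handling; roughly like this (CentOS 7 file locations assumed):

```
# Kernel command line (e.g. via GRUB_CMDLINE_LINUX in /etc/default/grub):
# keep the scheduler off CPUs 2-7 so the data node threads can own them
isolcpus=2-7

# /etc/sysconfig/irqbalance: 0xfc = 11111100 binary, i.e. don't route
# interrupts to CPUs 2-7
IRQBALANCE_BANNED_CPUS=fc
```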
[3 Nov 2018 11:01]
Andrea Bravetti
I just noticed there is a trivial error in my config.ini:

```
2018-11-03 01:37:52 [MgmtSrvr] INFO -- Loaded config from '/usr/mysql-cluster/ndb_1_config.bin.6'
2018-11-03 01:37:52 [MgmtSrvr] WARNING -- at line 33: ThreadConfig overrides MaxNoOfExecutionThreads
2018-11-03 01:37:52 [MgmtSrvr] WARNING -- at line 39: ThreadConfig overrides MaxNoOfExecutionThreads
```

However, the problem persists even without MaxNoOfExecutionThreads.
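The warning just means both settings are present in the same section, something like this (illustrative, not my exact file):

```
[ndbd default]
MaxNoOfExecutionThreads=8              # ignored: ThreadConfig below overrides it
ThreadConfig=ldm={count=4,cpuset=2-5}
```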
[5 Nov 2018 1:47]
MySQL Verification Team
Hi, after changing the LDM count you would at least have to do an initial restart of all nodes, as LDM(3) and LDM(4) are not being properly initialised since they have no fragments stored in them. From that point of view this is not a bug, but I'll go through the traces you provided to see if there's anything else there. Thanks, Bogdan
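For example, a rolling initial restart from the management client would look roughly like this (node IDs 2 and 3 as in your cluster; one node at a time so the cluster stays up):

```
ndb_mgm -e "2 RESTART -i"   # initial restart of node 2; wait until it rejoins
ndb_mgm -e "3 RESTART -i"   # then node 3
```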
[5 Nov 2018 18:48]
Andrea Bravetti
Bogdan, you are right, it was my fault, I'm sorry. As you can see from the node 2 log, I tried an initial start once, but that run was aborted and the next start was not initial. The error is not too simple to understand; it might be useful to give some helpful warning at start instead of an error at the end, but it's definitely not a bug with the ldm count.
[5 Nov 2018 18:58]
MySQL Verification Team
Hi,

> As you can see from the node 2 log, I tried an initial start once, but that run was aborted and the next start was not initial.

Yes, so... you had the right idea, but...

> It might be useful to give some helpful warning at start instead of an error at the end, but it's definitely not a bug with the ldm count.

Error messages in ndbcluster are something that could use a lot of work, we can all agree on that, especially for these advanced features. We're pushing on stability and usability while still adding more features, relying on users who try advanced options to either figure them out or bounce info through the support/forum/bug system, with some additional info pushed through our devs' blogs... so not an ideal situation, but it will get better :)

Thanks,
Bogdan
[5 Nov 2018 19:19]
MySQL Verification Team
By the way, note that in certain situations, to increase the LDM count you may even have to do a backup and restore, as the initial restart might fail.
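A rough sketch of that path with the native NDB tools (backup id and paths are placeholders; option spellings should be checked against your version's ndb_restore --help):

```
ndb_mgm -e "START BACKUP"   # take a native NDB backup before changing the config

# After the config change and initial start, restore on one data node with
# metadata (-m) plus data (-r), then data only on the other node:
ndb_restore -n 2 -b <backup_id> -m -r --backup-path=/path/to/BACKUP/BACKUP-<backup_id>
ndb_restore -n 3 -b <backup_id> -r --backup-path=/path/to/BACKUP/BACKUP-<backup_id>
```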