Bug #93045 DBLQH (Line: 17946) 0x00000002 Check lcpPtr.p->lcpPrepareState == LcpRecord::LCP
Submitted: 1 Nov 2018 15:49 Modified: 5 Nov 2018 18:58
Reporter: Andrea Bravetti Email Updates:
Status: Not a Bug Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version:7.6.8 OS:Linux (CentOS Linux 7)
Assigned to: Bogdan Kecman CPU Architecture:x86
Tags: DBLQH 17946 lcpPrepareState LcpRecord LCP Forced shutdown

[1 Nov 2018 15:49] Andrea Bravetti
Description:
While increasing the number of ldm threads I encountered a forced node shutdown with 4 threads.

OK with:

ThreadConfig=ldm={count=2,cpuset=2,3},recv={count=1,cpuset=4},send={count=1,cpuset=5},rep={count=1,cpuset=6},tc={count=1,cpuset=7}

Error with:

ThreadConfig=ldm={count=4,cpuset=2,3,4,5},tc={count=2,cpuset=6,7},recv={count=2,cpuset=6,7},send={count=2,cpuset=6,7},rep={count=1,cpuset=6,7}

Error in ndb_mgm:

Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Error in ndbmtd:

2018-11-01 16:24:29 [ndbd] INFO     -- Restore Database Off-line Completed
2018-11-01 16:24:29 [ndbd] INFO     -- Bring Database On-line Starting
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(2): Starting to copy fragments.
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(1): Starting to copy fragments.
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(1): Completed copy of fragment T2F0. Changed +0/-57 rows, 3192 bytes. 5 pct churn to 1108 rows.
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(2): Completed copy of fragment T2F1. Changed +0/-97 rows, 5432 bytes. 9 pct churn to 1068 rows.
2018-11-01 16:24:29 [ndbd] INFO     -- (2)Starting local LCP(1669,1)
2018-11-01 16:24:29 [ndbd] INFO     -- (1)Starting local LCP(1669,1)
2018-11-01 16:24:29 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-30900976-1539000784.68/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2018-11-01 16:24:29 [ndbd] INFO     -- DBLQH (Line: 17946) 0x00000002 Check lcpPtr.p->lcpPrepareState == LcpRecord::LCP_IDLE failed
2018-11-01 16:24:29 [ndbd] INFO     -- Error handler shutting down system
2018-11-01 16:24:30 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-11-01 16:24:30 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
Not sure if this happens for everyone, but for me this line triggers the error:

ThreadConfig=ldm={count=4,cpuset=2,3,4,5},tc={count=2,cpuset=6,7},recv={count=2,cpuset=6,7},send={count=2,cpuset=6,7},rep={count=1,cpuset=6,7}

Using only 2 ldm threads always leaves the node working.
[1 Nov 2018 21:25] Andrea Bravetti
If I use cpubind instead of cpuset (ldm={count=4,cpubind=2-5}), I get a different error:

2018-11-01 21:56:46 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-30900976-1539000784.68/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2018-11-01 21:56:46 [ndbd] INFO     -- DBLQH (Line: 18014) 0x00000002 Check !isNdbMtLqh() failed
2018-11-01 21:56:46 [ndbd] INFO     -- Error handler shutting down system
2018-11-01 21:56:46 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-11-01 21:56:46 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

With 2 ldm threads, even with cpubind, it works.
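For reference, a minimal sketch of the two binding styles compared above (the CPU numbers are just the ones from my config, not a recommendation):

```ini
# cpuset: the 4 ldm threads float as a group over CPUs 2-5;
# the scheduler may move any of them between those CPUs
ThreadConfig=ldm={count=4,cpuset=2,3,4,5}

# cpubind: each of the 4 ldm threads is pinned to exactly one
# CPU in the range 2-5 and never migrates
ThreadConfig=ldm={count=4,cpubind=2-5}
```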

It may be my fault, but I can't understand what's wrong.
[2 Nov 2018 13:02] Bogdan Kecman
Hi Andrea,

Please send us the full log files with trace files (ndb_error_reporter) so we can investigate.

thanks
Bogdan
[3 Nov 2018 1:00] Andrea Bravetti
ndb error report

Attachment: ndb_error_report_20181103015351.tar.bz2 (application/x-bzip, text), 1.72 MiB.

[3 Nov 2018 1:01] Andrea Bravetti
Hi Bogdan,
ndb_error_report_20181103015351.tar.bz2 attached.
[3 Nov 2018 1:06] Andrea Bravetti
Bogdan,
I cleaned the log and replicated the error on node 3 just before creating the report, so node 3's log is useful and clean.

Node 2's log is bigger and may contain a lot of useless info...
[3 Nov 2018 1:16] Andrea Bravetti
On another cluster, with smaller data and less memory but an identical configuration, when I try ldm count=4 it remains stuck after printing a lot of lines like this:

2018-11-03 00:39:33 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: JobHandling in block: 247, gsn: 365 elapsed=18029
2018-11-03 00:39:33 [ndbd] INFO     -- Watchdog: User time: 2217  System time: 826

One ldm thread is at 100% CPU and stuck forever (well, not forever obviously, but for some hours, while it should start in a few minutes).

In this case too the problem is related to ldm count 4; with ldm count 2 it starts normally. I tried this server because it starts faster than the other one, and I tried to collect logs here, but I encountered this (different) error, so I replicated it on the original, bigger cluster and took the logs there.
[3 Nov 2018 2:05] Andrea Bravetti
Don't know if it is important, but on both clusters I'm using:

isolcpus=2-7

and:

IRQBALANCE_BANNED_CPUS=fc
[3 Nov 2018 11:01] Andrea Bravetti
I just noticed there is a trivial error in my config.ini:

2018-11-03 01:37:52 [MgmtSrvr] INFO     -- Loaded config from '/usr/mysql-cluster/ndb_1_config.bin.6'
2018-11-03 01:37:52 [MgmtSrvr] WARNING  -- at line 33: ThreadConfig overrides MaxNoOfExecutionThreads
2018-11-03 01:37:52 [MgmtSrvr] WARNING  -- at line 39: ThreadConfig overrides MaxNoOfExecutionThreads

However, the problem persists even without MaxNoOfExecutionThreads.
[5 Nov 2018 1:47] Bogdan Kecman
Hi,

After changing the LDM count you would at least have to do an initial restart of all nodes, as LDM(3) and LDM(4) are not being properly initialised since they have no fragments stored in them. Given that, this is not a bug, but I'll go through the traces you provided to see if there's anything else there.
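A rough sketch of the kind of rolling initial restart described here (the node id 2 and the management host name are placeholders for your own setup):

```shell
# Stop data node 2 from a management client
ndb_mgm -e "2 STOP"

# On node 2's host, start it again with --initial so its local
# checkpoint and redo-log data are wiped and rebuilt, letting the
# new LDM instances be initialised with fragments
ndbmtd --initial --ndb-connectstring=mgmhost

# Wait for the node to rejoin ("Node 2: started"), then repeat
# for each remaining data node in turn
```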

Thanks
Bogdan
[5 Nov 2018 18:48] Andrea Bravetti
Bogdan,
you are right, it was my fault, I'm sorry.

As you can see from node 2's log, I tried with an initial restart one time, but that run was aborted and the next start was not initial.

The error is not that simple to understand; it might be useful to give a warning at start instead of an error at the end, but it's definitely not a bug with the ldm count.
[5 Nov 2018 18:58] Bogdan Kecman
Hi,

> As you can see from node 2 log I tried with initial one time but that 
> run was aborted and next start was not initial.

Yes, so... you had the right idea, but...

> Not too simple to understand the error, it may be usefull to 
> give some usefull warning on start instead that error at the
> end, but definitely not a bug with ldm count.

Error messages in ndbcluster are something that could use a lot of work, we can all agree on that, especially for these advanced functionalities; but we're pushing on stability and usability while still adding more features, relying on users who try the advanced options to either figure them out or bounce info through the support/forum/bug system... with some additional info pushed through our dev blogs. So it's not an ideal situation, but it will get better :)

thanks
Bogdan
[5 Nov 2018 19:19] Bogdan Kecman
btw, note that in certain situations, to increase the LDM count you may even have to do a backup/restore, as an initial restart might fail
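The backup/restore path mentioned above, sketched with the standard NDB tools (the backup id 1, node ids, and backup path are placeholders for your own setup):

```shell
# Take an online backup from a management client
ndb_mgm -e "START BACKUP"

# After recreating the cluster with the new ThreadConfig:
# restore metadata (-m) once from one node, then data (-r)
# from each data node's backup in turn
ndb_restore -b 1 -n 2 -m -r --backup-path=/var/lib/mysql-cluster/BACKUP/BACKUP-1
ndb_restore -b 1 -n 3 -r --backup-path=/var/lib/mysql-cluster/BACKUP/BACKUP-1
```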