Bug #93045 DBLQH (Line: 17946) 0x00000002 Check lcpPtr.p->lcpPrepareState == LcpRecord::LCP
Submitted: 1 Nov 2018 15:49 Modified: 5 Nov 2018 18:58
Reporter: Andrea Bravetti Email Updates:
Status: Not a Bug Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version:7.6.8 OS:Linux (CentOS Linux 7)
Assigned to: Bogdan Kecman CPU Architecture:x86
Tags: DBLQH 17946 lcpPrepareState LcpRecord LCP Forced shutdown

[1 Nov 2018 15:49] Andrea Bravetti
Description:
While increasing the number of ldm threads I encountered a forced node shutdown with 4 threads.

OK with:

ThreadConfig=ldm={count=2,cpuset=2,3},recv={count=1,cpuset=4},send={count=1,cpuset=5},rep={count=1,cpuset=6},tc={count=1,cpuset=7}

Error with:

ThreadConfig=ldm={count=4,cpuset=2,3,4,5},tc={count=2,cpuset=6,7},recv={count=2,cpuset=6,7},send={count=2,cpuset=6,7},rep={count=1,cpuset=6,7}

Error in ndb_mgm:

Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Error in ndbmtd:

2018-11-01 16:24:29 [ndbd] INFO     -- Restore Database Off-line Completed
2018-11-01 16:24:29 [ndbd] INFO     -- Bring Database On-line Starting
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(2): Starting to copy fragments.
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(1): Starting to copy fragments.
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(1): Completed copy of fragment T2F0. Changed +0/-57 rows, 3192 bytes. 5 pct churn to 1108 rows.
2018-11-01 16:24:29 [ndbd] INFO     -- LDM(2): Completed copy of fragment T2F1. Changed +0/-97 rows, 5432 bytes. 9 pct churn to 1068 rows.
2018-11-01 16:24:29 [ndbd] INFO     -- (2)Starting local LCP(1669,1)
2018-11-01 16:24:29 [ndbd] INFO     -- (1)Starting local LCP(1669,1)
2018-11-01 16:24:29 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-30900976-1539000784.68/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2018-11-01 16:24:29 [ndbd] INFO     -- DBLQH (Line: 17946) 0x00000002 Check lcpPtr.p->lcpPrepareState == LcpRecord::LCP_IDLE failed
2018-11-01 16:24:29 [ndbd] INFO     -- Error handler shutting down system
2018-11-01 16:24:30 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-11-01 16:24:30 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
Not sure if this happens for everyone, but for me this line triggers the error:

ThreadConfig=ldm={count=4,cpuset=2,3,4,5},tc={count=2,cpuset=6,7},recv={count=2,cpuset=6,7},send={count=2,cpuset=6,7},rep={count=1,cpuset=6,7}

Using only 2 ldm threads always leaves the node working.
[1 Nov 2018 21:25] Andrea Bravetti
If I use cpubind instead of cpuset (ldm={count=4,cpubind=2-5}), I get a different error:

2018-11-01 21:56:46 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-30900976-1539000784.68/rpm/BUILD/mysql-cluster-gpl-7.6.8/mysql-cluster-gpl-7.6.8/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2018-11-01 21:56:46 [ndbd] INFO     -- DBLQH (Line: 18014) 0x00000002 Check !isNdbMtLqh() failed
2018-11-01 21:56:46 [ndbd] INFO     -- Error handler shutting down system
2018-11-01 21:56:46 [ndbd] INFO     -- Error handler shutdown completed - exiting
2018-11-01 21:56:46 [ndbd] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

With 2 ldm threads, even with cpubind, it works.
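For reference, a minimal sketch of the two binding styles compared above (the CPU numbers are just the ones from my config, not a recommendation):

```ini
# cpuset: the 4 ldm threads float as a group over CPUs 2-5;
# the scheduler may move any of them between those CPUs
ThreadConfig=ldm={count=4,cpuset=2,3,4,5}

# cpubind: each of the 4 ldm threads is pinned to exactly one
# CPU in the range 2-5 and never migrates
ThreadConfig=ldm={count=4,cpubind=2-5}
```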

It may be my fault, but I can't understand what's wrong.
[2 Nov 2018 13:02] Bogdan Kecman
Hi Andrea,

Please send us the full log files with trace files (ndb_error_reporter) so we can investigate.

thanks
Bogdan
[3 Nov 2018 1:00] Andrea Bravetti
ndb error report

Attachment: ndb_error_report_20181103015351.tar.bz2 (application/x-bzip, text), 1.72 MiB.

[3 Nov 2018 1:01] Andrea Bravetti
Hi Bogdan,
ndb_error_report_20181103015351.tar.bz2 attached.
[3 Nov 2018 1:06] Andrea Bravetti
Bogdan,
I cleaned the log and replicated the error on node 3 just before creating the report, so node 3's log is useful and clean.

Node 2's log is bigger and may contain a lot of useless info...
[3 Nov 2018 1:16] Andrea Bravetti
On another cluster, with smaller data and less memory but an identical configuration, when I try ldm count=4 it remains stuck after printing a lot of lines like this:

2018-11-03 00:39:33 [ndbd] WARNING  -- Ndb kernel thread 2 is stuck in: JobHandling in block: 247, gsn: 365 elapsed=18029
2018-11-03 00:39:33 [ndbd] INFO     -- Watchdog: User time: 2217  System time: 826

One ldm thread is at 100% CPU and stuck forever (well, not forever obviously, but for some hours, while it should start in a few minutes).

In this case too the problem is related to ldm count 4; with ldm count 2 it starts normally. I tried this server because it starts faster than the other one, and I tried to collect logs here, but I encountered this (different) error, so I replicated it on the original, bigger cluster and took the logs there.
[3 Nov 2018 2:05] Andrea Bravetti
Don't know if it is important, but on both clusters I'm using:

isolcpus=2-7

and:

IRQBALANCE_BANNED_CPUS=fc
[3 Nov 2018 11:01] Andrea Bravetti
I just noticed there is a trivial error in my config.ini:

2018-11-03 01:37:52 [MgmtSrvr] INFO     -- Loaded config from '/usr/mysql-cluster/ndb_1_config.bin.6'
2018-11-03 01:37:52 [MgmtSrvr] WARNING  -- at line 33: ThreadConfig overrides MaxNoOfExecutionThreads
2018-11-03 01:37:52 [MgmtSrvr] WARNING  -- at line 39: ThreadConfig overrides MaxNoOfExecutionThreads

However, the problem persists even without MaxNoOfExecutionThreads.
[5 Nov 2018 1:47] Bogdan Kecman
Hi,

After changing the LDM count you would at least have to do an initial restart of all nodes, as LDM(3) and LDM(4) are not being properly initialised since they have no fragments stored in them. Given that, this is not a bug, but I'll go through the traces you provided to see if there's anything else there.
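A rough sketch of the kind of rolling initial restart described here (the node id 2 and the management host name are placeholders for your own setup):

```shell
# Stop data node 2 from a management client
ndb_mgm -e "2 STOP"

# On node 2's host, start it again with --initial so its local
# checkpoint and redo-log data are wiped and rebuilt, letting the
# new LDM instances be initialised with fragments
ndbmtd --initial --ndb-connectstring=mgmhost

# Wait for the node to rejoin ("Node 2: started"), then repeat
# for each remaining data node in turn
```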

Thanks
Bogdan
[5 Nov 2018 18:48] Andrea Bravetti
Bogdan,
you are right, it was my fault, I'm sorry.

As you can see from node 2's log, I tried with an initial restart one time, but that run was aborted and the next start was not initial.

The error is not that simple to understand; it might be useful to give a warning at start instead of an error at the end, but it's definitely not a bug with the ldm count.
[5 Nov 2018 18:58] Bogdan Kecman
Hi,

> As you can see from node 2 log I tried with initial one time but that 
> run was aborted and next start was not initial.

Yes, so... you had the right idea, but...

> Not too simple to understand the error, it may be usefull to 
> give some usefull warning on start instead that error at the
> end, but definitely not a bug with ldm count.

Error messages in ndbcluster are something that could use a lot of work, we can all agree on that, especially for these advanced functionalities; but we're pushing on stability and usability while still adding more features, relying on users who try the advanced options to either figure them out or bounce info through the support/forum/bug system... with some additional info pushed through our dev blogs. So it's not an ideal situation, but it will get better :)

thanks
Bogdan
[5 Nov 2018 19:19] Bogdan Kecman
btw, note that in certain situations, to increase the LDM count you may even have to do a backup/restore, as an initial restart might fail
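The backup/restore path mentioned above, sketched with the standard NDB tools (the backup id 1, node ids, and backup path are placeholders for your own setup):

```shell
# Take an online backup from a management client
ndb_mgm -e "START BACKUP"

# After recreating the cluster with the new ThreadConfig:
# restore metadata (-m) once from one node, then data (-r)
# from each data node's backup in turn
ndb_restore -b 1 -n 2 -m -r --backup-path=/var/lib/mysql-cluster/BACKUP/BACKUP-1
ndb_restore -b 1 -n 3 -r --backup-path=/var/lib/mysql-cluster/BACKUP/BACKUP-1
```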