Description:
Hello,
Last night, all ndbcluster nodes on my development rig crashed (forced node shutdown on both nodes). I tried restarting the cluster, but the nodes have been stuck in the "starting" state since around 1am last night (it's now 12:11pm, Pacific time).
Backstory: I migrated over 10 tables from InnoDB into NDB Cluster, using a 750GB tablespace. My software didn't encounter SQL errors or foreign-key errors when performing mass inserts into these tables back when they were using the InnoDB engine. I recently decided to switch to NDB because of its synchronous writes across nodes and its high availability (both a must for my application), and I'm still learning/fine-tuning its configuration parameters.
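For reference, the migration to NDB disk data was done with statements roughly like the following (the logfile group, tablespace, table names, and sizes here are placeholders, not my real schema; each table was altered the same way):

  -- Placeholder names/sizes, shown only to illustrate the migration steps.
  CREATE LOGFILE GROUP lg_app
    ADD UNDOFILE 'lg_app_undo_01.log'
    INITIAL_SIZE 4G
    ENGINE NDBCLUSTER;

  CREATE TABLESPACE ts_app
    ADD DATAFILE 'ts_app_data_01.dat'
    USE LOGFILE GROUP lg_app
    INITIAL_SIZE 64G
    ENGINE NDBCLUSTER;

  -- Repeated for each of the 10+ tables that were moved out of InnoDB.
  ALTER TABLE orders
    TABLESPACE ts_app STORAGE DISK
    ENGINE NDBCLUSTER;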
I'm not too familiar with the NDB internals, but I suspect the cluster crashed under a heavy write load. Another possibility is that my database schema uses some foreign keys; I searched online for the error message and noticed that others fixed the issue by rebuilding their foreign keys.
Questions:
1) How can I get my development cluster past the "starting" state and bring it back online?
2) Do I need to remove all foreign keys from my schema to prevent this error from occurring on my production cluster? (A sketch of what I assume that would involve is below.)
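If the answer to question 2 is yes, I'm assuming the workaround would be something like the following for each child table (the constraint, table, and column names below are made up for illustration; my real names differ):

  -- Hypothetical example only: drop the FK before the heavy load, re-add it after.
  ALTER TABLE child_table
    DROP FOREIGN KEY fk_child_parent;
  -- ...run the heavy insert load...
  ALTER TABLE child_table
    ADD CONSTRAINT fk_child_parent
    FOREIGN KEY (parent_id) REFERENCES parent_table (id);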
Error Log Message (occurs on both cluster nodes):
Time: Friday 12 January 2018 - 01:00:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: pgman.cpp
Error object: PGMAN (Line: 685) 0x00000002 Check false failed
Program: ndbmtd
Pid: 8265 thr: 3
Version: mysql-5.7.18 ndb-7.6.3
Trace file name: ndb_1_trace.log.5_t3
Trace file path: /home/mysql-cluster/data/ndb_1_trace.log.5 [t1..t5]
***EOM***
I've set aside the trace files and can email/upload them if you need them.
Thanks!
Best regards,
Jorge
How to repeat:
I'm not sure how to repeat the error. My server was performing a heavy write load against one of the two cluster nodes, using multiple processes that insert data concurrently. The presence of foreign keys may have been a contributing factor, but I'm not sure. A rough sketch of the workload is below.
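To give an idea of the shape of the workload (table and column names are placeholders, not my actual schema), each process was running batched inserts in a loop, along these lines:

  -- Placeholder schema: parent_table and child_table are linked by a foreign key.
  START TRANSACTION;
  INSERT INTO parent_table (id, created_at)
    VALUES (1001, NOW()), (1002, NOW()), (1003, NOW());
  INSERT INTO child_table (parent_id, payload)
    VALUES (1001, REPEAT('x', 255)), (1002, REPEAT('x', 255)), (1003, REPEAT('x', 255));
  COMMIT;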
Suggested fix:
I haven't been able to resolve this yet. Restarting the cluster leaves both nodes stuck in the "starting" state, and they've been stuck there for over 11 hours now. (It typically takes less than a minute to start the cluster.)