Description:
Hello,
Last night, all ndbcluster nodes on my development rig crashed (forced node shutdown on both nodes). I tried restarting the cluster, but the nodes have been stuck in the "starting" state since around 1am last night (it's now 12:11pm, Pacific time).
Backstory: I migrated over 10 tables from InnoDB into NDB Cluster, using a 750GB tablespace. My software didn't encounter SQL errors or foreign-key errors when performing mass inserts into these tables back when they were using the InnoDB engine. I recently decided to switch to NDB because of its synchronous writes across nodes and its high availability (both a must for my application), and I'm still learning/fine-tuning its configuration parameters.
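For reference, the migration to NDB disk data was done with statements roughly like the following (the logfile group, tablespace, table names, and sizes here are placeholders, not my real schema; each table was altered the same way):

  -- Placeholder names/sizes, shown only to illustrate the migration steps.
  CREATE LOGFILE GROUP lg_app
    ADD UNDOFILE 'lg_app_undo_01.log'
    INITIAL_SIZE 4G
    ENGINE NDBCLUSTER;

  CREATE TABLESPACE ts_app
    ADD DATAFILE 'ts_app_data_01.dat'
    USE LOGFILE GROUP lg_app
    INITIAL_SIZE 64G
    ENGINE NDBCLUSTER;

  -- Repeated for each of the 10+ tables that were moved out of InnoDB.
  ALTER TABLE orders
    TABLESPACE ts_app STORAGE DISK
    ENGINE NDBCLUSTER;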
I'm not too familiar with the NDB internals, but I suspect the cluster crashed under a heavy write load. Another possibility is that my database schema uses some foreign keys; I searched online for the error message and noticed that others fixed the issue by rebuilding their foreign keys.
Questions:
1) How can I get my development cluster past the "starting" state and bring it back online?
2) Do I need to remove all foreign keys from my schema to prevent this error from occurring on my production cluster? (A sketch of what I assume that would involve is below.)
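If the answer to question 2 is yes, I'm assuming the workaround would be something like the following for each child table (the constraint, table, and column names below are made up for illustration; my real names differ):

  -- Hypothetical example only: drop the FK before the heavy load, re-add it after.
  ALTER TABLE child_table
    DROP FOREIGN KEY fk_child_parent;
  -- ...run the heavy insert load...
  ALTER TABLE child_table
    ADD CONSTRAINT fk_child_parent
    FOREIGN KEY (parent_id) REFERENCES parent_table (id);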
Error Log Message (occurs on both cluster nodes):
Time: Friday 12 January 2018 - 01:00:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: pgman.cpp
Error object: PGMAN (Line: 685) 0x00000002 Check false failed
Program: ndbmtd
Pid: 8265 thr: 3
Version: mysql-5.7.18 ndb-7.6.3
Trace file name: ndb_1_trace.log.5_t3
Trace file path: /home/mysql-cluster/data/ndb_1_trace.log.5 [t1..t5]
***EOM***
I've set aside the trace files and can email/upload them if you need them.
Thanks!
Best regards,
Jorge
How to repeat:
I'm not sure how to repeat the error. My server was performing a heavy write load against one of the two cluster nodes, using multiple processes that insert data concurrently. The presence of foreign keys may have been a contributing factor, but I'm not sure. A rough sketch of the workload is below.
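To give an idea of the shape of the workload (table and column names are placeholders, not my actual schema), each process was running batched inserts in a loop, along these lines:

  -- Placeholder schema: parent_table and child_table are linked by a foreign key.
  START TRANSACTION;
  INSERT INTO parent_table (id, created_at)
    VALUES (1001, NOW()), (1002, NOW()), (1003, NOW());
  INSERT INTO child_table (parent_id, payload)
    VALUES (1001, REPEAT('x', 255)), (1002, REPEAT('x', 255)), (1003, REPEAT('x', 255));
  COMMIT;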
Suggested fix:
I haven't been able to resolve this yet. Restarting the cluster leaves both nodes stuck in the "starting" state, and they've been stuck there for over 11 hours now. (It typically takes less than a minute to start the cluster.)