MySQL Bugs: #92158: NDB Cluster 5.7.23 crush

Bug #92158	NDB Cluster 5.7.23 crush
Submitted:	23 Aug 2018 13:12	Modified:	27 Aug 2018 10:48
Reporter:	Christoforos Demetriou
Status:	Not a Bug
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.7.23	OS:	Other (amazon linux)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any (64bit)
Tags:	ndb-cluster

Description:
Hello,

We run NDB Cluster 5.7.23 with 4 data nodes in 2 node groups and 2 management nodes.
The whole cluster went down. I checked for network problems with Amazon and in the logs, but found nothing.

Node 1-4: Data Node
Node 200-201: Management Node
Node 5-30: SQL Node

Here is the log of management node 2:

https://pastebin.com/DHEg2qWq

The data node error:

https://pastebin.com/tyiPL4Tc

Can you help me understand what happened?

How to repeat:
I don't know :)

To process this bug, you need to provide a repeatable test case. Thanks.

Hi,

This is not a bug. The log files show you have network errors. MySQL Cluster uses synchronous replication between data nodes, and a stable network is required for its proper operation. It's not something you can normally run on AWS.

best regards
Bogdan

Hello all, and thanks :)

I don't understand why this is a network issue when I checked the messages log on the server (/var/log/messages) and it did not show any problem; I saw a ping every 2 seconds.

I understand that it missed 4 heartbeats and disconnected, but afterwards it reconnected and then disconnected again.

Regards,

Hi,

> I don't understand why this is a network issue when I checked the messages log on the server (/var/log/messages) and it did not show any problem,

Why would you find anything related to network quality in messages? The absence of errors in the messages log does not mean your network connection is OK. Running MTR for a while might show some issues.

> I understand that it missed 4 heartbeats and disconnected, but afterwards it reconnected and then disconnected again.
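
As a rough illustration of what to look for in such a run, here is a minimal Python sketch that summarizes a series of round-trip probes. It is not MTR itself and does no network I/O; the function name `summarize_probes`, the `slow_ms` threshold, and the sample values are all hypothetical, with `None` standing for a probe that got no reply.

```python
from statistics import mean
from typing import Optional, Sequence


def summarize_probes(rtts_ms: Sequence[Optional[float]], slow_ms: float) -> dict:
    """Summarize round-trip times in milliseconds; None marks a lost probe."""
    answered = [r for r in rtts_ms if r is not None]
    lost = len(rtts_ms) - len(answered)
    return {
        "sent": len(rtts_ms),
        # Share of probes that never got a reply.
        "loss_pct": 100.0 * lost / len(rtts_ms) if rtts_ms else 0.0,
        "avg_ms": mean(answered) if answered else None,
        "worst_ms": max(answered) if answered else None,
        # Replies that arrived, but slower than the chosen threshold.
        "slow": sum(1 for r in answered if r > slow_ms),
    }


if __name__ == "__main__":
    # Made-up samples: one lost probe and one very slow reply.
    print(summarize_probes([1.0, None, 3.0, 250.0], slow_ms=100.0))
```

Occasional lost or very slow probes like these would never appear in /var/log/messages, yet they can be enough to make heartbeats arrive late.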

Yes, due to network errors nodes will shut down, and when not enough nodes are up to form a majority consensus the remaining nodes will shut down to prevent further issues. You can try MySQL Cluster 7.6, which is less sensitive to a bad network, but what you can see in the log is a network issue. You can tweak the config a bit, and for that you can contact our support team to help you out.

kind regards
Bogdan
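
The behaviour described in this reply (a node is dropped after 4 missed heartbeats, and the cluster stops when the survivors cannot form a majority) can be sketched roughly as follows. This is a simplified Python model, not NDB's actual implementation: the class `HeartbeatMonitor`, the function `has_majority`, and the 1500 ms interval are assumptions for illustration, and real NDB arbitration is more involved and is not modeled here.

```python
from dataclasses import dataclass

MISSED_LIMIT = 4  # from the thread: a node is closed after 4 missed heartbeats


@dataclass
class HeartbeatMonitor:
    """Tracks one peer; all times are in milliseconds (hypothetical model)."""
    interval_ms: int
    missed_limit: int = MISSED_LIMIT
    last_seen_ms: int = 0

    def record(self, now_ms: int) -> None:
        # A heartbeat arrived: reset the reference point.
        self.last_seen_ms = now_ms

    def missed(self, now_ms: int) -> int:
        # Whole heartbeat intervals elapsed since the last heartbeat.
        return max(0, (now_ms - self.last_seen_ms) // self.interval_ms)

    def is_dead(self, now_ms: int) -> bool:
        return self.missed(now_ms) >= self.missed_limit


def has_majority(alive: int, total: int) -> bool:
    # Strict majority only; a simplification of "majority consensus".
    return 2 * alive > total


if __name__ == "__main__":
    mon = HeartbeatMonitor(interval_ms=1500)  # interval is an assumed value
    mon.record(0)
    print(mon.is_dead(5999), mon.is_dead(6000))
    print(has_majority(3, 4), has_majority(2, 4))
```

In this model a delay of just under four intervals is tolerated, and one more late heartbeat tips the peer over the limit, which is why a network that looks healthy most of the time can still trigger repeated disconnects.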

Thanks for your help and for everything :) 
Regards,

Hello again, and sorry for the trouble :)

Which variable can we change on the management node and on the data nodes to increase the maximum time that a node waits for a heartbeat response?

Does this variable have to be changed only on the management node, or also on the data nodes and the SQL nodes?

Thanks and sorry for the trouble :)

Regards,