Bug #92158 NDB Cluster 5.7.23 crash
Submitted: 23 Aug 2018 13:12
Modified: 27 Aug 2018 10:48
Reporter: Christoforos Demetriou
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.7.23
OS: Other (amazon linux)
Assigned to: MySQL Verification Team
CPU Architecture: Any (64bit)
Tags: ndb-cluster

[23 Aug 2018 13:12] Christoforos Demetriou
Description:
Hello,

We run NDB Cluster 5.7.23 with 4 data nodes in 2 node groups and 2 management nodes.
The whole cluster went down. I checked for network problems with Amazon and in the logs, but found nothing.

Node 1-4: Data Node
Node 200-201: Management Node
Node 5-30: SQL Node

Here is the log of management node 2:

https://pastebin.com/DHEg2qWq

The data node error:

https://pastebin.com/tyiPL4Tc

Can you help us understand what happened?

How to repeat:
I don't know :)
[23 Aug 2018 15:13] MySQL Verification Team
To process this bug report, you need to provide a repeatable test case. Thanks.
[23 Aug 2018 15:18] MySQL Verification Team
Hi,

This is not a bug. The log files show that you have network errors. MySQL Cluster uses synchronous replication between data nodes, and a stable network is required for its proper operation. It's not something you can normally run on AWS.

best regards
Bogdan
[24 Aug 2018 12:24] Christoforos Demetriou
Hello all, and thanks :)

I don't understand why this is a network issue: I checked the server's messages log, /var/log/messages, and it did not show any problem. I saw a ping every 2 seconds.

I understand that the node missed 4 heartbeats and was disconnected, but afterwards it reconnects and then disconnects again.

Regards,
[24 Aug 2018 12:43] MySQL Verification Team
Hi,

> I don't understand why this is a network issue: I checked the server's messages log, /var/log/messages, and it did not show any problem.

Why would you find anything related to network quality in messages? The absence of errors in the messages log does not mean your network connection is OK. Running MTR for a while might reveal some issues.

> I understand that the node missed 4 heartbeats and was disconnected, but afterwards it reconnects and then disconnects again.

Yes, due to network errors nodes will shut down, and when not enough nodes are up to form a majority consensus, the remaining nodes will shut down as well to prevent further issues. You can try MySQL Cluster 7.6; it is less sensitive to a bad network, but what you see in the log is a network issue. You can tweak the config a bit, and for that you can contact our support team to help you out.
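
The heartbeat behaviour discussed above can be sketched as follows. This is only an illustrative model, not NDB's actual implementation: the class `HeartbeatMonitor` and the names `interval_ms` and `max_missed` are hypothetical, and the threshold of 4 is taken from the "missed 4 heartbeats" observation in this thread.

```python
class HeartbeatMonitor:
    """Toy model: a peer is declared dead after N consecutive missed heartbeats."""

    def __init__(self, interval_ms, max_missed=4):
        self.interval_ms = interval_ms  # assumed time between expected heartbeats
        self.max_missed = max_missed    # consecutive misses tolerated before failure
        self.missed = 0
        self.alive = True

    def tick(self, heartbeat_received):
        """Call once per heartbeat interval; returns whether the peer is still alive."""
        if not self.alive:
            return False  # a peer declared dead stays dead until it rejoins
        if heartbeat_received:
            self.missed = 0  # any heartbeat resets the counter
        else:
            self.missed += 1
            if self.missed >= self.max_missed:
                self.alive = False
        return self.alive

    def worst_case_silence_ms(self):
        """Length of silence (ms) that triggers a failure in this model."""
        return self.interval_ms * self.max_missed


monitor = HeartbeatMonitor(interval_ms=1500)
# Three missed heartbeats followed by one received: the peer survives.
for received in [True, False, False, False, True]:
    monitor.tick(received)
survived_short_gap = monitor.alive
# Four missed heartbeats in a row: the peer is declared dead.
for received in [False, False, False, False]:
    monitor.tick(received)
dead_after_long_gap = not monitor.alive
```

In this model, a network stall only has to outlast `interval_ms * max_missed` once for a node to be dropped, which is why an occasional ping succeeding every 2 seconds does not rule out a network problem.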

kind regards
Bogdan
[24 Aug 2018 14:11] Christoforos Demetriou
Thanks for your help and for everything :) 
Regards,
[27 Aug 2018 10:48] Christoforos Demetriou
Hello again, and sorry for the trouble :)

Which variable can we change on the management node and on the data nodes to increase the maximum time a node waits for a heartbeat response?

Does this variable have to be changed only on the management node, or also on the data nodes and the SQL nodes?

Thanks and sorry for the trouble :)

Regards,