MySQL Bugs: #85584: MYSQL cluster restart failing after initialization

Bug #85584	MYSQL cluster restart failing after initialization
Submitted:	22 Mar 2017 15:49	Modified:	25 Apr 2017 13:23
Reporter:	Zeljko Zuvic	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	ndb-7.4.7	OS:	CentOS (6.6)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	No message slogan found

Description:
Hi,

We have a MySQL Cluster with real-time replication, with one management node and 2 data nodes.
A few days ago we wanted to change the cluster configuration in order to enable NDB Cluster backup and to tune some parameters for the redo logs.
After the cluster initialization (required by the documentation for these config changes) and a restore of the MySQL DB data (made with mysqldump), everything went fine. But when we tried a sanity restart of the cluster, we got this error on one of the nodes during ndbd service start:
"Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 32782: 'No message slogan found (please report a bug if you get this error code)(Unknown). Unknown'."
Whenever we try to restart the cluster after initialization and data restore we get the same error message, even though we rolled the cluster configuration back to its previous state.

We could not determine the cause from this error message. Could you please help us find and solve the problem?

Thanks,

Zeljko

How to repeat:
1. start ndb_mgmd (arbitrator node)
2. start ndbd services with the --initial option
3. start mysqld services
4. restore DB data from backup (made by mysqldump)
5. stop mysqld services
6. restart ndbd services without --initial

Hi,

I understand you can't fetch logs with ndb_error_reporter, but you can compress and upload all logs from the crashing node manually, and also upload the logs from your management nodes.

If you again start the node that crashed with "Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 32782: 'No message slogan found (please report a bug if you get this error code)(Unknown). Unknown'.", will it start, or will it stop at startphase 4 again?

To get your cluster up, start the crashing node with --initial so it can re-fetch the data from the surviving node.

all best
Bogdan

P.S. I tried reproducing this with 7.4.7, without any luck.

Hi Bogdane,

I have just uploaded the requested logs from the data nodes and the management node, and I hope they are enough for troubleshooting.
I also tried to start the node again after it crashed, but it failed with the same error at startphase 4.
At the same time the other node fails with the error: "-- Node 2: Forced node shutdown completed. Occured during startphase 4. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'."
And vice versa: sometimes during restart the 1st node fails with the error "No message slogan found ...." and the 2nd one then stops with "Another node failed during system restart ...."
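
To tell the node that failed first from the one that merely followed it down, the shutdown lines can be pulled apart mechanically. The following is only a rough Python sketch, assuming the messages look like the ones quoted in this thread; the names `Shutdown` and `parse_shutdown` are made up for illustration and are not part of any MySQL tool.

```python
import re
from typing import NamedTuple, Optional


class Shutdown(NamedTuple):
    node_id: int
    start_phase: int
    error_code: int
    message: str


# Pattern modelled on the messages quoted in this thread (assumption:
# other NDB versions may word these lines differently).
_PATTERN = re.compile(
    r"Node (\d+): Forced node shutdown completed\. "
    r"Occurr?ed during startphase (\d+)\. "
    r"Caused by error (\d+): '(.*)'"
)


def parse_shutdown(line: str) -> Optional[Shutdown]:
    """Extract node id, start phase and error code from one log line."""
    match = _PATTERN.search(line)
    if match is None:
        return None
    return Shutdown(
        node_id=int(match.group(1)),
        start_phase=int(match.group(2)),
        error_code=int(match.group(3)),
        message=match.group(4),
    )
```

A line carrying error 2308 only says that another node failed, so the line with the other error code (32782 here) is the one worth investigating.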

So it looks hopeless so far.

Many thanks for your support!

Zeljko

Hi Zeljko,

The error you are getting is: File system open failed. OS errno: 4294967295

so there is a problem with your file system. Possible reasons:
- wrong permissions of the files
- wrong ownership of the files
- filesystem corruption
- hardware error
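
A side note on the errno value itself: 4294967295 is 2^32 - 1, which is what -1 looks like when it is printed as an unsigned 32-bit number. So it is most likely a generic "failed" value rather than a specific OS error code. A minimal Python sketch of that reinterpretation (the helper name `as_signed_32` is only illustrative):

```python
def as_signed_32(value: int) -> int:
    """Reinterpret an unsigned 32-bit value as a signed 32-bit integer."""
    value &= 0xFFFFFFFF  # keep only the low 32 bits
    return value - 2**32 if value >= 2**31 else value


print(as_signed_32(4294967295))  # prints -1
```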

But I doubt it's related to mysql cluster itself.

Can you check the permissions/ownership settings of the cluster data directory, and can you please check the whole filesystem too? Do you have an antivirus installed?
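
For the permissions/ownership check, something along these lines could be used. It is only a sketch in Python: the function name `find_suspect_files` is made up, and both the directory to scan and the uid that is expected to own the files are assumptions you have to supply for your own setup.

```python
import os
import stat


def find_suspect_files(data_dir, expected_uid):
    """Return (path, reason) pairs for entries under data_dir that are not
    owned by expected_uid or whose owner lacks read/write permission."""
    needed = stat.S_IRUSR | stat.S_IWUSR
    problems = []
    for root, dirs, files in os.walk(data_dir):
        for name in dirs + files:
            path = os.path.join(root, name)
            try:
                info = os.lstat(path)
            except OSError as exc:
                problems.append((path, "cannot stat: %s" % exc.strerror))
                continue
            if info.st_uid != expected_uid:
                problems.append(
                    (path, "owned by uid %d, expected %d" % (info.st_uid, expected_uid))
                )
            if (info.st_mode & needed) != needed:
                problems.append(
                    (path, "owner lacks read/write: %s" % stat.filemode(info.st_mode))
                )
    return problems
```

Call it with the data directory of the failing node and the uid of the account that runs ndbd, then print the returned pairs. An empty result does not rule out filesystem corruption or hardware errors; it only covers the first two points.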

all best
Bogdan

Hi Zeljko,

> Maybe is worth to mention that we have encrypted partition
..
> Regarding antivirus, yes we have some version
..
> Do you have some recommendation what should be our next step to do?

Well, since you are not able to reproduce this (exactly the same everything) I can only guess but I doubt it's related to MySQL :(

The next steps:
 - check all your system/kernel logs to see if there are any issues with that encrypted partition
 - check the log from your antivirus
 - disable the antivirus for the cluster datadir

I think it was the antivirus that corrupted the cluster datadir. MySQL Cluster does not play well with antivirus software (nor does regular MySQL Server), and any time they fight over a file, MySQL Cluster will shut itself down.

all best
Bogdan

Hi Bogdane,

You were absolutely right: the AV caused the issue we had.
After we excluded the MySQL files from AV scanning, everything started to work again.
Many thanks for the support!

Zeljko

Thanks for the update

Cheers
Bogdan