MySQL Bugs: #91100: NDB node can not pass phase 5 after restart. Error:2303

Bug #91100	NDB node can not pass phase 5 after restart. Error:2303
Submitted:	1 Jun 2018 5:31	Modified:	29 Jun 2018 11:30
Reporter:	Tsvetomir Penchev
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	mysql-5.6.22 ndb-7.3.8	OS:	CentOS (Linux 3.10.0-514.el7.x86_64)
Assigned to:	MySQL Verification Team	CPU Architecture:	x86
Tags:	error: 2303, error: 744, Killed by node as copyfrag failed, MySQL Cluster

Description:
After a restart, the NDB data node cannot get past start phase 5. The reported error is:
Forced node shutdown completed. Occured during startphase 5. Caused by error 2303: 'System error, node killed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

How to repeat:
The MySQL Cluster had been running for a long time. We needed to restart the machine hosting node 12, which was the node marked with the star (*) at the time. We stopped node 12, and node 11 became the one with the star. We then restarted the machine, but node 12 never finishes start phase 5.
We tried deleting everything from /dbpool/db4/ndb_data and also starting with --initial, but without success.

This looks like a duplicate of bug #90940 

Can you share more details about your operation?
Are you by any chance running DDL statements often, or doing anything else that might be relevant?

thanks
Bogdan

Hello Bogdan,

I searched this site and Google for a similar problem but could not find error 2303 paired with error 744, so I opened this report.

The database was created at the start of the project and has a fixed structure. No DDL commands have been executed since then. The tables are grouped into tablespaces, and each table has a MAX_ROWS definition. All tables are created with ".. STORAGE DISK ENGINE=NDB...".

Sample:
	CREATE TABLE BLOCKA.tblRooms (
		ROOM varchar(20) NOT NULL,
		ROOM_NAME varchar(100) NOT NULL,
		ROOM_DATA_LOGIC INT NOT NULL,
		ROOM_UNITS	BIGINT UNSIGNED NOT NULL,
		PRIMARY KEY ROOM(ROOM)
	)
	COLLATE='utf8_general_ci'
	TABLESPACE ts_RoomCommon STORAGE DISK
	ENGINE=NDB
	MAX_ROWS=200000;

The applications connected to the database are multi-threaded. They mainly use DCLs. In some cases a transaction is needed to keep the data correct.

We also did the following: stopped all applications, shut down both mysqld nodes, and tried to start node 12. The result was the same error at the same stage.

Regards,
Tsvetomir

Sorry, a typo: the applications mainly use DML statements (SELECT, INSERT, UPDATE, DELETE).

Hi,
The error is a bit different, as is the cluster version, but it looks like the same issue.

Let us analyze a bit more.

thanks for additional info
Bogdan

Hello,

Referring to https://downloads.mysql.com/docs/mysql-cluster-excerpt-5.6-en.pdf, on error 2303:
"Disk Data and GCP Stop errors. Errors encountered when using Disk Data tables such as Node nodeid killed this node because GCP stop was detected (error 2303) are often referred to as “GCP stop errors”. Such errors occur when the redo log is not flushed to disk quickly enough; this is usually due to slow disks and insufficient disk throughput."

Based on the above statement, we checked the disks holding the database files. They were a pair of 300 GB 10k SAS disks in a RAID-1 configuration. In the name of science we added two more disks, making it RAID-1+0. In theory the new configuration should deliver about twice the IOPS. We started the ndbd process and the database was replicated on the first attempt. Replication completed without stopping the database clients!
We waited a few days to make sure everything was OK, then did the same on the other machine.
Currently all nodes are up and running.
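
As a rough check of the expectation that four disks in RAID-1+0 give about double the IOPS of two disks in RAID-1, here is a minimal Python sketch. The per-disk figure (140 random IOPS for a 10k SAS drive) and the helper name raid_iops are assumptions for illustration only, not measurements from this cluster, and the model ignores controller cache and striping overhead.

```python
# Rough random-IOPS estimate for mirrored arrays.
# PER_DISK is an assumed value, not a measurement from this cluster.

def raid_iops(disks: int, per_disk_iops: float) -> dict:
    """Estimate IOPS for RAID-1 (2 disks) or RAID-1+0 (4, 6, ... disks)."""
    if disks < 2 or disks % 2:
        raise ValueError("mirrored arrays need an even number of disks (>= 2)")
    return {
        # A read can be served by any disk in the array.
        "read": disks * per_disk_iops,
        # A write must hit both disks of a mirror pair, so only
        # half of the disks contribute independent write capacity.
        "write": (disks // 2) * per_disk_iops,
    }

PER_DISK = 140.0  # assumed random IOPS of one 10k SAS disk

before = raid_iops(2, PER_DISK)  # original RAID-1 pair
after = raid_iops(4, PER_DISK)   # RAID-1+0 after adding two disks
ratio = after["write"] / before["write"]

print(f"RAID-1   write IOPS: {before['write']:.0f}")
print(f"RAID-1+0 write IOPS: {after['write']:.0f}")
print(f"write IOPS ratio: {ratio:.1f}")
```

Under these assumptions the write ratio comes out at 2, which matches the "twice" estimate. Whether that extra throughput is what let the redo log flush in time is not something this arithmetic can confirm.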

I don't know whether this was the root cause, but I wanted to share it for others who have a similar problem.