Bug #82514 Data Nodes Randomly Crashing
Submitted: 9 Aug 2016 14:13
Modified: 8 Sep 2016 6:35
Reporter: Joel Hanger
Status: Can't repeat
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: mysql Ver 14.14 Distrib 5.6.17-ndb-7.3.
OS: CentOS (release 6.5 (Final) x64)
Assigned to: MySQL Verification Team
CPU Architecture: Any
Tags: 7200, error, LCP fragment scan, node failure

[9 Aug 2016 14:13] Joel Hanger
Description:
We've been experiencing random node failures. 
The most recent one resulted in a lost transaction. 

Our configuration is 8 data nodes on AWS, 3 API (mysqld) nodes, and 2 MGM nodes.

Mysqld error log only reports:

2016-08-09 13:37:01 25595 [ERROR] Got error 4028 when reading table './gq_alpha/user_platform'
2016-08-09 13:37:01 25595 [Note] NDB Binlog: Node: 6, down, Subscriber bitmask 00
2016-08-09 13:38:15 25595 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00

Errors on the data nodes report:

Time: Tuesday 9 August 2016 - 13:36:46
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23974) 0x00000002
Program: ndbmtd
Pid:
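
When comparing entries like the one above across several data nodes, it can help to pull the fields into key/value form. The following is a minimal Python sketch, assuming each field sits on a single line in "Key: value" form as shown here. The helper name parse_ndb_error_entry is hypothetical and not part of any NDB tool.

```python
def parse_ndb_error_entry(text):
    """Parse one error-log entry (hypothetical helper) into a dict.

    Assumes every field is a single "Key: value" line; only the first
    colon on a line separates the key from the value.
    """
    entry = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip():
            entry[key.strip()] = value.strip()
    return entry


# Fields copied from the entry in this report (Message and Error data omitted).
sample = """Time: Tuesday 9 August 2016 - 13:36:46
Status: Temporary error, restart node
Error: 7200
Error object: DBLQH (Line: 23974) 0x00000002
Program: ndbmtd
Pid:"""

entry = parse_ndb_error_entry(sample)
print(entry["Error"], entry["Error object"])
```

Splitting on the first colon only keeps values such as the timestamp and "DBLQH (Line: 23974) 0x00000002" intact, and an empty field like Pid comes back as an empty string.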

How to repeat:
Unknown. I have not been able to determine what caused the 2 data nodes to crash.

Suggested fix:
Analyze the logs and determine whether this is an issue that has already been fixed in a newer version.

Upgrading is our goal, but it won't be implemented immediately.

Perhaps the logs reveal enough information to tell whether newer versions have the same bug or whether it has been fixed there.
[9 Aug 2016 14:43] Joel Hanger
Tarball part 1 of 3

Attachment: ndb_error_report_20160809135436-1.tar.bz2 (application/x-bzip, text), 1.48 MiB.

[9 Aug 2016 14:43] Joel Hanger
Tarball part 2 of 3

Attachment: ndb_error_report_20160809135436-2.tar.bz2 (application/x-bzip, text), 1.87 MiB.

[9 Aug 2016 14:44] Joel Hanger
Tarball part 3 of 3

Attachment: ndb_error_report_20160809135436-3.tar.bz2 (application/x-bzip, text), 1.27 MiB.

[8 Sep 2016 6:35] MySQL Verification Team
Hi,

I can't reproduce this with 7.3.14, which is the latest 7.3 release. If you want to use ndbcluster on AWS, you should really look at 7.4, as we adapted the architecture to better suit virtualized environments.

kind regards
Bogdan Kecman