MySQL Bugs: #69205: fresh 7.2.12 LCP Watchdog crash datanode error 7200 DBLQH 23869

Bug #69205	fresh 7.2.12 LCP Watchdog crash datanode error 7200 DBLQH 23869
Submitted:	12 May 2013 18:58	Modified:	19 May 2016 12:00
Reporter:	Carl Krumins
Status:	Can't repeat
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.2.12	OS:	Linux (Oracle Linux 6 binaries 64bit)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	crash, DBLQH, error 7200, LCP, LCP frag watchdog

Description:
Fresh 7.2.12 cluster on Oracle Linux, using the Oracle Linux 6 64-bit binaries.

Time: Monday 13 May 2013 - 09:25:27
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23869) 0x00000000
Program: ndbmtd

Fresh cluster. Data was loaded with ndb_restore.

Extract from the *_out* files. The original files are in the attached .zip.

LCP Frag watchdog : No progress on table 336, frag 14 for 30 s.  1988560 rows completed
LCP Frag watchdog : No progress on table 336, frag 13 for 40 s.  1930000 rows completed
LCP Frag watchdog : No progress on table 336, frag 14 for 40 s.  1988560 rows completed
LCP Frag watchdog : No progress on table 336, frag 13 for 50 s.  1930000 rows completed
LCP Frag watchdog : No progress on table 336, frag 14 for 50 s.  1988560 rows completed
LCP Frag watchdog : No progress on table 336, frag 13 for 60 s.  1930000 rows completed
LCP Frag watchdog : Checkpoint of table 336 fragment 13 too slow (no progress for > 60 s).
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_reset_delay_used: 100  m_reset_disk_speed_time: 195693860
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_monitor_words_written : 0, duration : 320 millis, rate : 0 bytes/s : (0 pct of config)
BackupRecord 0:  BackupId: 139  MasterRef: ef70002  ClientRef: 0
 State: 4
 noOfByte: 1012003044  noOfRecords: 10236240
 noOfLogBytes: 0  noOfLogRecords: 0
 errorCode: 0
 file 0:  type: 3  flags: H'19  tableId: 336  fragmentId: 0
ready: TRUE  eof: FALSE
m_curr_disk_write_speed: 262144  m_words_written_this_period: 0  m_overflow_disk_write: 0
m_reset_delay_used: 100  m_reset_disk_speed_time: 195693853
m_monitor_words_written : 0, duration : 427 millis, rate : 0 bytes/s : (0 pct of config)

m_reset_delay_used: 100  m_reset_disk_speed_time: 195693857

m_reset_delay_used: 100  m_reset_disk_speed_time: 195693856
m_monitor_words_written : 0, duration : 124 millis, rate : 0 bytes/s : (0 pct of config)
BackupRecord 0:  BackupId: 139  MasterRef: 8f70002  ClientRef: 0

.. etc ...
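
To see at a glance which fragments stalled and for how long, the watchdog lines above can be summarized with a short script. This is only a sketch: the regular expression is written against the line format shown in this extract, and the names `WATCHDOG_RE` and `summarize_stalls` are made up for illustration, not part of any NDB tool.

```python
import re

# Matches lines of the form seen in the extract, for example:
# "LCP Frag watchdog : No progress on table 336, frag 14 for 30 s.  1988560 rows completed"
WATCHDOG_RE = re.compile(
    r"LCP Frag watchdog : No progress on table (\d+), frag (\d+) "
    r"for (\d+) s\.\s+(\d+) rows completed"
)


def summarize_stalls(lines):
    """Return {(table, frag): (longest_stall_seconds, rows_completed)}.

    Hypothetical helper: keeps, per fragment, the longest "no progress"
    interval reported and the row count logged with it.
    """
    stalls = {}
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m is None:
            continue  # not a "No progress" watchdog line
        table, frag, secs, rows = (int(g) for g in m.groups())
        prev = stalls.get((table, frag))
        if prev is None or secs > prev[0]:
            stalls[(table, frag)] = (secs, rows)
    return stalls
```

Feeding it the lines of an ndb_*_out.log file (for example `summarize_stalls(open(path))`, where `path` is whichever log is being inspected) gives one entry per stalled fragment. A row count that stays constant while the stall time grows suggests the scan stopped completely rather than merely slowed down.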
	

How to repeat:
as above.

Suggested fix:
not sure

Same thing has been happening to me every few days.

Attaching the 3 sets of trace files for the 3 crashes shown below...

Time: Friday 17 May 2013 - 12:14:22
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem. Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23869) 0x00000002
Program: ndbmtd
Pid: 59

Time: Monday 20 May 2013 - 20:02:16
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem. Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23869) 0x00000002
Program: ndbmtd
Pid: 21

Time: Thursday 23 May 2013 - 14:57:43
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem. Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23869) 0x00000002
Program: ndbmtd
Pid:

3 sets of trace files from crashes

Attachment: archive.zip (application/zip, text), 902.31 KiB.

ndb_*out.log file

Attachment: ndb_11_out.zip (application/zip, text), 247.27 KiB.

Still an ongoing issue with MySQL Cluster 7.3.5...

2014-05-11 16:58:35 [ndbd] INFO     -- Watchdog: User time: 1121296  System time: 795741
2014-05-11 16:58:35 [ndbd] WARNING  -- Ndb kernel thread 5 is stuck in: Job Handling elapsed=3403
2014-05-11 16:58:35 [ndbd] INFO     -- Watchdog: User time: 1121296  System time: 795741
2014-05-11 16:58:35 [ndbd] WARNING  -- Ndb kernel thread 5 is stuck in: Job Handling elapsed=3503
2014-05-11 16:58:36 [ndbd] INFO     -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
2014-05-11 16:58:36 [ndbd] INFO     -- DBLQH (Line: 23974) 0x00000002
2014-05-11 16:58:36 [ndbd] INFO     -- Error handler shutting down system
2014-05-11 16:58:36 [ndbd] INFO     -- Watchdog: User time: 1121305  System time: 795743
2014-05-11 16:58:36 [ndbd] INFO     -- Error handler shutdown completed - exiting
2014-05-11 16:58:47 [ndbd] ALERT    -- Node 11: Forced node shutdown completed. Caused by error 7200: 'LCP fragment scan watchdog detected a problem.  Please report a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
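
The "stuck in" warnings that precede the shutdown in this log can be pulled out in the same way, to check whether a kernel thread was already blocked before the LCP watchdog fired. A minimal sketch, assuming the warning format shown above; `STUCK_RE` and `longest_stuck` are illustrative names, and the `elapsed` value is reported exactly as logged, without assuming a unit.

```python
import re

# Matches warnings such as:
# "... WARNING  -- Ndb kernel thread 5 is stuck in: Job Handling elapsed=3403"
STUCK_RE = re.compile(r"Ndb kernel thread (\d+) is stuck in: (.+?) elapsed=(\d+)")


def longest_stuck(lines):
    """Return {(thread_no, state): largest elapsed value logged}.

    Hypothetical helper for scanning a data node log.
    """
    worst = {}
    for line in lines:
        m = STUCK_RE.search(line)
        if m is None:
            continue  # not a stuck-thread warning
        key = (int(m.group(1)), m.group(2))
        elapsed = int(m.group(3))
        worst[key] = max(elapsed, worst.get(key, 0))
    return worst
```

Run over the excerpt above, this reports a single entry for thread 5 in the "Job Handling" state.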

This is usually caused by an overloaded system or an improperly sized cluster. The 7.4 release redesigned the way this works, so an upgrade is advised.

CPU and I/O load measurements taken during the operations that trigger these errors are needed to determine the behavior properly, but an upgrade to 7.4.10 or later should solve the problem.
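
One possible way to collect the CPU side of such measurements on a Linux data node host is to sample /proc/stat while the problematic operation runs. The sketch below is an assumption about how this could be done, not a procedure from the MySQL team; it relies on the standard field order of the aggregate `cpu` line (user, nice, system, idle, iowait, irq, softirq, steal), and the function names are hypothetical. Tools such as iostat or sar would serve the same purpose.

```python
import time


def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of ints."""
    for line in stat_text.splitlines():
        if line.startswith("cpu "):
            return [int(v) for v in line.split()[1:]]
    raise ValueError("no aggregate cpu line found")


def cpu_usage(before, after):
    """Return (busy_fraction, iowait_fraction) between two samples.

    Only the first eight fields are summed, because the guest fields
    that may follow are already included in user and nice.
    """
    deltas = [a - b for a, b in zip(after, before)][:8]
    total = sum(deltas)
    if total == 0:
        return 0.0, 0.0
    idle = deltas[3]
    iowait = deltas[4] if len(deltas) > 4 else 0
    busy = total - idle - iowait
    return busy / total, iowait / total


def sample_loop(interval_s=1.0, count=60, path="/proc/stat"):
    """Print busy and iowait percentages 'count' times, 'interval_s' apart."""
    with open(path) as f:
        prev = read_cpu_times(f.read())
    for _ in range(count):
        time.sleep(interval_s)
        with open(path) as f:
            cur = read_cpu_times(f.read())
        busy, iowait = cpu_usage(prev, cur)
        stamp = time.strftime("%H:%M:%S")
        print(f"{stamp} busy={busy:.1%} iowait={iowait:.1%}")
        prev = cur
```

Calling `sample_loop()` during an ndb_restore run or around the time of a local checkpoint, and attaching the output alongside the ndb_*_out.log files, would show whether the node was CPU-bound or waiting on disk when the watchdog fired.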