MySQL Bugs: #79228: Forced node shutdown, LCP fragment scan watchdog detected a problem, error 7200

Bug #79228	Forced node shutdown, LCP fragment scan watchdog detected a problem, error 7200
Submitted:	11 Nov 2015 13:53	Modified:	24 Jan 2020 14:29
Reporter:	The Bunker The Bunker	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: NDB API	Severity:	S1 (Critical)
Version:	ndb-7.4.8	OS:	Debian
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
We have a cluster of 4 ndb nodes, 2 api nodes and 2 mgmt nodes.

After loading in 50 extra test databases via the api nodes, a couple of hours later one of the ndb nodes shut down:

2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : No progress on table 1626, frag 22 for 40 s.  15 bytes remaining.

2015-11-06 19:09:55 [ndbd] INFO     -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
2015-11-06 19:09:55 [ndbd] INFO     -- DBLQH (Line: 25390) 0x00000002
2015-11-06 19:09:55 [ndbd] INFO     -- Error handler shutting down system
2015-11-06 19:09:55 [ndbd] INFO     -- Error handler shutdown completed - exiting
2015-11-06 19:10:02 [ndbd] ALERT    -- Node 1: Forced node shutdown completed. Caused by error 7200: 'LCP fragment scan watchdog detected a problem.  Please report a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Friday 6 November 2015 - 19:09:55
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 25390) 0x00000002
Program: ndbmtd
Pid

Management logs:
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: 9 : status: 0 place: 17713
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: -- Node 1 LCP STATE --
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: ParticipatingDIH = 000000000000001e
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: ParticipatingLQH = 000000000000001e
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LCP_COMPLETE_REP_From_Master_Received = 0

2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : No progress on table 1628, fr
ag 10 for 60 s.  15 bytes remaining.
2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : Checkpoint of table 1628 fragment 10 too slow (no progress for > 60 s).
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: BackupRecord 0: BackupId: 405 MasterRef: 6f70001 ClientRef: 0
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  State: 5
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  file 0: type: 3 flags: H'21
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: Backup - dump pool sizes
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: BackupPool: 2 BackupFilePool: 4 TablePool: 20321
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: AttrPool: 2 TriggerPool: 4 FragmentPool: 20321
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: PagePool: 1571
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: == LQH LCP STATE ==
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  clcpCompletedState=1, c_lcpId=405, cnoOfFragsCheckpointed=1824
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  lcpState=3 lastFragmentFlag=0
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: currentFragment.fragPtrI=21833
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: currentFragment.lcpFragOrd.tableId=1628
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  reportEmpty=0
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  m_EMPTY_LCP_REQ=0000000000000000
2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : No progress on table 1626, frag 22 for 40 s.  15 bytes remaining.

This node was restarted 2 days later on the Monday, another shut down in the same way on Tuesday:

Time: Tuesday 10 November 2015 - 06:15:40
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 25390) 0x00000002
Program: ndbmtd

Then node3 shut down with the same error.

We dropped some of the databases, and it has not happened since.

How to repeat:
create multiple copies (between 50 and 60, using different names) of the same ndb database with approximately 100 tables using the mysql node.

ndb_error_report_20151111150920.tar.bz2 uploaded to sftp.oracle.com

Uploaded sample sql structure that was used, loaded in as 50 unique database via API node, ie:

mysql> create database test1;
mysql> use test1;
mysql> source obsStruct.sql;

mysql> create database test2;
mysql> use test2;
mysql> source obsStruct.sql;

Hi,

I cannot reproduce this on any of the modern MySQL Cluster versions, but looking at the logs this looks like a hardware issue.

kind regards
Bogdan