Bug #79228 Forced node shutdown, LCP fragment scan watchdog detected a problem, error 7200
Submitted: 11 Nov 2015 13:53 Modified: 24 Jan 2020 14:29
Reporter: The Bunker The Bunker Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: NDB API Severity:S1 (Critical)
Version:ndb-7.4.8 OS:Debian
Assigned to: MySQL Verification Team CPU Architecture:Any

[11 Nov 2015 13:53] The Bunker The Bunker
Description:
We have a cluster of 4 ndb nodes, 2 api nodes and 2 mgmt nodes.

After loading in 50 extra test databases via the api nodes, a couple of hours later one of the ndb nodes shut down:

2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : No progress on table 1626, frag 22 for 40 s.  15 bytes remaining.

2015-11-06 19:09:55 [ndbd] INFO     -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
2015-11-06 19:09:55 [ndbd] INFO     -- DBLQH (Line: 25390) 0x00000002
2015-11-06 19:09:55 [ndbd] INFO     -- Error handler shutting down system
2015-11-06 19:09:55 [ndbd] INFO     -- Error handler shutdown completed - exiting
2015-11-06 19:10:02 [ndbd] ALERT    -- Node 1: Forced node shutdown completed. Caused by error 7200: 'LCP fragment scan watchdog detected a problem.  Please report a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Friday 6 November 2015 - 19:09:55
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 25390) 0x00000002
Program: ndbmtd
Pid

Management logs:
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: 9 : status: 0 place: 17713
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: -- Node 1 LCP STATE --
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: ParticipatingDIH = 000000000000001e
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: ParticipatingLQH = 000000000000001e
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: m_LCP_COMPLETE_REP_From_Master_Received = 0

2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : No progress on table 1628, fr
ag 10 for 60 s.  15 bytes remaining.
2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : Checkpoint of table 1628 fragment 10 too slow (no progress for > 60 s).
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: BackupRecord 0: BackupId: 405 MasterRef: 6f70001 ClientRef: 0
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  State: 5
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  file 0: type: 3 flags: H'21
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: Backup - dump pool sizes
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: BackupPool: 2 BackupFilePool: 4 TablePool: 20321
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: AttrPool: 2 TriggerPool: 4 FragmentPool: 20321
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: PagePool: 1571
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: == LQH LCP STATE ==
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  clcpCompletedState=1, c_lcpId=405, cnoOfFragsCheckpointed=1824
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  lcpState=3 lastFragmentFlag=0
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: currentFragment.fragPtrI=21833
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1: currentFragment.lcpFragOrd.tableId=1628
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  reportEmpty=0
2015-11-06 19:09:52 [MgmtSrvr] INFO     -- Node 1:  m_EMPTY_LCP_REQ=0000000000000000
2015-11-06 19:09:52 [MgmtSrvr] WARNING  -- Node 1: LCP Frag watchdog : No progress on table 1626, frag 22 for 40 s.  15 bytes remaining.

This node was restarted 2 days later on the Monday, another shut down in the same way on Tuesday:

Time: Tuesday 10 November 2015 - 06:15:40
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 25390) 0x00000002
Program: ndbmtd

Then node3 shut down with the same error.

We dropped some of the databases, and it has not happened since.

How to repeat:
create multiple copies (between 50 and 60, using different names) of the same ndb database with approximately 100 tables using the mysql node.
[11 Nov 2015 15:21] The Bunker The Bunker
ndb_error_report_20151111150920.tar.bz2 uploaded to sftp.oracle.com
[16 Nov 2015 17:49] The Bunker The Bunker
Uploaded sample sql structure that was used, loaded in as 50 unique database via API node, ie:

mysql> create database test1;
mysql> use test1;
mysql> source obsStruct.sql;

mysql> create database test2;
mysql> use test2;
mysql> source obsStruct.sql;
[24 Jan 2020 14:29] MySQL Verification Team
Hi,

I cannot reproduce this on any of the modern MySQL Cluster versions, but looking at the logs this looks like a hardware issue.

kind regards
Bogdan