Description:
We have a cluster of 4 ndb nodes, 2 api nodes and 2 mgmt nodes.
A couple of hours after we loaded 50 extra test databases via the api nodes, one of the ndb nodes shut down:
2015-11-06 19:09:52 [MgmtSrvr] WARNING -- Node 1: LCP Frag watchdog : No progress on table 1626, frag 22 for 40 s. 15 bytes remaining.
2015-11-06 19:09:55 [ndbd] INFO -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
2015-11-06 19:09:55 [ndbd] INFO -- DBLQH (Line: 25390) 0x00000002
2015-11-06 19:09:55 [ndbd] INFO -- Error handler shutting down system
2015-11-06 19:09:55 [ndbd] INFO -- Error handler shutdown completed - exiting
2015-11-06 19:10:02 [ndbd] ALERT -- Node 1: Forced node shutdown completed. Caused by error 7200: 'LCP fragment scan watchdog detected a problem. Please report a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
Time: Friday 6 November 2015 - 19:09:55
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem. Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 25390) 0x00000002
Program: ndbmtd
Pid
Management logs:
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: 9 : status: 0 place: 17713
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: -- Node 1 LCP STATE --
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: ParticipatingDIH = 000000000000001e
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: ParticipatingLQH = 000000000000001e
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=4 000000000000001e]
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: m_LCP_COMPLETE_REP_From_Master_Received = 0
2015-11-06 19:09:52 [MgmtSrvr] WARNING -- Node 1: LCP Frag watchdog : No progress on table 1628, frag 10 for 60 s. 15 bytes remaining.
2015-11-06 19:09:52 [MgmtSrvr] WARNING -- Node 1: LCP Frag watchdog : Checkpoint of table 1628 fragment 10 too slow (no progress for > 60 s).
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: BackupRecord 0: BackupId: 405 MasterRef: 6f70001 ClientRef: 0
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: State: 5
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: file 0: type: 3 flags: H'21
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: Backup - dump pool sizes
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: BackupPool: 2 BackupFilePool: 4 TablePool: 20321
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: AttrPool: 2 TriggerPool: 4 FragmentPool: 20321
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: PagePool: 1571
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: == LQH LCP STATE ==
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: clcpCompletedState=1, c_lcpId=405, cnoOfFragsCheckpointed=1824
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: lcpState=3 lastFragmentFlag=0
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: currentFragment.fragPtrI=21833
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: currentFragment.lcpFragOrd.tableId=1628
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: reportEmpty=0
2015-11-06 19:09:52 [MgmtSrvr] INFO -- Node 1: m_EMPTY_LCP_REQ=0000000000000000
2015-11-06 19:09:52 [MgmtSrvr] WARNING -- Node 1: LCP Frag watchdog : No progress on table 1626, frag 22 for 40 s. 15 bytes remaining.
This node was restarted 2 days later, on the Monday. Another node then shut down in the same way on the Tuesday:
Time: Tuesday 10 November 2015 - 06:15:40
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem. Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 25390) 0x00000002
Program: ndbmtd
Then node3 shut down with the same error.
We dropped some of the databases, and it has not happened since.
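To see which tables and fragments the watchdog complained about, the warnings can be pulled out of the management log with a short script. This is only a sketch: the regular expression is modelled on the "No progress" lines pasted above, and the function name find_stalls is made up for illustration.

```python
import re

# Pattern modelled on the "LCP Frag watchdog : No progress ..." lines
# in the management log excerpt above (assumed format, not an official one).
WATCHDOG_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmtSrvr\] WARNING -- "
    r"Node (?P<node>\d+): LCP Frag watchdog : No progress on table (?P<table>\d+), "
    r"frag (?P<frag>\d+) for (?P<secs>\d+) s\."
)


def find_stalls(lines):
    """Return one dict per watchdog 'No progress' warning found in lines."""
    stalls = []
    for line in lines:
        m = WATCHDOG_RE.match(line.strip())
        if m:
            stalls.append({
                "timestamp": m.group("ts"),
                "node": int(m.group("node")),
                "table": int(m.group("table")),
                "frag": int(m.group("frag")),
                "seconds": int(m.group("secs")),
            })
    return stalls
```

Passing the lines of the management log to find_stalls gives a list that can be grouped by table id to check whether the stalls are limited to the newly loaded databases.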
How to repeat:
Using the mysql node, create multiple copies (between 50 and 60, each under a different name) of the same ndb database, which has approximately 100 tables.
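A rough way to script this step is sketched below. It generates the SQL for N databases of about 100 NDB tables each. Our real schema is not included here, so the two-column table definition, the ndb_test prefix and the function name make_clone_sql are placeholders, and a trivial schema like this may not trigger the problem the way the real one did.

```python
def make_clone_sql(n_databases=50, n_tables=100, prefix="ndb_test"):
    """Build SQL that creates n_databases databases with n_tables NDB tables each.

    The table definition is a placeholder, not the schema from the report.
    """
    statements = []
    for d in range(1, n_databases + 1):
        db = f"{prefix}_{d:03d}"
        statements.append(f"CREATE DATABASE IF NOT EXISTS `{db}`;")
        for t in range(1, n_tables + 1):
            statements.append(
                f"CREATE TABLE `{db}`.`t{t:03d}` "
                "(id INT NOT NULL PRIMARY KEY, payload VARCHAR(255)) "
                "ENGINE=NDBCLUSTER;"
            )
    return statements
```

The returned statements can be joined with newlines and fed to the mysql client on one of the api nodes. With the defaults this creates 5000 tables, roughly the scale at which the shutdowns started for us.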