MySQL Bugs: #67126: LCP fragment scan watchdog crash in DBLQH in 7.2.8 after 60 seconds

Bug #67126	LCP fragment scan watchdog crash in DBLQH in 7.2.8 after 60 seconds
Submitted:	7 Oct 2012 23:58	Modified:	3 May 2013 6:36
Reporter:	Carl Krumins	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	7.2.8	OS:	Linux (Linux Generic x68 binaries)
Assigned to:		CPU Architecture:	Any
Tags:	crash, DBLQH, LCP watchdog, node restart

Description:
Using 7.2.8 Linux Generic x64 Binaries
As background, have previously reported bugID 64788 with was marked fixed in 7.2.6 but recently a similiar DBDIH error again in 7.2.8 when restarting a node with --initial copying data from existing running nodes. This DBDIH error appears to dissapear after a 'few' restarts.

New main reported issue is however with LCP fragment scan watchdog restarting node after 60 seconds. This is straight away after upgrading data node to 7.2.8 (was running version without LCP watchdog prior - was 7.2.5 + patch from bug64788)

Have seem many multiple: 
2012-10-07 09:40:47 [MgmtSrvr] WARNING  -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 20 s.  184467440737095
warnings where the watchdog reported no progress for 20 seconds, however this warning cleared after the 20 seconds and didn't continue on to 30s/40s etc,
However one time the watchdog made it to 60seconds and closed down the data node.

Time: Sunday 7 October 2012 - 10:50:54
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23661) 0x00000004
Program: ndbmtd
Pid:
Time: Sunday 7 October 2012 - 10:50:54
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: /pb2/build/sb_0-6682536-1345655787.21/mysql-cluster-gpl-7.2.8/storage/ndb/src/kernel/ndbd.cpp
Program: ndbmtd
Pid: 16593 thr: 1
Version: mysql-5.5.27 ndb-7.2.8

2012-10-07 10:42:47 [MgmtSrvr] WARNING  -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 40 s.  184467440737095
2012-10-07 10:42:56 [MgmtSrvr] WARNING  -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 50 s.  184467440737095
2012-10-07 10:43:06 [MgmtSrvr] WARNING  -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 60 s.  184467440737095
2012-10-07 10:43:06 [MgmtSrvr] WARNING  -- Node 4: LCP Frag watchdog : Checkpoint of table 4294967295 fragment 4294967295 too slow (no progress fo
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: == LQH LCP STATE ==
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  clcpCompletedState=1, c_lcpId=5067, cnoOfFragsCheckpointed=142
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  lcpState=3 lastFragmentFlag=0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: currentFragment.fragPtrI=137
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: currentFragment.lcpFragOrd.tableId=175
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  lcpQueued=0 reportEmpty=0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  m_EMPTY_LCP_REQ=0000000000000000
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: -- Node 4 LCP STATE --
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: lcpStatus = 2 (update place = 11173)
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: lcpStart = 0 lcpStopGcp = 6787850 keepGci = 0 oldestRestorable = 0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: immediateLcpStart = 0 masterLcpNodeId = 3
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 0 : status: 6 place: 10999
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 1 : status: 5 place: 811
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 2 : status: 0 place: 14102
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 3 : status: 10 place: 13911
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 4 : status: 9 place: 13530
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 5 : status: 2 place: 11173
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 6 : status: 6 place: 10999
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 7 : status: 5 place: 811
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 8 : status: 0 place: 14102
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: 9 : status: 10 place: 13911
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: -- Node 4 LCP STATE --
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: ParticipatingDIH = 000000000000003c
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: ParticipatingLQH = 0000000000000038
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=3 0000000000000038]
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: BackupRecord 0: BackupId: 5067 MasterRef: 6f70004 ClientRef: 0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  State: 5
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  file 0: type: 3 flags: H'0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: Backup - dump pool sizes
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: BackupPool: 2 BackupFilePool: 4 TablePool: 14671
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: AttrPool: 2 TriggerPool: 4 FragmentPool: 14671
.....
more in logs - too long for description
.....
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4:  file 0: type: 3 flags: H'0
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: Backup - dump pool sizes
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: BackupPool: 2 BackupFilePool: 4 TablePool: 14671
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: AttrPool: 2 TriggerPool: 4 FragmentPool: 14671
2012-10-07 10:43:06 [MgmtSrvr] INFO     -- Node 4: PagePool: 1579
...
2012-10-07 10:52:45 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed, restarting. Initiated by signal 6. Caused by error 7200: 'LCP fragment scan watchdog detected a problem.  Please r
eport a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

....

Also separately on the same node have seen the below errors on 7.2.8 when restarting a node with --initial and copying the data from an existing data node. (looks like similar issue to one part of previously reported bug 64788 however other parts of the previous bug report have not appeared since)
======
Time: Sunday 7 October 2012 - 15:10:38
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbdihMain.cpp
Error object: DBDIH (Line: 14744) 0x00000000
Program: ndbmtd
Pid: 23638 thr: 0
Version: mysql-5.5.27 ndb-7.2.8

Ans also on 7.2.8
========
Time: Monday 8 October 2012 - 05:21:48
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtcMain.cpp
Error object: DBTC (Line: 8192) 0x00000000
Program: ndbmtd
Pid: 15091 thr: 10
Version: mysql-5.5.27 ndb-7.2.8

Let me know if you require any more information.

Uploading ndb_error_report to FTP

How to repeat:
LCP fragment scan watchdog seems like it will fail any time again soon with semi-constant messages:
2012-10-07 09:40:47 [MgmtSrvr] WARNING  -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 20 s.  184467440737095

Suggested fix:
Not sure.

Have uploaded ndb_error_reported file to FTP filename:
ndb_error_report-67126.tar.bz2

This issue has been fixed and documented as follows in the 7.0.37, 7.1.26, 7.2.10, and 5.7.1 changelogs:

        The recently added LCP fragment scan watchdog occasionally
        reported problems with LCP fragment scans having very high table
        id, fragment id, and row count values.

        This was due to the watchdog not accounting for the time spent
        draining the backup buffer used to buffer rows before writing to
        the fragment checkpoint file.

        Now, in the final stage of an LCP fragment scan, the watchdog
        switches from monitoring rows scanned to monitoring the buffer
        size in bytes. The buffer size should decrease as data is
        written to the file, after which the file should be promptly
        closed.