Description:
Using 7.2.8 Linux Generic x64 Binaries
As background, have previously reported bugID 64788 with was marked fixed in 7.2.6 but recently a similiar DBDIH error again in 7.2.8 when restarting a node with --initial copying data from existing running nodes. This DBDIH error appears to dissapear after a 'few' restarts.
New main reported issue is however with LCP fragment scan watchdog restarting node after 60 seconds. This is straight away after upgrading data node to 7.2.8 (was running version without LCP watchdog prior - was 7.2.5 + patch from bug64788)
Have seem many multiple:
2012-10-07 09:40:47 [MgmtSrvr] WARNING -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 20 s. 184467440737095
warnings where the watchdog reported no progress for 20 seconds, however this warning cleared after the 20 seconds and didn't continue on to 30s/40s etc,
However one time the watchdog made it to 60seconds and closed down the data node.
Time: Sunday 7 October 2012 - 10:50:54
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem. Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 23661) 0x00000004
Program: ndbmtd
Pid:
Time: Sunday 7 October 2012 - 10:50:54
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 6 received; Aborted
Error object: /pb2/build/sb_0-6682536-1345655787.21/mysql-cluster-gpl-7.2.8/storage/ndb/src/kernel/ndbd.cpp
Program: ndbmtd
Pid: 16593 thr: 1
Version: mysql-5.5.27 ndb-7.2.8
2012-10-07 10:42:47 [MgmtSrvr] WARNING -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 40 s. 184467440737095
2012-10-07 10:42:56 [MgmtSrvr] WARNING -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 50 s. 184467440737095
2012-10-07 10:43:06 [MgmtSrvr] WARNING -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 60 s. 184467440737095
2012-10-07 10:43:06 [MgmtSrvr] WARNING -- Node 4: LCP Frag watchdog : Checkpoint of table 4294967295 fragment 4294967295 too slow (no progress fo
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: == LQH LCP STATE ==
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: clcpCompletedState=1, c_lcpId=5067, cnoOfFragsCheckpointed=142
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: lcpState=3 lastFragmentFlag=0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: currentFragment.fragPtrI=137
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: currentFragment.lcpFragOrd.tableId=175
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: lcpQueued=0 reportEmpty=0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: m_EMPTY_LCP_REQ=0000000000000000
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: -- Node 4 LCP STATE --
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: lcpStatus = 2 (update place = 11173)
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: lcpStart = 0 lcpStopGcp = 6787850 keepGci = 0 oldestRestorable = 0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: immediateLcpStart = 0 masterLcpNodeId = 3
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 0 : status: 6 place: 10999
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 1 : status: 5 place: 811
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 2 : status: 0 place: 14102
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 3 : status: 10 place: 13911
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 4 : status: 9 place: 13530
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 5 : status: 2 place: 11173
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 6 : status: 6 place: 10999
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 7 : status: 5 place: 811
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 8 : status: 0 place: 14102
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: 9 : status: 10 place: 13911
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: -- Node 4 LCP STATE --
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: ParticipatingDIH = 000000000000003c
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: ParticipatingLQH = 0000000000000038
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_DIH = [SignalCounter: m_count=0 0000000000000000]
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_Counter_LQH = [SignalCounter: m_count=3 0000000000000038]
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: m_LAST_LCP_FRAG_ORD = [SignalCounter: m_count=0 0000000000000000]
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: m_LCP_COMPLETE_REP_From_Master_Received = 0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: BackupRecord 0: BackupId: 5067 MasterRef: 6f70004 ClientRef: 0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: State: 5
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: file 0: type: 3 flags: H'0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: Backup - dump pool sizes
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: BackupPool: 2 BackupFilePool: 4 TablePool: 14671
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: AttrPool: 2 TriggerPool: 4 FragmentPool: 14671
.....
more in logs - too long for description
.....
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: file 0: type: 3 flags: H'0
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: Backup - dump pool sizes
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: BackupPool: 2 BackupFilePool: 4 TablePool: 14671
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: AttrPool: 2 TriggerPool: 4 FragmentPool: 14671
2012-10-07 10:43:06 [MgmtSrvr] INFO -- Node 4: PagePool: 1579
...
2012-10-07 10:52:45 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed, restarting. Initiated by signal 6. Caused by error 7200: 'LCP fragment scan watchdog detected a problem. Please r
eport a bug.(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
....
Also separately on the same node have seen the below errors on 7.2.8 when restarting a node with --initial and copying the data from an existing data node. (looks like similar issue to one part of previously reported bug 64788 however other parts of the previous bug report have not appeared since)
======
Time: Sunday 7 October 2012 - 15:10:38
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbdihMain.cpp
Error object: DBDIH (Line: 14744) 0x00000000
Program: ndbmtd
Pid: 23638 thr: 0
Version: mysql-5.5.27 ndb-7.2.8
Ans also on 7.2.8
========
Time: Monday 8 October 2012 - 05:21:48
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbtcMain.cpp
Error object: DBTC (Line: 8192) 0x00000000
Program: ndbmtd
Pid: 15091 thr: 10
Version: mysql-5.5.27 ndb-7.2.8
Let me know if you require any more information.
Uploading ndb_error_report to FTP
How to repeat:
LCP fragment scan watchdog seems like it will fail any time again soon with semi-constant messages:
2012-10-07 09:40:47 [MgmtSrvr] WARNING -- Node 4: LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 20 s. 184467440737095
Suggested fix:
Not sure.