MySQL Bugs: #90175: Check BackupFormat::NDB_LCP_CTL_FILE_SIZE >= real_bytes

Bug #90175	Check BackupFormat::NDB_LCP_CTL_FILE_SIZE >= real_bytes_read failed
Submitted:	22 Mar 2018 11:22	Modified:	8 Oct 2019 9:43
Reporter:	Daniël van Eeden (OCA)	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.6.4	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
After a cluster failure caused by 'error 7200: 'LCP fragment scan watchdog detected a problem' the cluster fails to start.

Time: Thursday 22 March 2018 - 10:03:50
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Backup.cpp
Error object: BACKUP (Line: 11288) 0x00000002 Check BackupFormat::NDB_LCP_CTL_FILE_SIZE >= real_bytes_read failed
Program: ndbmtd
Pid: 18463 thr: 4
Version: mysql-5.7.20 ndb-7.6.4
Trace file name: ndb_1_trace.log.10_t4
Trace file path: /var/lib/mysql-cluster/ndb_1_trace.log.10 [t1..t22]
***EOM**

This could be related to some ctl files having a 0 size.

$ sudo find /var/lib/mysql-cluster -name "*ctl" ! -size 4096c -ls
2013615836    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F528.ctl
2013615754    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F168.ctl
2013615781    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F336.ctl
2013615782    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F360.ctl
2013615785    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F96.ctl
2013615760    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F192.ctl
2013615787    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F216.ctl
2013615843    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F48.ctl
2013615792    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F432.ctl
2013615853    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F480.ctl
2013615799    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F408.ctl
2013615808    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T35F264.ctl
2013615763    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F240.ctl
2013615806    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F456.ctl
2013615862    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F72.ctl
2013615766    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F24.ctl
2013372928    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F504.ctl
2013615863    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F384.ctl
2013615772    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F312.ctl
2013615752    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F120.ctl
2013615775    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F288.ctl
2013615812    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T35F552.ctl
2013615753    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T60F144.ctl
$ sudo grep T36F528 /var/lib/mysql-cluster/ndb_1_out.log 
2018-03-13 08:44:15 [ndbd] INFO     -- LDM instance 11: Restored T36F528 LCP 3609 rows, 5081 row operations, 60 millis, 84683 row operations/sec)
2018-03-22 10:02:41 [ndbd] INFO     -- LDM instance 11: Restored T36F528 LCP 3649 rows, 3649 row operations, 37 millis, 98621 row operations/sec)
$ sudo find /var/lib/mysql-cluster -name "*T36F528*" -ls
2013615836    0 -rw-r--r--   1 root     root            0 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/0/T36F528.ctl
2149729907    4 -rw-r--r--   1 root     root         4096 Mar 21 22:45 /var/lib/mysql-cluster/ndb_1_fs/LCP/1/T36F528.ctl
1744831188  160 -rw-r--r--   1 root     root       163840 Mar 12 22:43 /var/lib/mysql-cluster/ndb_1_fs/LCP/8/T36F528.Data
1879245956  512 -rw-r--r--   1 root     root       524288 Mar 12 22:43 /var/lib/mysql-cluster/ndb_1_fs/LCP/9/T36F528.Data
2013269565  224 -rw-r--r--   1 root     root       229376 Mar 12 22:43 /var/lib/mysql-cluster/ndb_1_fs/LCP/10/T36F528.Data
2148918324   32 -rw-r--r--   1 root     root        32768 Mar 12 22:43 /var/lib/mysql-cluster/ndb_1_fs/LCP/11/T36F528.Data
2281895676  672 -rw-r--r--   1 root     root       688128 Mar 12 22:43 /var/lib/mysql-cluster/ndb_1_fs/LCP/12/T36F528.Data
2415919765  352 -rw-r--r--   1 root     root       360448 Mar 13 08:44 /var/lib/mysql-cluster/ndb_1_fs/LCP/13/T36F528.Data
2550327832   32 -rw-r--r--   1 root     root        32768 Mar 13 08:44 /var/lib/mysql-cluster/ndb_1_fs/LCP/14/T36F528.Data
2685071372  512 -rw-r--r--   1 root     root       524288 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/15/T36F528.Data
2820791811   32 -rw-r--r--   1 root     root        32768 Mar 22 10:03 /var/lib/mysql-cluster/ndb_1_fs/LCP/16/T36F528.Data

About the initial cluster failure:

Time: Wednesday 21 March 2018 - 22:46:26
Status: Temporary error, restart node
Message: LCP fragment scan watchdog detected a problem.  Please report a bug. (Internal error, programming error or missing error message, please report a bug)
Error: 7200
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutting down node due to lack of LCP fragment scan progress
Error object: DBLQH (Line: 28711) 0x00000002
Program: ndbmtd
Pid: 37924 thr: 8
Version: mysql-5.7.20 ndb-7.6.4
Trace file name: ndb_13_trace.log.9_t8
Trace file path: /var/lib/mysql-cluster/ndb_13_trace.log.9 [t1..t22]
***EOM***

43 out of 48 data nodes started to complain about "LCP Frag watchdog : No progress on table 4294967295, frag 4294967295 for 20 s." at the same time

How to repeat:
See description

From the logs (multiple machines):
LCP Frag watchdog : Checkpoint of table 4294967295 fragment 4294967295 too slow (no progress for > 60 s, state: LCP_WAIT_END_LCP).
m_curr_disk_write_speed: 170kb  m_words_written_this_period: 0 kwords  m_overflow_disk_write: 0 kb
m_reset_delay_used: 90  time since last RESET_DISK_SPEED: 63 millis
m_monitor_words_written : 0, duration : 368 millis, rate : 0 bytes/s : (0 pct of config)
BackupRecord 0:  BackupId: 85989  MasterRef: 12f70014  ClientRef: 0
 State: 2
 noOfByte: 1044720  noOfRecords: 4669
 noOfLogBytes: 0  noOfLogRecords: 0
 errorCode: 0
 file 19:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 18:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 17:  type: 3  flags: H'0  tableId: 60  fragmentId: 0
 file 16:  type: 3  flags: H'0  tableId: 149  fragmentId: 0
 file 15:  type: 3  flags: H'0  tableId: 226  fragmentId: 0
 file 14:  type: 3  flags: H'0  tableId: 226  fragmentId: 0
 file 13:  type: 3  flags: H'0  tableId: 18  fragmentId: 0
 file 12:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 11:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 10:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 9:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 8:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 7:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 6:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 5:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 4:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 3:  type: 3  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 2:  type: 1  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 1:  type: 1  flags: H'0  tableId: 4294967040  fragmentId: 0
 file 0:  type: 1  flags: H'0  tableId: 4294967040  fragmentId: 0

The hardware: SATA SSD 1.6T on a raid controller.

First issue is fixed in 7.6.5 - for the second issue it would be great to see the configuration. It looks like something that can be solved by reconfiguration. Likely a GCP stop. The LCP has stopped around finishing the LCP and it only waits to delete the last set of LCP files. To delete a file the GCP must reach
the proper GCI.

The only other option is that it takes forever to delete files which I presume is not the problem.

The config with HostName filtered out:

===========================================

[tcp default]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[ndbd default]
NoOfReplicas= 2
DataDir= /var/lib/mysql-cluster
LockPagesInMainMemory=1
DataMemory=80G
ODirect=1
NoOfFragmentLogFiles=300
NoOfFragmentLogParts=12
MaxNoOfConcurrentOperations=200000
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
TransactionDeadlockDetectionTimeout=10000
RedoBuffer=64M
CompressedBackup=1
MaxNoOfTables=1024
MaxNoOfAttributes=10000
MaxNoOfOrderedIndexes=768
MaxNoOfExecutionThreads=24
MaxNoOfFiredTriggers=8000

[ndb_mgmd]
NodeId= 49

[ndbd]
NodeId= 1
[ndbd]
NodeId= 2
[ndbd]
NodeId= 3
[ndbd]
NodeId= 4
[ndbd]
NodeId= 5
[ndbd]
NodeId= 6
[ndbd]
NodeId= 7
[ndbd]
NodeId= 8
[ndbd]
NodeId= 9
[ndbd]
NodeId= 10
[ndbd]
NodeId= 11
[ndbd]
NodeId= 12
[ndbd]
NodeId= 13
[ndbd]
NodeId= 14
[ndbd]
NodeId= 15
[ndbd]
NodeId= 16
[ndbd]
NodeId= 17
[ndbd]
NodeId= 18
[ndbd]
NodeId= 19
[ndbd]
NodeId= 20
[ndbd]
NodeId= 21
[ndbd]
NodeId= 22
[ndbd]
NodeId= 23
[ndbd]
NodeId= 24
[ndbd]
NodeId= 25
[ndbd]
NodeId= 26
[ndbd]
NodeId= 27
[ndbd]
NodeId= 28
[ndbd]
NodeId= 29
[ndbd]
NodeId= 30
[ndbd]
NodeId= 31
[ndbd]
NodeId= 32
[ndbd]
NodeId= 33
[ndbd]
NodeId= 34
[ndbd]
NodeId= 35
[ndbd]
NodeId= 36
[ndbd]
NodeId= 37
[ndbd]
NodeId= 38
[ndbd]
NodeId= 39
[ndbd]
NodeId= 40
[ndbd]
NodeId= 41
[ndbd]
NodeId= 42
[ndbd]
NodeId= 43
[ndbd]
NodeId= 44
[ndbd]
NodeId= 45
[ndbd]
NodeId= 46
[ndbd]
NodeId= 47
[ndbd]
NodeId= 48

[mysqld]
NodeId= 51
[mysqld]
NodeId= 52
[mysqld]
NodeId= 53
[mysqld]
NodeId= 54
[mysqld]
NodeId= 55
[mysqld]
NodeId= 56
[mysqld]
NodeId= 57
[mysqld]
NodeId= 58
[mysqld]
NodeId= 59
[mysqld]
NodeId= 60
[mysqld]
NodeId= 61
[mysqld]
NodeId= 62
[mysqld]
NodeId= 63
[mysqld]
NodeId= 64
[mysqld]
NodeId= 65
[mysqld]
NodeId= 66
[mysqld]
NodeId= 67
[mysqld]
NodeId= 68
[mysqld]
NodeId= 69
[mysqld]
NodeId= 70
[mysqld]
NodeId= 71
[mysqld]
NodeId= 72
[mysqld]
NodeId= 73
[mysqld]
NodeId= 74
[mysqld]
NodeId= 75
[mysqld]
NodeId= 76
[mysqld]
NodeId= 77
[mysqld]
NodeId= 78

===========================================

Looking at the config file there are no obvious reasons for the LCP watchdog timeout.
My suspicion is that it is the node failure handling that delays the GCP protocol.
It would be great if you could display a bit more output from the node log right
before the crash in the node that was master at the time of the crash. This node log
will have some output related to the node failure handling I think, could also be found
in cluster log possibly.

Node failure handling handles one node at a time, normally doesn't take much
time, but if there are many operations in process and many nodes involved it will
take a bit longer.

The workaround is to increase the timeout for LCP watchdogs.

Posted by developer:
 
I went through the logs provided.
The LCP watchdog failure is caused by a cluster failure.
The cluster crash is caused by the bug reporting:
CHECK BACKUPFORMAT::NDB_LCP_CTL_FILE_SIZE >= REAL_BYTES_READ FAILED
This is a known bug with a fix available.

This bug hits node 47 and node 48 and thus an entire node group
fails. For some reason the cluster failure is not discovered within
60 seconds. This is a bug in itself, it means that the
error logs show misleading messages, so it is the above bug
that is the only bug that needs fixing. I will look into logs
a bit more to see if I can understand why the cluster crash
isn't discovered much faster.

Fixed a long time ago under another bug.