Bug #90175 | Check BackupFormat::NDB_LCP_CTL_FILE_SIZE >= real_bytes_read failed | |
---|---|---|---
Submitted: | 22 Mar 2018 11:22 | Modified: | 8 Oct 2019 9:43
Reporter: | Daniël van Eeden (OCA) | Email Updates: |
Status: | Duplicate | Impact on me: |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical)
Version: | 7.6.4 | OS: | Any
Assigned to: | | CPU Architecture: | Any
[22 Mar 2018 11:22]
Daniël van Eeden
[22 Mar 2018 11:40]
Daniël van Eeden
From the logs (multiple machines):

LCP Frag watchdog : Checkpoint of table 4294967295 fragment 4294967295 too slow (no progress for > 60 s, state: LCP_WAIT_END_LCP).
m_curr_disk_write_speed: 170kb m_words_written_this_period: 0 kwords m_overflow_disk_write: 0 kb
m_reset_delay_used: 90 time since last RESET_DISK_SPEED: 63 millis
m_monitor_words_written : 0, duration : 368 millis, rate : 0 bytes/s : (0 pct of config)
BackupRecord 0: BackupId: 85989 MasterRef: 12f70014 ClientRef: 0 State: 2
noOfByte: 1044720 noOfRecords: 4669 noOfLogBytes: 0 noOfLogRecords: 0 errorCode: 0
file 19: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 18: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 17: type: 3 flags: H'0 tableId: 60 fragmentId: 0
file 16: type: 3 flags: H'0 tableId: 149 fragmentId: 0
file 15: type: 3 flags: H'0 tableId: 226 fragmentId: 0
file 14: type: 3 flags: H'0 tableId: 226 fragmentId: 0
file 13: type: 3 flags: H'0 tableId: 18 fragmentId: 0
file 12: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 11: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 10: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 9: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 8: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 7: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 6: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 5: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 4: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 3: type: 3 flags: H'0 tableId: 4294967040 fragmentId: 0
file 2: type: 1 flags: H'0 tableId: 4294967040 fragmentId: 0
file 1: type: 1 flags: H'0 tableId: 4294967040 fragmentId: 0
file 0: type: 1 flags: H'0 tableId: 4294967040 fragmentId: 0

The hardware: SATA SSD 1.6T on a raid controller.
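For context, the "LCP Frag watchdog" message above comes from a progress check: if the checkpoint's write counters stop advancing for longer than a limit (60 s here), the fragment checkpoint is flagged as too slow. The following is only a minimal, self-contained sketch of that kind of watchdog logic with hypothetical names, not the actual Backup/DBLQH code:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <thread>

// Hypothetical sketch of an LCP progress watchdog: if the words-written
// counter has not advanced for longer than a limit, report the checkpoint
// as "too slow" and let the caller decide how to escalate.
struct LcpFragWatchdog {
  using Clock = std::chrono::steady_clock;
  std::uint64_t last_words_written = 0;
  Clock::time_point last_progress = Clock::now();
  std::chrono::seconds limit{60};

  // Call periodically with the current "words written" counter.
  // Returns false once the stall exceeds the limit.
  bool check(std::uint64_t words_written_now) {
    if (words_written_now != last_words_written) {
      last_words_written = words_written_now;
      last_progress = Clock::now();  // progress observed, reset the timer
      return true;
    }
    const auto stalled = std::chrono::duration_cast<std::chrono::seconds>(
        Clock::now() - last_progress);
    if (stalled > limit) {
      std::printf("LCP Frag watchdog : too slow (no progress for > %lld s)\n",
                  static_cast<long long>(limit.count()));
      return false;  // caller decides how to escalate (e.g. stop the node)
    }
    return true;
  }
};

int main() {
  LcpFragWatchdog wd;
  wd.limit = std::chrono::seconds(1);  // shortened limit just for the demo
  wd.check(100);                       // counter moved: progress recorded
  std::this_thread::sleep_for(std::chrono::seconds(2));
  std::printf("still ok: %s\n", wd.check(100) ? "yes" : "no");  // stalled
}
```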
[23 Mar 2018 13:32]
Bernd Ocklin
The first issue is fixed in 7.6.5. For the second issue it would be great to see the configuration, as it looks like something that can be solved by reconfiguration. It is likely a GCP stop: the LCP has stalled right around finishing, where it only waits to delete the last set of LCP files. To delete a file, the GCP must reach the proper GCI. The only other explanation is that deleting the files takes forever, which I presume is not the problem.
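To make the dependency described above concrete: the final LCP step cannot remove the previous set of LCP files until the global checkpoint has advanced to the GCI the new LCP needs, so a stalled GCP keeps the LCP in its waiting state. A minimal sketch of that gating, with hypothetical names and not NDB's actual implementation:

```cpp
#include <cstdint>
#include <cstdio>

// Hypothetical illustration: the old LCP file set may only be removed once
// the cluster's completed GCI has reached the GCI recorded for the new LCP.
// If the GCP protocol stalls (a "GCP stop"), this wait can exceed the LCP
// watchdog limit even though the LCP itself has done all its work.
struct LcpFinalizer {
  std::uint64_t gci_needed_by_new_lcp;  // GCI the new LCP depends on
  bool old_files_deleted = false;

  void on_gcp_completed(std::uint64_t completed_gci) {
    if (!old_files_deleted && completed_gci >= gci_needed_by_new_lcp) {
      delete_old_lcp_files();
      old_files_deleted = true;
    }
  }

  void delete_old_lcp_files() {
    std::printf("deleting previous LCP file set\n");
  }
};

int main() {
  LcpFinalizer f{/*gci_needed_by_new_lcp=*/1000};
  f.on_gcp_completed(999);   // not yet: LCP stays in LCP_WAIT_END_LCP
  f.on_gcp_completed(1000);  // GCI reached: old files can be removed
}
```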
[23 Mar 2018 15:50]
Daniël van Eeden
The config with HostName filtered out:

===========================================
[tcp default]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[ndbd default]
NoOfReplicas= 2
DataDir= /var/lib/mysql-cluster
LockPagesInMainMemory=1
DataMemory=80G
ODirect=1
NoOfFragmentLogFiles=300
NoOfFragmentLogParts=12
MaxNoOfConcurrentOperations=200000
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
TransactionDeadlockDetectionTimeout=10000
RedoBuffer=64M
CompressedBackup=1
MaxNoOfTables=1024
MaxNoOfAttributes=10000
MaxNoOfOrderedIndexes=768
MaxNoOfExecutionThreads=24
MaxNoOfFiredTriggers=8000

[ndb_mgmd]
NodeId= 49

[ndbd] NodeId= 1
[ndbd] NodeId= 2
[ndbd] NodeId= 3
[ndbd] NodeId= 4
[ndbd] NodeId= 5
[ndbd] NodeId= 6
[ndbd] NodeId= 7
[ndbd] NodeId= 8
[ndbd] NodeId= 9
[ndbd] NodeId= 10
[ndbd] NodeId= 11
[ndbd] NodeId= 12
[ndbd] NodeId= 13
[ndbd] NodeId= 14
[ndbd] NodeId= 15
[ndbd] NodeId= 16
[ndbd] NodeId= 17
[ndbd] NodeId= 18
[ndbd] NodeId= 19
[ndbd] NodeId= 20
[ndbd] NodeId= 21
[ndbd] NodeId= 22
[ndbd] NodeId= 23
[ndbd] NodeId= 24
[ndbd] NodeId= 25
[ndbd] NodeId= 26
[ndbd] NodeId= 27
[ndbd] NodeId= 28
[ndbd] NodeId= 29
[ndbd] NodeId= 30
[ndbd] NodeId= 31
[ndbd] NodeId= 32
[ndbd] NodeId= 33
[ndbd] NodeId= 34
[ndbd] NodeId= 35
[ndbd] NodeId= 36
[ndbd] NodeId= 37
[ndbd] NodeId= 38
[ndbd] NodeId= 39
[ndbd] NodeId= 40
[ndbd] NodeId= 41
[ndbd] NodeId= 42
[ndbd] NodeId= 43
[ndbd] NodeId= 44
[ndbd] NodeId= 45
[ndbd] NodeId= 46
[ndbd] NodeId= 47
[ndbd] NodeId= 48
[mysqld] NodeId= 51
[mysqld] NodeId= 52
[mysqld] NodeId= 53
[mysqld] NodeId= 54
[mysqld] NodeId= 55
[mysqld] NodeId= 56
[mysqld] NodeId= 57
[mysqld] NodeId= 58
[mysqld] NodeId= 59
[mysqld] NodeId= 60
[mysqld] NodeId= 61
[mysqld] NodeId= 62
[mysqld] NodeId= 63
[mysqld] NodeId= 64
[mysqld] NodeId= 65
[mysqld] NodeId= 66
[mysqld] NodeId= 67
[mysqld] NodeId= 68
[mysqld] NodeId= 69
[mysqld] NodeId= 70
[mysqld] NodeId= 71
[mysqld] NodeId= 72
[mysqld] NodeId= 73
[mysqld] NodeId= 74
[mysqld] NodeId= 75
[mysqld] NodeId= 76
[mysqld] NodeId= 77
[mysqld] NodeId= 78
===========================================
[28 Mar 2018 17:34]
Mikael Ronström
Looking at the config file, there are no obvious reasons for the LCP watchdog timeout. My suspicion is that it is the node failure handling that delays the GCP protocol. It would be great if you could show a bit more output from the node log right before the crash, from the node that was master at the time of the crash. That node log should contain some output related to the node failure handling; it could possibly also be found in the cluster log. Node failure handling processes one node at a time and normally does not take much time, but if there are many operations in progress and many nodes involved it will take a bit longer. The workaround is to increase the timeout for LCP watchdogs.
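A hedged illustration of that workaround, assuming the limit meant here is the data node parameter LcpScanProgressTimeout (in seconds, default 60); the value below is only an example:

```
[ndbd default]
# Assumption: LcpScanProgressTimeout is the LCP frag watchdog limit being
# referred to above; raise it from the default 60 s to, e.g., 180 s.
LcpScanProgressTimeout=180
```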
[9 Apr 2018 22:42]
Mikael Ronström
Posted by developer: I went through the logs provided. The LCP watchdog failure is caused by a cluster failure. The cluster crash is caused by the error reported here:

Check BackupFormat::NDB_LCP_CTL_FILE_SIZE >= real_bytes_read failed

This is a known bug with a fix available. It hits node 47 and node 48, and thus an entire node group fails. For some reason the cluster failure is not discovered within 60 seconds. That is a problem in itself, since it means the error logs show misleading messages, but the check failure above is the only bug that needs fixing. I will look into the logs a bit more to see if I can understand why the cluster crash isn't discovered much faster.
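For readers unfamiliar with the failing check: the LCP control file has a fixed maximum on-disk size (BackupFormat::NDB_LCP_CTL_FILE_SIZE), and the assertion fires when a read returns more bytes than that size allows. The following is only a minimal sketch of such a sanity check with a hypothetical constant and reader function, not the actual ndbrequire site in the Backup block:

```cpp
#include <cstddef>
#include <cstdio>
#include <cstdlib>
#include <fstream>
#include <vector>

// Hypothetical constant standing in for BackupFormat::NDB_LCP_CTL_FILE_SIZE:
// the size an LCP control file is expected never to exceed.
constexpr std::size_t kLcpCtlFileSize = 4096;

// Read an LCP control file and apply the same kind of sanity check that the
// reported assertion expresses: the bytes actually read must fit within the
// declared control-file size, otherwise the file is considered corrupt.
std::vector<char> read_lcp_ctl_file(const char* path) {
  std::ifstream in(path, std::ios::binary);
  std::vector<char> buf(kLcpCtlFileSize + 1);  // +1 so an oversized file is detectable
  in.read(buf.data(), static_cast<std::streamsize>(buf.size()));
  const std::size_t real_bytes_read = static_cast<std::size_t>(in.gcount());

  if (!(kLcpCtlFileSize >= real_bytes_read)) {
    std::fprintf(stderr,
                 "Check NDB_LCP_CTL_FILE_SIZE >= real_bytes_read failed "
                 "(%zu < %zu)\n", kLcpCtlFileSize, real_bytes_read);
    std::abort();  // in a data node this ends as a forced shutdown
  }
  buf.resize(real_bytes_read);
  return buf;
}

int main(int argc, char** argv) {
  if (argc > 1) {
    auto data = read_lcp_ctl_file(argv[1]);
    std::printf("read %zu bytes\n", data.size());
  }
}
```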
[8 Oct 2019 9:43]
Mikael Ronström
Fixed a long time ago under another bug.