Bug #116421  BACKUP Check findTable failed, node can not restart, in NDB Cluster
Submitted: 19 Oct 16:46    Modified: 24 Oct 7:42
Reporter: CunDi Fang
Status: Can't repeat    Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: 8.0.35-cluster MySQL Cluster Community Server    OS: Any
Assigned to: MySQL Verification Team    CPU Architecture: Any

[19 Oct 16:46] CunDi Fang
Description:
Here is the error report from the failed data node:

Current byte-offset of file-pointer is: 1067

Time: Friday 18 October 2024 - 18:21:25
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Backup.cpp
Error object: BACKUP (Line: 7269) 0x00000002 Check findTable(ptr, tabPtr, tableId) failed
Program: ndbd
Pid: 103403
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_5_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_5_trace.log.1 [t1..t1]
***EOM***
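
From the error message and the stack trace below, the crash comes from a failed ndbrequire()-style check in Backup::execBACKUP_FRAGMENT_REQ() that looks up the table referenced by a BACKUP_FRAGMENT_REQ signal. The following is only a minimal sketch of the shape of that check, not the actual Backup.cpp source; apart from the findTable(ptr, tabPtr, tableId) call, the error text, and error 2341, all names and structures here are assumptions for illustration:

// Illustrative sketch only -- mimics the shape of the failing check, not the real NDB code.
// Only findTable(ptr, tabPtr, tableId), the error text, and error 2341 come from the report;
// the struct layouts, the std::unordered_map, and main() are assumed for illustration.
#include <cstdint>
#include <cstdio>
#include <cstdlib>
#include <unordered_map>

using Uint32 = std::uint32_t;

struct TablePtr { Uint32 tableId = 0; };          // stand-in for the table record pointer
struct BackupRecordPtr {
  std::unordered_map<Uint32, TablePtr> tables;    // tables registered for this backup/LCP
};

// Returns true and fills tabPtr if tableId is known to this backup record.
static bool findTable(const BackupRecordPtr &ptr, TablePtr &tabPtr, Uint32 tableId) {
  auto it = ptr.tables.find(tableId);
  if (it == ptr.tables.end()) return false;
  tabPtr = it->second;
  return true;
}

// Stand-in for ndbrequire(): a failed check is reported and the data node shuts down.
#define ndbrequire(cond)                                              \
  do {                                                                \
    if (!(cond)) {                                                    \
      std::fprintf(stderr, "Check %s failed\n", #cond);               \
      std::exit(1);                                                   \
    }                                                                 \
  } while (0)

// Shape of the failing path: a BACKUP_FRAGMENT_REQ arrives for a table id that the
// backup record does not know about, so findTable() returns false and ndbrequire() aborts.
static void execBACKUP_FRAGMENT_REQ(BackupRecordPtr &ptr, Uint32 tableId) {
  TablePtr tabPtr;
  ndbrequire(findTable(ptr, tabPtr, tableId));    // corresponds to Backup.cpp line 7269
  // ... the real code would then scan the fragment and write it to the backup/LCP files ...
}

int main() {
  BackupRecordPtr rec;                 // empty backup record: no tables registered
  execBACKUP_FRAGMENT_REQ(rec, 194);   // hypothetical table id -> triggers the failed check
  return 0;
}

With an empty backup record the sketch prints "Check findTable(ptr, tabPtr, tableId) failed" and exits, which mirrors what the data node reports just before the forced shutdown.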

How to repeat:
This is how the failure is recorded in the data node log:

2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T193F1. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T193F3. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T194F1. Changed +0/-1011 rows, 56616 bytes. 11233 pct churn to 9 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T194F3. Changed +0/-1012 rows, 56672 bytes. 12650 pct churn to 8 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T195F1. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T195F3. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T196F1. Changed +0/-1011 rows, 56616 bytes. 11233 pct churn to 9 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T196F3. Changed +0/-1012 rows, 56672 bytes. 12650 pct churn to 8 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T198F1. Changed +0/-611 rows, 34216 bytes. 3818 pct churn to 16 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T198F3. Changed +0/-617 rows, 34552 bytes. 6170 pct churn to 10 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T199F1. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- LDM(0): Completed copy of fragment T199F3. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2024-10-18 18:21:19 [ndbd] INFO     -- Bring Database On-line Completed
2024-10-18 18:21:19 [ndbd] INFO     -- Starting REDO logging
2024-10-18 18:21:21 [ndbd] INFO     -- Make On-line Database recoverable by waiting for LCP Starting, all parallel threads have now ceased their activity and we have a single wait state here
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
Base address/slide: 0x63d6d52ac000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x63d6d52ac000 ...
atos -s 0x63d6d52ac000 ...
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x63d6d5802761]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x53) [0x63d6d575fef3]
ndbd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xfc) [0x63d6d57eefac]
ndbd(Backup::execBACKUP_FRAGMENT_REQ(Signal*)+0x4a9) [0x63d6d5721f29]
ndbd(+0x5ba5d0) [0x63d6d58665d0]
ndbd(FastScheduler::doJob(unsigned int)+0x115) [0x63d6d57ea865]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x787) [0x63d6d58003b7]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x1248) [0x63d6d53e38e8]
ndbd(real_main(int, char**)+0x4f5) [0x63d6d53e40b5]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x11c5) [0x63d6d53e5745]
ndbd(real_main(int, char**)+0x87e) [0x63d6d53e443e]
ndbd(+0x1223a2) [0x63d6d53ce3a2]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x7721d89ecd90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x7721d89ece40]
ndbd(_start+0x25) [0x63d6d53d7665]
2024-10-18 18:21:25 [ndbd] INFO     -- /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/blocks/backup/Backup.cpp
2024-10-18 18:21:25 [ndbd] INFO     -- BACKUP (Line: 7269) 0x00000002 Check findTable(ptr, tabPtr, tableId) failed
2024-10-18 18:21:25 [ndbd] INFO     -- Error handler shutting down system
2024-10-18 18:21:25 [ndbd] INFO     -- Error handler shutdown completed - exiting
2024-10-18 18:21:25 [ndbd] ALERT    -- Node 5: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
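
For reference, the one frame the trace leaves unresolved, ndbd(+0x5ba5d0) at 0x63d6d58665d0, can be symbolized as the trace itself suggests by adjusting for the base address 0x63d6d52ac000. For example (the path to the ndbd binary is an assumption based on the install prefix above):

llvm-symbolizer --obj=/usr/local/mysql-cluster/bin/ndbd --adjust-vma=0x63d6d52ac000 0x63d6d58665d0

or, after subtracting the base address manually (0x63d6d58665d0 - 0x63d6d52ac000 = 0x5ba5d0):

addr2line -e /usr/local/mysql-cluster/bin/ndbd -fC 0x5ba5d0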

I'll provide all the log files I can find.
[23 Oct 23:28] MySQL Verification Team
Hi,

I cannot reproduce this. It looks like disk corruption rather than a bug.
[24 Oct 7:36] CunDi Fang
I'm not sure why this is happening. I checked the disk and didn't find anything wrong with it. From what I can tell, something happened during the data replication process that caused a table not to be found, which ultimately caused the replication to fail. I have found the SQL statements that node 5 was executing at the time; I don't know if that helps.
[24 Oct 7:42] CunDi Fang
In the meantime, I looked at the logs of the management node and found that nothing was being performed at that time that would change the cluster state, such as node joins or node restarts. So it is strange that data replication was happening at all. Maybe this is a point to start with? I can give you the SQL sequence executed by all nodes at that time if you want.