Bug #116507 Internal program error (failed ndbrequire) in NDB cluster
Submitted: 30 Oct 2024 15:16
Modified: 27 Nov 2024 1:24
Reporter: CunDi Fang
Status: Can't repeat
Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S3 (Non-critical)
Version: 8.0.40-cluster MySQL Cluster Community S
OS: Any
Assigned to: MySQL Verification Team
CPU Architecture: Any

[30 Oct 2024 15:16] CunDi Fang
Description:
In a cluster with 1 management node (node 1), 4 data nodes (nodes 2-5), and 2 SQL nodes (nodes 6, 7), the following error occurred:

Time: Friday 25 October 2024 - 11:37:39
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Backup.cpp
Error object: BACKUP (Line: 7297) 0x00000002 Check findTable(ptr, tabPtr, tableId) failed
Program: ndbd
Pid: 3218
Version: mysql-8.0.40 ndb-8.0.40
Trace file name: ndb_5_trace.log.1
Trace file path: /usr/local/mysql-cluster/data/ndb_5_trace.log.1 [t1..t1]
***EOM***
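For reference, a cluster of the shape described above might be defined by a config.ini along these lines. The management node address 192.168.10.8 appears in the log below; all other hostnames, node IDs, and the replica count are illustrative assumptions, not taken from the report:

```ini
# Illustrative NDB Cluster config.ini: 1 management node,
# 4 data nodes, 2 SQL nodes. Hostnames other than the
# management node's are hypothetical.
[ndbd default]
NoOfReplicas=2
DataDir=/usr/local/mysql-cluster/data

[ndb_mgmd]
NodeId=1
HostName=192.168.10.8

[ndbd]
NodeId=2
HostName=192.168.10.2

[ndbd]
NodeId=3
HostName=192.168.10.3

[ndbd]
NodeId=4
HostName=192.168.10.4

[ndbd]
NodeId=5
HostName=192.168.10.5

[mysqld]
NodeId=6
HostName=192.168.10.6

[mysqld]
NodeId=7
HostName=192.168.10.7
```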

How to repeat:
Here is the log file:
```
2024-10-25 11:37:35 [ndbd] INFO     -- LDM(0): We have completed restoring our fragments and executed REDO log and rebuilt ordered indexes
2024-10-25 11:37:35 [ndbd] INFO     -- Restore Database Off-line Completed
2024-10-25 11:37:35 [ndbd] INFO     -- Bring Database On-line Starting
2024-10-25 11:37:35 [ndbd] INFO     -- LDM(0): Starting to copy fragments.
2024-10-25 11:37:35 [ndbd] INFO     -- LDM(0): Completed copy of fragment T2F1. Changed +0/-1155 rows, 64680 bytes. 11550 pct churn to 10 rows.
2024-10-25 11:37:35 [ndbd] INFO     -- (0)Starting local LCP(22,1)
2024-10-25 11:37:35 [ndbd] INFO     -- LDM(0): Completed copy of fragment T2F3. Changed +0/-1156 rows, 64736 bytes. 12844 pct churn to 9 rows.
...a lot of copy...
2024-10-25 11:37:35 [ndbd] INFO     -- LDM(0): Completed copy of fragment T184F3. Changed +10/-1010 rows, 57440 bytes. 10200 pct churn to 10 rows.
2024-10-25 11:37:35 [ndbd] INFO     -- Bring Database On-line Completed
2024-10-25 11:37:35 [ndbd] INFO     -- Starting REDO logging
2024-10-25 11:37:37 [ndbd] INFO     -- Make On-line Database recoverable by waiting for LCP Starting, all parallel threads have now ceased their activity and we have a single wait state here
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
Base address/slide: 0x5cae9b572000
With use of addr2line, llvm-symbolizer, or, atos, subtract the addresses in
stacktrace with the base address before passing them to tool.
For tools that have options for slide use that, e.g.:
llvm-symbolizer --adjust-vma=0x5cae9b572000 ...
atos -s 0x5cae9b572000 ...
stack_bottom = 0 thread_stack 0x0
ndbd(my_print_stacktrace(unsigned char const*, unsigned long)+0x41) [0x5cae9baffbd1]
ndbd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x53) [0x5cae9ba58773]
ndbd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xfc) [0x5cae9baf02bc]
ndbd(Backup::execBACKUP_FRAGMENT_REQ(Signal*)+0x4a9) [0x5cae9b9fd2e9]
ndbd(+0x5f4680) [0x5cae9bb66680]
ndbd(FastScheduler::doJob(unsigned int)+0x115) [0x5cae9baea825]
ndbd(ThreadConfig::ipControlLoop(NdbThread*)+0x787) [0x5cae9bb01627]
ndbd(ndbd_run(bool, int, char const*, int, char const*, bool, bool, bool, unsigned int, int, int, unsigned long)+0x7e1) [0x5cae9b6b1361]
ndbd(real_main(int, char**)+0x4f5) [0x5cae9b6b19c5]
ndbd(angel_run(char const*, Vector<BaseString> const&, char const*, int, char const*, bool, bool, bool, int, int)+0x11c5) [0x5cae9b6b3055]
ndbd(real_main(int, char**)+0x87e) [0x5cae9b6b1d4e]
ndbd(+0x129982) [0x5cae9b69b982]
/lib/x86_64-linux-gnu/libc.so.6(+0x29d90) [0x73eb98b91d90]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0x80) [0x73eb98b91e40]
ndbd(_start+0x25) [0x5cae9b6a4ff5]
2024-10-25 11:37:39 [ndbd] INFO     -- /home/mysql-cluster-gpl-8.0.40/storage/ndb/src/kernel/blocks/backup/Backup.cpp
2024-10-25 11:37:39 [ndbd] INFO     -- BACKUP (Line: 7297) 0x00000002 Check findTable(ptr, tabPtr, tableId) failed
2024-10-25 11:37:39 [ndbd] INFO     -- Error handler shutting down system
2024-10-25 11:37:39 [ndbd] INFO     -- Error handler shutdown completed - exiting
2024-10-25 11:37:39 [ndbd] ALERT    -- Node 5: Forced node shutdown completed. Occurred during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2024-10-25 12:09:19 [ndbd] INFO     -- Angel pid: 3281 started child: 3282
2024-10-25 12:09:19 [ndbd] INFO     -- Normal start of data node using checkpoint and log info if existing
2024-10-25 12:09:19 [ndbd] INFO     -- Configuration fetched from '192.168.10.8:1186', generation: 1
2024-10-25 12:09:19 [ndbd] INFO     -- Changing directory to '/usr/local/mysql-cluster/data'
2024-10-25 12:09:19 [ndbd] INFO     -- SchedulerSpinTimer = 0
2024-10-25 12:09:19 [ndbd] INFO     -- AutomaticThreadConfig = 0, NumCPUs = 0
2024-10-25 12:09:19 [ndbd] INFO     -- MaxNoOfTriggers set to 1400
2024-10-25 12:09:19 [ndbd] INFO     -- NDBMT: non-mt
2024-10-25 12:09:19 [ndbd] WARNING  -- Running ndbd with a single thread of signal execution.  For multi-threaded signal execution run the ndbmtd binary.
```
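As the trace header above notes, the addresses must be rebased against the base address/slide before symbolization. A minimal sketch of that arithmetic, using the base address and the `my_print_stacktrace` frame from the trace above (the ndbd binary path is an assumption):

```shell
# Rebase a stack-trace address by subtracting the base/slide,
# then pass the offset to a symbolizer.
BASE=0x5cae9b572000          # "Base address/slide" from the trace
FRAME=0x5cae9baffbd1         # my_print_stacktrace frame from the trace
ADJ=$(printf '0x%x' $(( FRAME - BASE )))
echo "$ADJ"                  # -> 0x58dbd1, offset into the ndbd binary
# addr2line -f -e /usr/local/mysql-cluster/bin/ndbd "$ADJ"
# llvm-symbolizer can do the rebasing itself:
# llvm-symbolizer --adjust-vma=$BASE --obj=/usr/local/mysql-cluster/bin/ndbd $FRAME
```

The commented-out `addr2line`/`llvm-symbolizer` lines require a binary with debug symbols; the offset computation itself works on any shell with arithmetic expansion.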

Suggested fix:
The source of this error is a failed ndbrequire check on findTable(ptr, tabPtr, tableId) in Backup.cpp. The code attempts to look up the table identified by tableId via findTable, but the lookup fails, which triggers the ndbrequire failure and shuts the node down with error code 2341.

From the sequence of operations in the logs, the node first restored its fragments and rebuilt ordered indexes, then brought the database back online and started REDO logging to make it recoverable. At that point the node was copying many fragments while also starting a local LCP, which suggests some data inconsistency arose during the backup or recovery process, perhaps a stale or missing table reference while the fragments were being copied.
[31 Oct 2024 10:41] MySQL Verification Team
Hi,

Yes, please, we need the full logs. Is this a bare-metal, VM, or Docker setup?

This can be caused by a corrupted file system.

Thanks
[27 Nov 2024 1:24] MySQL Verification Team
Hi,

I cannot reproduce this.