MySQL Bugs: #105010: Forced shutdown of data node immediately after completion of restart

Bug #105010	Forced shutdown of data node immediately after completion of restart
Submitted:	22 Sep 2021 16:31	Modified:	23 Sep 2021 16:50
Reporter:	Shawn Hogan
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	8.0.26	OS:	SUSE
Assigned to:	MySQL Verification Team	CPU Architecture:	Any

Description:
In reference to this:  https://forums.mysql.com/read.php?25,698929,698929

We decided to move our cluster to 8.x. After what appeared to be a successful rolling restart of the first data node (it took 90 minutes, but that's normal since each data node has 180G allocated to it), the node was forcibly shut down 4 seconds after it started.

2021-09-21 17:08:26 [ndbd] INFO -- starting
2021-09-21 17:08:26 [ndbd] INFO -- Start phase 101 completed
2021-09-21 17:08:26 [ndbd] INFO -- Phase 101 was used by SUMA to take over responsibility for sending some of the asynchronous change events
2021-09-21 17:08:26 [ndbd] INFO -- Node started
2021-09-21 17:08:26 [ndbd] INFO -- Node 11 has completed its restart
For help with below stacktrace consult:
https://dev.mysql.com/doc/refman/en/using-stack-trace.html
Also note that stack_bottom and thread_stack will always show up as zero.
stack_bottom = 0 thread_stack 0x0
/usr/sbin/ndbmtd(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x8cf4be]
/usr/sbin/ndbmtd(ndb_print_stacktrace()+0x45) [0x877f55]
/usr/sbin/ndbmtd(ErrorReporter::handleError(int, char const*, char const*, NdbShutdownType)+0x20) [0x82a810]
/usr/sbin/ndbmtd(SimulatedBlock::progError(int, int, char const*, char const*) const+0xf9) [0x891ae9]
/usr/sbin/ndbmtd(Dbtc::sendlqhkeyreq(Signal*, unsigned int, Dbtc::CacheRecord*, Dbtc::ApiConnectRecord*)+0x541) [0x69d541]
/usr/sbin/ndbmtd(Dbtc::packLqhkeyreq(Signal*, unsigned int, Ptr<Dbtc::CacheRecord>, Ptr<Dbtc::ApiConnectRecord>)+0x35) [0x6d4055]
/usr/sbin/ndbmtd(Dbtc::attrinfoDihReceivedLab(Signal*, Ptr<Dbtc::CacheRecord>, Ptr<Dbtc::ApiConnectRecord>)+0x135) [0x6d6f55]
/usr/sbin/ndbmtd(Dbtc::execTCKEYREQ(Signal*)+0x8ba) [0x6d909a]
/usr/sbin/ndbmtd() [0x89f898]
/usr/sbin/ndbmtd() [0x8a46ea]
/usr/sbin/ndbmtd(mt_job_thread_main+0x4ed) [0x8aa4fd]
/usr/sbin/ndbmtd() [0x872b36]
/lib64/libpthread.so.0(+0x84f9) [0x7f403f5094f9]
/lib64/libc.so.6(clone+0x3f) [0x7f403de06f2f]
2021-09-21 17:08:30 [ndbd] INFO -- /var/lib/pb2/sb_1-3697723-1625149027.99/rpm/BUILD/mysql-cluster-gpl-8.0.26/mysql-cluster-gpl-8.0.26/storage/ndb/src/kernel/blocks/dbtc/DbtcMain.cpp
2021-09-21 17:08:30 [ndbd] INFO -- DBTC (Line: 4879) 0x00000002 Check refToMain(TBRef) == 0xF7 failed
2021-09-21 17:08:30 [ndbd] INFO -- Error handler shutting down system
2021-09-21 17:08:32 [ndbd] ALERT -- Node 11: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
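
To pull the key facts out of a node log like the one above, a small parser can help. The following is a minimal Python sketch, assuming the log line layout shown in this report; the regular expressions and the function name parse_node_log are illustrative and are not part of any NDB tooling.

```python
import re
from datetime import datetime

# Excerpt of the node log from this report.
LOG = """\
2021-09-21 17:08:26 [ndbd] INFO -- Node started
2021-09-21 17:08:26 [ndbd] INFO -- Node 11 has completed its restart
2021-09-21 17:08:30 [ndbd] INFO -- DBTC (Line: 4879) 0x00000002 Check refToMain(TBRef) == 0xF7 failed
2021-09-21 17:08:30 [ndbd] INFO -- Error handler shutting down system
"""

# Assumed layout: "<timestamp> [ndbd] <LEVEL> -- <message>"
LINE_RE = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[ndbd\] (\w+) -- (.*)$")
# Assumed layout of a failed ndbrequire message.
CHECK_RE = re.compile(r"^(\w+) \(Line: (\d+)\) \S+ Check (.+) failed$")


def parse_node_log(log):
    """Return (time of 'Node started', details of the first failed check)."""
    started = None
    failure = None
    for raw in log.splitlines():
        line = LINE_RE.match(raw)
        if not line:
            continue  # stack trace lines and other unstructured output
        ts = datetime.strptime(line.group(1), "%Y-%m-%d %H:%M:%S")
        msg = line.group(3)
        if msg == "Node started":
            started = ts
        check = CHECK_RE.match(msg)
        if check and failure is None:
            failure = {
                "time": ts,
                "block": check.group(1),
                "line": int(check.group(2)),
                "check": check.group(3),
            }
    return started, failure


started, failure = parse_node_log(LOG)
uptime = (failure["time"] - started).total_seconds()
print(f"{failure['block']} line {failure['line']}: "
      f"'{failure['check']}' failed {uptime:.0f}s after node start")
```

On this excerpt the sketch reports the failed check in DBTC at line 4879, 4 seconds after "Node started", which matches the description.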

How to repeat:
Do a rolling restart of a data node as part of the process of upgrading from 7.6 to 8.0 on a live/in-use cluster.

Suggested fix:
According to Mikael Ronström:

Your options are the following.
1) Upgrade using a Cluster restart
This will ensure that no DBSPJ queries will happen during the upgrade
of the cluster
2) Upgrade as now, but ensure that no MySQL Server is doing any pushdown
joins while upgrading (there is a configuration parameter in the MySQL
servers to set this). It is ok to handle DBSPJ queries when all nodes
have upgraded.
3) Fix the binary by simply removing the line with ndbrequire and compile a
new binary.
4) Wait for a release that has a fix for this.

As for option 2):
"2) Upgrade as now, but ensure that no MySQL Server is doing any pushdown
joins while upgrading (there is a configuration parameter in the MySQL
servers to set this). It is ok to handle DBSPJ queries when all nodes
have upgraded."

this is the parameter to change dynamically: ndb_join_pushdown
 
https://dev.mysql.com/doc/mysql-cluster-excerpt/8.0/en/mysql-cluster-system-variables.html...
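
A rough sketch of how option 2 could be scripted: the Python helper below only builds the SQL statements to run on each MySQL Server before and after the rolling restart; it does not connect to anything. The host names are hypothetical placeholders, and the function names are made up for illustration. Note that, as with other MySQL variables that have both global and session scope, SET GLOBAL is expected to affect only sessions opened afterwards, so existing connections may need to reconnect or set the session value themselves.

```python
from typing import List, Tuple

# Hypothetical SQL node host names; replace with the real MySQL Servers.
SQL_NODES = ["sqlnode1.example.com", "sqlnode2.example.com"]


def pushdown_statements(enable: bool) -> List[str]:
    """SQL to switch ndb_join_pushdown and then show the resulting value."""
    value = "ON" if enable else "OFF"
    return [
        f"SET GLOBAL ndb_join_pushdown = {value};",
        "SHOW GLOBAL VARIABLES LIKE 'ndb_join_pushdown';",
    ]


def upgrade_plan(sql_nodes: List[str], enable: bool) -> List[Tuple[str, str]]:
    """Pair every SQL node with every statement it should run."""
    return [(host, stmt) for host in sql_nodes for stmt in pushdown_statements(enable)]


# Before the rolling restart: disable pushdown joins on every MySQL Server.
for host, stmt in upgrade_plan(SQL_NODES, enable=False):
    print(f"[{host}] {stmt}")

# After all data nodes have been upgraded: re-enable pushdown joins.
for host, stmt in upgrade_plan(SQL_NODES, enable=True):
    print(f"[{host}] {stmt}")
```

The statements can then be fed to whatever client or automation is already used to administer the SQL nodes.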

Hi Shawn

I verified the behavior. You already have the workaround, thanks to Mikael.

When jumping between major versions, I myself like to take a backup and then upgrade using a cluster restart [1], but disabling pushdown [2] will work too. The first one will be faster but involves some downtime; the second can, in theory, be done without downtime :).

Thanks for the report

Documented fix as follows in the NDB 8.0.28 changelog:

    Following the rolling restart of a data node performed as part
    of an upgrade from NDB 7.6 to NDB 8.0, the data node underwent a
    forced shutdown. We fix this by allowing LQHKEYREQ to be sent to
    both the DBLQH and the DBSPJ NDB kernel blocks.
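
As an illustration of the changelog entry, the failed check from the log (refToMain(TBRef) == 0xF7) can be modeled as an allow-list of destination blocks. This is a toy Python model, not the actual DbtcMain.cpp code. That 0xF7 is the DBLQH block number is inferred from the log and the changelog together, and the DBSPJ value 0x108 is an assumption that should be checked against the NDB sources.

```python
# Main block numbers used by this toy model.
DBLQH = 0xF7   # value taken from the failed check in the log
DBSPJ = 0x108  # assumption: verify against the NDB sources before relying on it


def check_before_fix(main_block: int) -> bool:
    """Models the old check: LQHKEYREQ may only go to DBLQH."""
    return main_block == DBLQH


def check_after_fix(main_block: int) -> bool:
    """Models the documented fix: LQHKEYREQ may go to DBLQH or DBSPJ."""
    return main_block in (DBLQH, DBSPJ)


for name, block in (("DBLQH", DBLQH), ("DBSPJ", DBSPJ)):
    print(f"{name}: before fix={check_before_fix(block)}, "
          f"after fix={check_after_fix(block)}")
```

In this model a request destined for DBSPJ fails the old check, which corresponds to the forced shutdown in the report, and passes the relaxed one.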

Closed.