Bug #98526 Forced shutdown at end of process when data node tries to join cluster
Submitted: 9 Feb 2020 21:26 Modified: 31 Mar 2020 20:14
Reporter: Shawn Hogan
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S3 (Non-critical)
Version: 7.6.13    OS: SUSE
Assigned to:    CPU Architecture: x86

[9 Feb 2020 21:26] Shawn Hogan
Description:
Data node has a forced shutdown at the end of its process of joining the cluster, with the following:

Error: 2341
Error data: DblqhMain.cpp
Error object: DBLQH (Line: 7222) 0x00000002 Check regTcPtr.p->m_dealloc_data.m_dealloc_ref_count == 1 failed
Program: ndbmtd
Pid: 7840 thr: 5
Version: mysql-5.7.29 ndb-7.6.13

The end of the out.log for the node:

2020-02-09 13:05:43 [ndbd] INFO     -- LDM(3): Completed copy of fragment T2121F8. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2020-02-09 13:05:43 [ndbd] INFO     -- LDM(3): Completed copy of fragment T2121F24. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2020-02-09 13:05:43 [ndbd] INFO     -- LDM(3): Completed copy of fragment T2122F8. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2020-02-09 13:05:43 [ndbd] INFO     -- LDM(3): Completed copy of fragment T2122F24. Changed +0/-0 rows, 0 bytes. 0 pct churn to 0 rows.
2020-02-09 13:05:43 [ndbd] INFO     -- LDM(3): Completed copy of fragment T2123F8. Changed +1/-1019 rows, 57360 bytes. 102000 pct churn to 1 rows.
2020-02-09 13:05:43 [ndbd] INFO     -- /export/home/pb2/build/sb_0-37345881-1576895655.5/rpm/BUILD/mysql-cluster-gpl-7.6.13/mysql-cluster-gpl-7.6.13/storage/ndb/src/kernel/blocks/dblqh/DblqhMain.cpp
2020-02-09 13:05:43 [ndbd] INFO     -- DBLQH (Line: 7222) 0x00000002 Check regTcPtr.p->m_dealloc_data.m_dealloc_ref_count == 1 failed
2020-02-09 13:05:43 [ndbd] INFO     -- Error handler shutting down system
2020-02-09 13:05:43 [ndbd] INFO     -- Error handler shutdown completed - exiting
2020-02-09 13:05:43 [ndbd] ALERT    -- Node 11: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

This happened to 2 different nodes today, and I'm scared to try again with either of them (or even do an --initial) because last time this happened it caused all the data nodes to fail in a beautiful cascading fashion.

This is a "fresh" cluster that had to be restored from backup due to this:  https://forums.mysql.com/read.php?25,683988

Cluster is online and usable (just minus these last 2 data nodes).

How to repeat:
Start a data node and let it try to sync itself into the cluster?  Although I'm too scared to do that, because something similar happened in the past and "trying again" caused the entire cluster to fail.
[9 Feb 2020 22:54] Shawn Hogan
Noticed the management node logs have a whole lot of warnings that failure handling of node 11 hasn't completed...

2020-02-09 14:47:19 [MgmtSrvr] WARNING  -- Node 17: Failure handling of node 11 has not completed in 6059 seconds - state = 6
2020-02-09 14:47:20 [MgmtSrvr] WARNING  -- Node 15: Failure handling of node 11 has not completed in 6059 seconds - state = 6
2020-02-09 14:47:20 [MgmtSrvr] WARNING  -- Node 18: Failure handling of node 11 has not completed in 6059 seconds - state = 6
2020-02-09 14:47:20 [MgmtSrvr] WARNING  -- Node 13: Failure handling of node 11 has not completed in 6059 seconds - state = 6
2020-02-09 14:47:20 [MgmtSrvr] WARNING  -- Node 16: Failure handling of node 11 has not completed in 6059 seconds - state = 6
2020-02-09 14:47:23 [MgmtSrvr] WARNING  -- Node 14: Failure handling of node 11 has not completed in 6059 seconds - state = 6

Which is probably why I also can't do a backup now...

ndb_mgm> start backup
Waiting for completed, this may take several minutes
Backup failed
*  3001: Could not start backup
*        Sequence failure: Permanent error: Internal error
ndb_mgm> Node 15: Backup request from 1 failed to start. Error: 1304

Through all this, I realized that maybe I haven't been able to start a data node successfully without --initial in either 7.6.13 or 8.0.19.  With 7.6.12 I definitely was able to do normal rolling restarts without issue.  But I'm also scared to try to downgrade from 7.6.13 to 7.6.12 at this point (I can't do a backup, and I'm not particularly eager to test whether retrying the start of data node 11 or 12 will take down the whole cluster [again]).
[11 Feb 2020 15:47] Frazer Clement
Thanks for submitting the bug report.

I have looked at the uploaded logs, and it seems that this is a bug in the node recovery code, related to synchronizing data when a table's primary key includes column(s) with non-binary comparisons (e.g. case-insensitive collations on varchar columns, etc.).

In both of the uploaded crashes, the table in question is table 2117, and it appears that there is perhaps an update to the primary key of the table where the old and new values are:
  - The same according to the collation
  - Different on a binary level

For example, if there were a case-insensitive varchar as the primary key, then a statement like this might be involved:

  update table_2117 set pk="AAA" where pk="aaa";
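
For illustration, a minimal sketch of a table where such an update could take this code path (the column names, types, and collation here are assumptions for the example, not taken from the uploaded logs):

  create table table_2117 (
    pk  varchar(32) character set latin1 collate latin1_swedish_ci not null,  -- case-insensitive collation
    val int,
    primary key (pk)
  ) engine=ndbcluster;

  -- "aaa" and "AAA" compare equal under the collation but differ at the byte level,
  -- so this update rewrites the stored key bytes without changing its collated value:
  update table_2117 set pk="AAA" where pk="aaa";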

Pending a bug fix, I guess you could try to work around or avoid this problem and restore redundancy.

You can determine which table is #2117 using either the ndb_show_tables command, or e.g. 
  select * from ndbinfo.dict_obj_info where id=2117;

Perhaps, as a workaround, the queries on that table could be modified, or the load on it reduced, to allow the data nodes to start up?
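
For example, if the application always stores key values in one canonical case, then rewriting a key to the same logical value becomes a byte-level no-op that the server reports as an unchanged row, so no problematic primary-key update reaches the data nodes. A sketch under that assumption, reusing the illustrative names above:

  -- assuming keys were inserted in lower case, canonicalizing the new value means
  -- the old and new key bytes are identical and the row is reported as unchanged
  update table_2117 set pk=lower("AAA") where pk="aaa";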

Sorry for these difficulties.
[11 Feb 2020 15:58] Shawn Hogan
That makes sense.  When the cluster could no longer perform backups or do failure recovery (even a disconnected SQL node couldn’t reconnect because its node IDs would never be freed up), I decided to force an entire cluster shutdown (I figured it would be better if I had to restore from a recent backup vs. wait a few days without a backup).  And since the entire cluster had to come down, there could be no query load at all (reads or writes) until it was back online (at which point everything seems to be working fine).

I’m fairly certain I know the table in question.  Just to be clear, in the future if we need to do rolling restarts or anything, should we stop all writes to that table first to be safe?  Not ideal of course, but the alternative of having a failed cluster isn’t good either.  Hah
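
One possible way to stop writes to just that table during a restart window would be to temporarily revoke write privileges on it, assuming a hypothetical application account app_user and schema mydb (neither name is from this report):

  -- run on each SQL node (grant tables are local to each mysqld unless
  -- distributed privileges have been set up)
  revoke insert, update, delete on mydb.table_2117 from 'app_user'@'%';
  -- ... perform the rolling restart, then restore access ...
  grant insert, update, delete on mydb.table_2117 to 'app_user'@'%';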
[31 Mar 2020 20:14] Jon Stephens
Documented fix as follows in the NDB 7.6.15 and 8.0.21 changelogs:

    Incorrect handling of operations on replicas during node
    restarts could result in a forced shutdown, or in content
    diverging between replicas, when primary keys with nonbinary
    (case-insensitive) equality conditions were used.

Closed.
[31 Mar 2020 20:14] Jon Stephens
Also fixed in NDB 7.5.19.