Bug #91161 NDB crash on Check createTableVersion == lqhCreateTableVersion || lqhCreateTable
Submitted: 6 Jun 2018 14:31 Modified: 13 Jun 2018 3:45
Reporter: Daniël van Eeden (OCA) Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:7.6.6 OS:Any
Assigned to: Bogdan Kecman CPU Architecture:Any

[6 Jun 2018 14:31] Daniël van Eeden
Description:
2018-06-06 16:16:55 [ndbd] INFO     -- /export/home2/pb2/build/sb_1-28710439-1526985206.38/rpm/BUILD/mysql-cluster-gpl-7.6.6/mysql-cluster-gpl-7.6.6/storage/ndb/src/kernel/blocks/backup/Backup.cpp
2018-06-06 16:16:55 [ndbd] INFO     -- BACKUP (Line: 11209) 0x00000002 Check createTableVersion == lqhCreateTableVersion || lqhCreateTableVersion == 0 failed
2018-06-06 16:16:55 [ndbd] INFO     -- Error handler shutting down system
2018-06-06 16:16:56 [ndbd] INFO     -- Error handler shutdown completed - aborting
2018-06-06 16:17:10 [ndbd] INFO     -- Child process terminated by signal 6
2018-06-06 16:17:10 [ndbd] ALERT    -- Node 26: Forced node shutdown completed. Initiated by signal 6. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Wednesday 6 June 2018 - 16:16:55
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: Backup.cpp
Error object: BACKUP (Line: 11209) 0x00000002 Check createTableVersion == lqhCreateTableVersion || lqhCreateTableVersion == 0 failed
Program: ndbmtd
Pid: 49603 thr: 7
Version: mysql-5.7.22 ndb-7.6.6
Trace file name: ndb_26_trace.log.18_t7
Trace file path: /var/lib/mysql-cluster/ndb_26_trace.log.18 [t1..t17]
***EOM***
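The failed check can be read as an invariant: the table version recorded for the backup/LCP must match the version LQH knows for that table, unless LQH has no version recorded at all (0). A minimal illustrative model follows; the function name is hypothetical and this is not the actual NDB source, just the boolean condition from the error message.

```python
# Illustrative model (hypothetical name, not NDB source) of the invariant
# asserted in Backup.cpp:
#   createTableVersion == lqhCreateTableVersion || lqhCreateTableVersion == 0
def table_version_consistent(create_table_version: int,
                             lqh_create_table_version: int) -> bool:
    return (create_table_version == lqh_create_table_version
            or lqh_create_table_version == 0)

print(table_version_consistent(3, 3))  # matching versions: check passes
print(table_version_consistent(3, 0))  # LQH has no version yet: check passes
print(table_version_consistent(3, 2))  # stale version: ndbrequire fails
```

When the condition evaluates false, the ndbrequire fires and the data node shuts itself down with error 2341, which is the forced shutdown seen in the log above.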

How to repeat:
Not known yet. It happened on startup of a cluster during a 7.6.4 to 7.6.6 upgrade.
[11 Jun 2018 13:20] Bogdan Kecman
Hi Daniel,

> Happened on startup of a cluster for a 7.6.4 to 7.6.6 upgrade.

How was the upgrade performed? This looks like it happened during "START BACKUP" and not during the upgrade?

all best
Bogdan
[12 Jun 2018 20:18] Daniël van Eeden
I started to do a rolling upgrade.

From memory:
I started with node 48, which went ok.
Then I did node 47, which crashed (maybe Bug #90606)
Then I did node 46, which went ok.
Then replication to the cluster stopped (see Bug #91160).
Then I did a full cluster shutdown and start. I think that was when this crash happened on node 26.

I think I had to start node 47 and one other node (26?) with --initial to get the cluster 100% healthy again.
[13 Jun 2018 3:45] Bogdan Kecman
Hi,
Thanks for the explanation. The new code has some additional logging, so if this happens again (with newer code) we'll have more info to go on. But with the info we have now there's nothing we can do to find out why it happened, and I can't reproduce this using your flow.

all best
Bogdan
[14 Jun 2018 5:29] Mikael Ronström
This can happen in the following scenario:
1) Node X stops
2) Drop table
3) Cluster restart including start of node X
4) Create table
5) Crash

The reason is that the DROP TABLE isn't performed as part of the cluster restart in node X.
This leaves LCP files from the old table behind, and when this table id is reused by
a new table we discover an anomaly in the LCP execution code.
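The scenario above can be sketched as a toy model. All names here are hypothetical and this is not the actual NDB code: it only shows how an LCP file that survives a missed DROP TABLE carries a stale table version that surfaces once the table id is reused.

```python
# Toy model (hypothetical names, not NDB source) of the crash scenario:
# a node that misses a DROP TABLE keeps the old table's LCP file, and the
# stale version is detected when the table id is reused by a new table.

lcp_files = {}       # table_id -> table version stored in the LCP file on disk
cluster_tables = {}  # table_id -> current table version in the dictionary

def create_table(table_id, version):
    cluster_tables[table_id] = version
    lcp_files[table_id] = version          # LCP later written for this table

def drop_table(table_id, node_is_up=True):
    del cluster_tables[table_id]
    if node_is_up:
        lcp_files.pop(table_id, None)      # if node X is down, file survives

create_table(7, version=1)                 # 1) node X stops with this LCP file
drop_table(7, node_is_up=False)            # 2) DROP TABLE while node X is down
cluster_tables[7] = 2                      # 3+4) restart; table id 7 is reused

# 5) LCP execution finds version 1 on disk but version 2 in the dictionary
stale = lcp_files[7] != cluster_tables[7] and lcp_files[7] != 0
print(stale)  # True -> the ndbrequire fires and the node crashes
```

In the model, a node that is up when the table is dropped deletes the LCP file, so the check never fires on a normal drop/create cycle.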

We will implement two solutions to this problem.

The short-term solution is not to crash in this situation, but simply to remove the
old files and create new ones.

The long-term solution is to ensure that cluster restarts properly clean up old dropped tables,
even when the node was down while the DROP TABLE occurred.