Bug #88093 NDB Cluster: Internal program error (failed ndbrequire) - DBTC (Line: 8566) 0x00
Submitted: 14 Oct 2017 0:50 Modified: 19 Oct 2017 17:47
Reporter: Thomas Waibel-BGo Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Cluster: plugin Severity:S2 (Serious)
Version:mysql-5.7.19 ndb-7.5.7 OS:Debian (Debian 7 (Jessie))
Assigned to: Bogdan Kecman CPU Architecture:Any
Tags: disk based tables, High write load

[14 Oct 2017 0:50] Thomas Waibel-BGo
Description:
Hi,
seems like during a 
SET NDB_BATCH_SIZE=30*1024*1024;
ALTER TABLE schema1.tb1 ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (key1); ANALYZE TABLE schema1.tb1;
one of our data nodes crashed.

Time: Saturday 14 October 2017 - 01:05:16
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data:
Error object: DBTC (Line: 8566) 0x00000000
Program: ndbmtd
Pid: 11232 thr: 14
Version: mysql-5.7.19 ndb-7.5.7
Trace file name: ndb_6_trace.log.11_t14
Trace file path: /var/lib/mysql-cluster/ndb_6_trace.log.11 [t1..t17]
***EOM**

How to repeat:
Not able to repeat up to now. Attached the ndb_error_reporter output.
[14 Oct 2017 0:54] Thomas Waibel-BGo
File of ndb_error_reporter is 19MByte - your system only allows 3MByte.

File is available here:
https://data.boerse-go.de/ndb_error_report_20171014015957.tar.bz2
[14 Oct 2017 9:34] Thomas Waibel-BGo
And I reproduced after the restart of the failed node.

Executing a script that contains those exact statements.
root@mysql07.dc1:~# mysql -pXXXXXXXXXXXXXXX instruments </tmp/mysql-alter-engine-ndb.sql
Warning: Using a password on the command line interface can be insecure.
ERROR 1297 (HY000) at line 2: Got temporary error 4031 'Node failure caused abort of transaction' from NDBCLUSTER

Line 2 is the same table as with the first crash
SET SQL_LOG_BIN = 0; SET NDB_BATCH_SIZE=30*1024*1024;
ALTER TABLE db1.table1 ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (key1); ANALYZE TABLE db1.table1;

There were a lot of other tables we migrated before which worked without problems...
[15 Oct 2017 12:35] Thomas Waibel-BGo
I just left that table and continued migration the next tables. Works fine...
[16 Oct 2017 8:41] Thomas Waibel-BGo
Looking throught the source code and then the logs before the crash it seems the crash is related to https://bugs.mysql.com/bug.php?id=84986

The MySQL API node doing the alter table statement has a max_allowed_packet size of 32MByte.
[16 Oct 2017 9:27] Thomas Waibel-BGo
Uuups - wrong system
mysql> show variables like 'max%allowed%';
+--------------------+----------+
| Variable_name      | Value    |
+--------------------+----------+
| max_allowed_packet | 16777216 |
+--------------------+----------+
1 row in set (0.00 sec)

mysql>

So it is 16 MByte...
[16 Oct 2017 14:08] Thomas Waibel-BGo
Built a Test Environment with almost same settings (Reduced DataMemory, IndexMemory, MaxNoOfConcurrentTransactions, MaxNoOfConcurrentOperations, etc.)
and was able to crash a node.

After reducing the max_allowed_packet variable from 16M to 1M it still crashes...
[16 Oct 2017 14:16] Thomas Waibel-BGo
root@mysql06.dc1:~# sftp -oPort=2021 -oUser=XXXXXXXX sftp.oracle.com:/support/incoming
no matching mac found: client hmac-sha2-512-etm@openssh.com,hmac-sha2-256-etm@openssh.com,hmac-ripemd160-etm@openssh.com,umac-128-etm@openssh.com,hmac-sha2-512,hmac-sha2-256,hmac-ripemd160,umac-128@openssh.com server hmac-md5,hmac-sha1,hmac-md5-96,hmac-sha1-96,hmac-sha256,hmac-sha256@ssh.com
Couldn't read packet: Connection reset by peer
root@mysql06.dc1:~#

Häh? Why do you not just use ftp instead?????
[16 Oct 2017 14:19] Thomas Waibel-BGo
Test system ndb_error-reporter output

Attachment: ndb_error_report_20171016161152.tar.bz2 (application/octet-stream, text), 744.17 KiB.

[16 Oct 2017 18:29] Thomas Waibel-BGo
On the test system:
- Dropped the PRIMARY Key (product,timestamp) and the INDEX timestamp 
- Created a new PRIMARY KEY (timestamp,product) 
- ALTER TABLE t1 ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (timestamp) - works
- Dropped the PRIMARY Key (timestamp,product) and created the old PRIMARY KEY (product,timestamp) again - works
- ALTER TABLE t1 PARTITION BY KEY (product) - works
- Add INDEX (timestamp) - works

So the table is migrated now - there was not one log entry like these before
sleeploop 10!! (Worker thread blocked (>= 10ms) by slow consumer threads)
[ndbd] WARNING  -- thr: 2: Overslept 2521 ms, expected ~10ms

This is on the test system - but I believe that works also in production - which I find very strange...
[17 Oct 2017 9:23] Thomas Waibel-BGo
The crash solely depends on the partitioning
---
root@mysql06.dc1:~# mysql --defaults-extra-file=/etc/mysql/user-root.cnf db1 -e 'SET SQL_LOG_BIN =0; ALTER TABLE t1 ENGINE=InnoDB TABLESPACE innodb_file_per_table;'
root@mysql06.dc1:~# mysql --defaults-extra-file=/etc/mysql/user-root.cnf db1 -e 'SET SQL_LOG_BIN = 0; SET NDB_BATCH_SIZE=30*1024*1024; ALTER TABLE t1 ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (timestamp);'
root@mysql06.dc1:~# mysql --defaults-extra-file=/etc/mysql/user-root.cnf db1 -e 'SET SQL_LOG_BIN =0; ALTER TABLE t1 ENGINE=InnoDB TABLESPACE innodb_file_per_table;'
root@mysql06.dc1:~# mysql --defaults-extra-file=/etc/mysql/user-root.cnf db1 -e 'SET SQL_LOG_BIN = 0; SET NDB_BATCH_SIZE=30*1024*1024; ALTER TABLE t1 ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (product);'
ERROR 1297 (HY000) at line 1: Got temporary error 286 'Node failure caused abort of transaction' from NDBCLUSTER
root@mysql06.dc1:~#
---

Log of the node having the failure
...
2017-10-17 11:12:37 [ndbd] WARNING  -- thr: 2: Overslept 4894 ms, expected ~10ms
2017-10-17 11:12:49 [ndbd] WARNING  -- thr: 2: Overslept 11569 ms, expected ~10ms
2017-10-17 11:13:06 [ndbd] WARNING  -- thr: 2: Overslept 17262 ms, expected ~10ms
2017-10-17 11:13:08 [ndbd] WARNING  -- thr: 2: Overslept 2235 ms, expected ~10ms
2017-10-17 11:13:21 [ndbd] WARNING  -- thr: 2: Overslept 1717 ms, expected ~10ms
2017-10-17 11:13:27 [ndbd] WARNING  -- thr: 2: Overslept 1551 ms, expected ~10ms
2017-10-17 11:14:06 [ndbd] INFO     -- /export/home/pb2/build/sb_0-23963488-1498206279.11/mysql-cluster-gpl-7.5.7/storage/ndb/src/kernel/blocks/dbdih/DbdihMain.cpp
2017-10-17 11:14:06 [ndbd] INFO     -- DBDIH (Line: 21676) 0x00000000 Check lcpId == ((Sysfile *)&sysfileData[0])->latestLCP_ID failed
2017-10-17 11:14:06 [ndbd] INFO     -- Error handler restarting system
2017-10-17 11:14:06 [ndbd] INFO     -- Error handler shutdown completed - exiting
2017-10-17 11:14:07 [ndbd] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
...
[17 Oct 2017 16:29] Bogdan Kecman
Hi,

It's not related to the other bug, the "Overslept" thing in logs show high cpu usage that can be related to "anything" really so that's not the clue here I'd say.

As for your ability to reproduce this on a test system, can you elaborate bit more on that? I can't reproduce this so this must have something to do with your data or table structure. Any chance you can share the table content, data snippet that's enough for you to crash it on test setup? Everything I try on a test setup with empty / randomly filled tables does not reproduce the problem.

best regards
Bogdan
[17 Oct 2017 21:11] Thomas Waibel-BGo
Hi Bogdan,

I am pretty sure it is triggered by the data in that table.

Table is available here:
https://data.boerse-go.de/ohlc_74_60.sql.lzop

I left the table details.
database db1 = instruments
tablename t1 = ohlc_74_60
column product = instrumentsId
column timestamp = ts

Table is on InnoDB

Command used to convert to ndbcluster:
mysql -p instruments -e 'SET SQL_LOG_BIN = 0; SET NDB_BATCH_SIZE=30*1024*1024; ALTER TABLE 'ohlc_74_60' ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (instrumentId);'

That command should crash one node.

Best regards and thanks for having a look into that...
[19 Oct 2017 17:06] Bogdan Kecman
Hi,

I don't think this is a bug.

mysql> ALTER TABLE `ohlc_74_60` ENGINE=ndbcluster TABLESPACE ts1 STORAGE DISK PARTITION BY KEY (instrumentId);
Query OK, 4727627 rows affected (1 min 59.38 sec)
Records: 4727627 Duplicates: 0  Warnings: 0

So if I leave default NDB_BATCH_SIZE the alter finishes in 2 minutes.

If I change NDB_BATCH_SIZE according to your example, it's unable to finish on my test system after more then 30 hours, but there is no crash. This is just poor configuration for my test system. Finding proper configuration for your system is not that easy and that's why we have ready consulting and support team to help you out :) but is not a bug. Looking at your logs your system is overloaded when you do this meaning your configuration is improperly scaled.

Best regards
Bogdan
[19 Oct 2017 17:47] Thomas Waibel-BGo
Hi Bogdan,

yes, with ndb-batch-size  set to 32K it works here as well:

root@mysql06.dc1:~# mysql --defaults-extra-file=/etc/mysql/user-root.cnf instruments -e 'ALTER TABLE ohlc_74_60 ENGINE=InnoDB PARTITION BY KEY (instrumentId);'
root@mysql06.dc1:~# mysql --defaults-extra-file=/etc/mysql/user-root.cnf instruments -e 'SET NDB_BATCH_SIZE=32*1024; ALTER TABLE 'ohlc_74_60' ENGINE=ndbcluster TABLESPACE TS1 STORAGE DISK PARTITION BY KEY (instrumentId);'
root@mysql06.dc1:~#

Takes longer though than with the other tables with around the same size...