Bug #68824 alter online table ... reorganize partition crashes NDB data node
Submitted: 1 Apr 2013 14:20 Modified: 18 Apr 2016 13:38
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 5.5.29-7.2.10 OS: Linux (Centos 5.6)
Assigned to: MySQL Verification Team CPU Architecture:Any
Tags: alter online table, Data Node crash, MySQL Cluster, ndbmtd

[1 Apr 2013 14:20] Patrick Zoblisein
2 management nodes
2 sql nodes
6 ndbmtd data nodes

The cluster started out with 4 data nodes. Two more data nodes have since been added, and the tables now need to be repartitioned.

`alter online table ... reorganize partition` is causing random ndbmtd data node failures on a system with sysbench writing to it.

The sysbench command below works fine when no `alter online table` is in progress.
`alter online table` works fine when no sysbench is in progress.

Failed node error log:
sendbufferpool waiting for lock, contentions: 9200 spins: 2060643
send lock node 19 waiting for lock, contentions: 3400 spins: 3137584
jbalock thr: 0 waiting for lock, contentions: 85000 spins: 12297789
2013-04-01 12:53:22 [ndbd] INFO     -- /pb2/build/sb_0-7932439-1355951739.99/mysql-cluster-gpl-7.2.10/storage/ndb/src/kernel/blocks/trix/Trix.cpp
2013-04-01 12:53:22 [ndbd] INFO     -- TRIX (Line: 766) 0x00000002
2013-04-01 12:53:22 [ndbd] INFO     -- Error handler shutting down system
2013-04-01 12:53:22 [ndbd] INFO     -- Error handler shutdown completed - exiting
2013-04-01 12:53:27 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
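
For triage, the kernel block and source line of the failed ndbrequire can be pulled out of a log like the one above. This is only a rough sketch: `ndbrequire_location` is a hypothetical helper name, and it assumes the "BLOCK (Line: N)" format seen in this log.

```shell
# Hypothetical helper: read an ndbd log on stdin and print "BLOCK LINE"
# for every line shaped like "... -- TRIX (Line: 766) 0x00000002".
# Assumes the block name is upper case, as in the log above.
ndbrequire_location() {
    sed -n 's/.*-- \([A-Z][A-Z0-9]*\) (Line: \([0-9][0-9]*\)).*/\1 \2/p'
}

# Example use (the log file name is an assumption, adjust to your node):
# ndbrequire_location < ndb_4_out.log
```

Fed the log lines above, this would report TRIX and line 766, matching the Trix.cpp path printed just before.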

(# sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=1000000 --oltp-reconnect-mode=query --oltp-tables-count=10 --db-driver=mysql --mysql-user=user  --mysql-password=user --mysql-host= --mysql-port=3306 --mysql-db=test  --mysql-table-engine=ndbcluster --max-requests=300000 --num-threads=5 run
sysbench 0.5:  multi-threaded system evaluation benchmark

Running the test with following options:
Number of threads: 5
Random number generator seed is 0 and will be ignored

Threads started!

ALERT: failed to execute MySQL query: `SELECT DISTINCT c FROM sbtest10 WHERE id BETWEEN 502083 AND 502083+99 ORDER BY c`:
ALERT: Error 1297 Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER
FATAL: failed to execute function `event': (null)
ALERT: failed to execute MySQL query: `SELECT c FROM sbtest9 WHERE id BETWEEN 498825 AND 498825+99 ORDER BY c`:
ALERT: Error 1297 Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER
FATAL: failed to execute function `event': (null)
WARNING: mysql_store_result() failed with error: (1205) Lock wait timeout exceeded; try restarting transaction)

How to repeat:
Start with a 4-node cluster.
Load sysbench data (10 tables, 50 million rows each, disk storage) into the cluster.
Add two additional data nodes for a total of 6.
Repartition the data via `alter online table ... reorganize partition`
If the system is idle, the repartition will be successful.

To crash a node, start a long-running sysbench:
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=1000000 --oltp-reconnect-mode=query --oltp-tables-count=10 --db-driver=mysql --mysql-user=user  --mysql-password=user --mysql-host= --mysql-port=3306 --mysql-db=test  --mysql-table-engine=ndbcluster --max-requests=300000 --num-threads=5 run

Then, while it is running, perform an `alter online table ... reorganize partition`

A node shutdown will happen at some point, but not after a consistent, repeatable interval.
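
The repartition step can be scripted roughly as follows. This is a minimal sketch, not the exact commands used: `gen_reorg_sql` is a hypothetical helper, the table names sbtest1..sbtest10 are assumed from the sysbench output above, and the mysql connection options in the usage comment are assumptions.

```shell
# Hypothetical helper: print one REORGANIZE PARTITION statement per
# sysbench table. Assumes tables are named sbtest1..sbtestN.
gen_reorg_sql() {
    count="${1:-10}"    # number of sysbench tables, default 10
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'ALTER ONLINE TABLE sbtest%d REORGANIZE PARTITION;\n' "$i"
        i=$((i + 1))
    done
}

# Usage while sysbench is running (connection options are assumptions):
# gen_reorg_sql 10 | mysql --user=user --password=user --database=test
```

Running the statements one table at a time keeps the reorganization going for longer under load, which may make the intermittent failure easier to hit.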
[1 Apr 2013 14:22] Patrick Zoblisein
ndb_error_report archive with trace files.

[18 Apr 2016 13:41] MySQL Verification Team

Pushed into mysql-5.1-telco-7.1 5.1.73-ndb-7.1.34 
Pushed into mysql-5.5-cluster-7.2 5.5.40-ndb-7.2.19 
Pushed into mysql-5.6-cluster-7.3 5.6.21-ndb-7.3.8 

Documented fix as follows in the NDB 7.1.34, 7.2.19, and 7.3.8 changelogs:
Online reorganization when using ndbmtd data nodes and with binary
logging by mysqld enabled could sometimes lead to failures in the TRIX
and DBLQH kernel blocks, or to silent data corruption.
See also BUG#19912988.