Description:
2 management nodes
2 sql nodes
6 ndbmtd data nodes
The cluster started out with 4 data nodes; two more data nodes have since been added, and the tables now need to be repartitioned.
`alter online table ... reorganize partition` causes ndbmtd data nodes to fail at random when sysbench is writing to the cluster at the same time.
sysbench command below works fine with no `alter online table` in progress.
`alter online table` works fine with no sysbench in progress.
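For reference, the statements involved can be generated as below. This is only a sketch: it assumes the tables use sysbench's default names sbtest1 through sbtest10 (sbtest9 and sbtest10 appear in the errors in this report, the rest are inferred), and the function name `reorg_statements` is made up for illustration.

```shell
#!/bin/sh
# Print one ALTER statement per sysbench table.
# Assumption: tables are named sbtest1..sbtestN (sysbench default naming).
reorg_statements() {
    count=${1:-10}
    i=1
    while [ "$i" -le "$count" ]; do
        printf 'ALTER ONLINE TABLE sbtest%d REORGANIZE PARTITION;\n' "$i"
        i=$((i + 1))
    done
}

# Emit the statements for the 10 tables used in this report.
reorg_statements 10
```

The output can be piped into the mysql client on one of the SQL nodes, for example `reorg_statements 10 | mysql --host=127.0.0.1 --port=3306 --user=user --password=user test`, using the same connection settings as the sysbench command.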
Error log from the failed node:
sendbufferpool waiting for lock, contentions: 9200 spins: 2060643
send lock node 19 waiting for lock, contentions: 3400 spins: 3137584
reorg, ignore ZNOT_FOUND
reorg, ignore ZNOT_FOUND
reorg, ignore ZNOT_FOUND
jbalock thr: 0 waiting for lock, contentions: 85000 spins: 12297789
reorg, ignore ZNOT_FOUND
reorg, ignore ZNOT_FOUND
reorg, ignore ZNOT_FOUND
2013-04-01 12:53:22 [ndbd] INFO -- /pb2/build/sb_0-7932439-1355951739.99/mysql-cluster-gpl-7.2.10/storage/ndb/src/kernel/blocks/trix/Trix.cpp
2013-04-01 12:53:22 [ndbd] INFO -- TRIX (Line: 766) 0x00000002
2013-04-01 12:53:22 [ndbd] INFO -- Error handler shutting down system
2013-04-01 12:53:22 [ndbd] INFO -- Error handler shutdown completed - exiting
2013-04-01 12:53:27 [ndbd] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
SYSBENCH error:
(# sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=1000000 --oltp-reconnect-mode=query --oltp-tables-count=10 --db-driver=mysql --mysql-user=user --mysql-password=user --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test --mysql-table-engine=ndbcluster --max-requests=300000 --num-threads=5 run
sysbench 0.5: multi-threaded system evaluation benchmark
Running the test with following options:
Number of threads: 5
Random number generator seed is 0 and will be ignored
Threads started!
ALERT: failed to execute MySQL query: `SELECT DISTINCT c FROM sbtest10 WHERE id BETWEEN 502083 AND 502083+99 ORDER BY c`:
ALERT: Error 1297 Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER
FATAL: failed to execute function `event': (null)
ALERT: failed to execute MySQL query: `SELECT c FROM sbtest9 WHERE id BETWEEN 498825 AND 498825+99 ORDER BY c`:
ALERT: Error 1297 Got temporary error 4028 'Node failure caused abort of transaction' from NDBCLUSTER
FATAL: failed to execute function `event': (null)
WARNING: mysql_store_result() failed with error: (1205) Lock wait timeout exceeded; try restarting transaction)
How to repeat:
Start with a cluster of 4 data nodes.
Load sysbench data (10 tables, 50 million rows each, disk storage) into the cluster.
Add two additional data nodes for a total of 6.
Repartition the data via `alter online table ... reorganize partition`.
If the system is idle, the repartition succeeds.
To crash a node, start a long-running sysbench:
sysbench --test=/usr/share/doc/sysbench/tests/db/oltp.lua --oltp-table-size=1000000 --oltp-reconnect-mode=query --oltp-tables-count=10 --db-driver=mysql --mysql-user=user --mysql-password=user --mysql-host=127.0.0.1 --mysql-port=3306 --mysql-db=test --mysql-table-engine=ndbcluster --max-requests=300000 --num-threads=5 run
Then run `alter online table ... reorganize partition` while sysbench is still running.
A node shutdown happens at some point; the time until failure is not consistent between runs.
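Because the shutdown does not occur at a predictable time, one way to notice it is to scan the log for the forced-shutdown ALERT line quoted in the description. The following is a minimal sketch, assuming log lines arrive on standard input in the same format as that ALERT line; the function name `forced_shutdowns` is hypothetical.

```shell
#!/bin/sh
# Read log lines on stdin and print "node=<id> error=<code>" for every
# "Forced node shutdown completed" message. Other lines are ignored.
forced_shutdowns() {
    sed -n 's/.*Node \([0-9][0-9]*\): Forced node shutdown completed\. Caused by error \([0-9][0-9]*\):.*/node=\1 error=\2/p'
}

# Sample input: the ALERT line from this report (message text shortened).
sample="2013-04-01 12:53:27 [ndbd] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)'"
printf '%s\n' "$sample" | forced_shutdowns
# prints: node=4 error=2341
```

The log file to feed into the function depends on the installation and is not given in this report.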