Description:
ndbd crashes and fails to start after extending disk data storage beyond 4.5 GB.
I wanted to extend the tablespace to accommodate data that will be migrated to the cluster from a MySQL server running on Windows; the database to be migrated is approximately 50 GB. Now, instead, ndbd won't start. :(
The setup is as follows:
* Two data nodes, a few in-memory tables, and a 4.5 GB tablespace (currently containing no data).
* The tablespace needs to be extended before loading it with data.
* The tablespace is extended with one file (this seems to work fine?):
mysql> ALTER TABLESPACE ts_1 ADD DATAFILE 'data_4.dat' INITIAL_SIZE 2047M ENGINE NDBCLUSTER;
Query OK, 0 rows affected (51,95 sec)
* Wait 20 minutes for the cluster to settle.
* Another extension is applied:
mysql> ALTER TABLESPACE ts_1 ADD DATAFILE 'data_5.dat' INITIAL_SIZE 2047M ENGINE NDBCLUSTER;
ERROR 1533 (HY000): Failed to alter: CREATE DATAFILE
* Here I should probably have run SHOW WARNINGS, but instead it went like this:
mysql> SELECT LOGFILE_GROUP_NAME, FILE_NAME, EXTRA, TABLESPACE_NAME, DATA_FREE, DATA_LENGTH FROM INFORMATION_SCHEMA.FILES;
Empty set, 1 warning (0,00 sec)
mysql> SELECT LOGFILE_GROUP_NAME, FILE_NAME, EXTRA, TABLESPACE_NAME, DATA_FREE, DATA_LENGTH FROM INFORMATION_SCHEMA.FILES;
Empty set, 1 warning (0,01 sec)
mysql> show warnings;
+-------+------+-------------------------------------------+
| Level | Code | Message |
+-------+------+-------------------------------------------+
| Error | 1296 | Got error 4009 'Cluster Failure' from NDB |
+-------+------+-------------------------------------------+
1 row in set (0,00 sec)
* I log out of the server and go into the management console; SHOW produces the following (IPs changed):
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 2 node(s)
id=2 (not connected, accepting connect from 1.2.3.4)
id=3 (not connected, accepting connect from 1.2.3.5)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @1.2.3.9 (Version: 5.1.40)
* I discover that the ndbd processes have died on both cluster servers.
* The new data files are there, however:
-rw-r--r-- 1 root mysql 268500992 30 nov 01.54 data_1.dat
-rw-r--r-- 1 root mysql 268500992 30 nov 01.54 data_2.dat
-rw-r--r-- 1 root mysql 2146533376 30 nov 02.30 data_3.dat
-rw-r--r-- 1 root mysql 2146533376 15 dec 12.16 data_4.dat
-rw-r--r-- 1 root mysql 2146533376 15 dec 12.48 data_5.dat
* I try to restart ndbd; it goes into the background, and ndb_mgm reports the following:
Node 3: Forced node shutdown completed. Occured during startphase 4. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
* This is from ndb2_error.log:
Time: Tuesday 15 December 2009 - 13:21:21
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 3567) 0x0000000a
Program: ndbd
Pid: 30532
Trace: /usr/local/mysql/var/mysql-cluster/ndb_2_trace.log.4
Version: Version 5.1.40
***EOM***
* The ndb_2_trace.log.4 file contains the following loop:
--------------- Signal ----------------
r.bn: 246 "DBDIH", r.proc: 2, r.sigId: 242674 gsn: 238 "DISEIZEREQ" prio: 1
s.bn: 245 "DBTC", s.proc: 2, s.sigId: 242673 length: 2 trace: 0 #sec: 0 fragInf: 0
H'0001098d H'00f50002
--------------- Signal ----------------
r.bn: 245 "DBTC", r.proc: 2, r.sigId: 242673 gsn: 236 "DISEIZECONF" prio: 1
s.bn: 246 "DBDIH", s.proc: 2, s.sigId: 242672 length: 2 trace: 0 #sec: 0 fragInf: 0
H'0001098c H'00008184
* The mysql1.err log shows the following:
091215 13:06:11 [ERROR] /usr/local/mysql/libexec/mysqld: Incorrect information in file: './database1/table1.frm'
091215 13:06:11 [ERROR] /usr/local/mysql/libexec/mysqld: Incorrect information in file: './database1/table2.frm'
And so on, for all tables and databases, even the ones that aren't supposed to use disk storage.
===============================
SHOW ENGINE NDB STATUS before extending the tablespace (IPs stripped, placeholders substituted):
mysql> SHOW ENGINE NDB STATUS;
| Type | Name | Status |
| ndbcluster | connection | cluster_node_id=4, connected_host=1.2.3.9, connected_port=1186, number_of_data_nodes=2, number_of_ready_data_nodes=2, connect_count=1 |
| ndbcluster | NdbTransaction | created=3, free=0, sizeof=212 |
| ndbcluster | NdbOperation | created=4, free=4, sizeof=660 |
| ndbcluster | NdbIndexScanOperation | created=1, free=1, sizeof=744 |
| ndbcluster | NdbIndexOperation | created=0, free=0, sizeof=664 |
| ndbcluster | NdbRecAttr | created=829, free=829, sizeof=60 |
| ndbcluster | NdbApiSignal | created=16, free=16, sizeof=136 |
| ndbcluster | NdbLabel | created=0, free=0, sizeof=196 |
| ndbcluster | NdbBranch | created=0, free=0, sizeof=24 |
| ndbcluster | NdbSubroutine | created=0, free=0, sizeof=68 |
| ndbcluster | NdbCall | created=0, free=0, sizeof=16 |
| ndbcluster | NdbBlob | created=1, free=1, sizeof=264 |
| ndbcluster | NdbReceiver | created=2, free=0, sizeof=68 |
| ndbcluster | binlog | latest_epoch=1111500, latest_trans_epoch=1111392, latest_received_binlog_epoch=0, latest_handled_binlog_epoch=0, latest_applied_binlog_epoch=0 |
Tablespace layout before extending:
mysql> SELECT LOGFILE_GROUP_NAME, FILE_NAME, EXTRA, TABLESPACE_NAME FROM INFORMATION_SCHEMA.FILES WHERE FILE_TYPE = 'DATAFILE';
+--------------------+------------+----------------+-----------------+
| LOGFILE_GROUP_NAME | FILE_NAME | EXTRA | TABLESPACE_NAME |
+--------------------+------------+----------------+-----------------+
| lg_1 | data_2.dat | CLUSTER_NODE=2 | ts_1 |
| lg_1 | data_2.dat | CLUSTER_NODE=3 | ts_1 |
| lg_1 | data_3.dat | CLUSTER_NODE=2 | ts_1 |
| lg_1 | data_3.dat | CLUSTER_NODE=3 | ts_1 |
| lg_1 | data_1.dat | CLUSTER_NODE=2 | ts_1 |
| lg_1 | data_1.dat | CLUSTER_NODE=3 | ts_1 |
+--------------------+------------+----------------+-----------------+
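A sanity check I could have run before each extension is to see how much of the existing tablespace is actually free. A possible query, using the standard TOTAL_EXTENTS, FREE_EXTENTS, and EXTENT_SIZE columns of INFORMATION_SCHEMA.FILES (not specific to this setup), would be:

```sql
-- Show total and free space per data file in ts_1, one row per data node.
SELECT FILE_NAME, EXTRA,
       TOTAL_EXTENTS * EXTENT_SIZE AS total_bytes,
       FREE_EXTENTS  * EXTENT_SIZE AS free_bytes
FROM INFORMATION_SCHEMA.FILES
WHERE TABLESPACE_NAME = 'ts_1' AND FILE_TYPE = 'DATAFILE';
```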
How to repeat:
Add two data files of roughly 2 GB each to an existing NDB tablespace in close succession.
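Condensed into SQL, the repeat case is two large ADD DATAFILE statements issued against an existing tablespace (object names as in the report above; assumes lg_1 and ts_1 already exist):

```sql
ALTER TABLESPACE ts_1 ADD DATAFILE 'data_4.dat' INITIAL_SIZE 2047M ENGINE NDBCLUSTER;
-- Issuing the second statement soon afterwards produced the failure here:
ALTER TABLESPACE ts_1 ADD DATAFILE 'data_5.dat' INITIAL_SIZE 2047M ENGINE NDBCLUSTER;
```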
Suggested fix:
If the first data file has not propagated properly, return an error message and suggest waiting instead of accepting the statement.
I did not expect the whole cluster to break.
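As a possible workaround until this is fixed, growing the tablespace in smaller increments, and verifying after each statement that the new file is visible on every data node before adding the next, might avoid the crash. This is an untested sketch; the file name is hypothetical:

```sql
ALTER TABLESPACE ts_1 ADD DATAFILE 'data_6.dat' INITIAL_SIZE 512M ENGINE NDBCLUSTER;
-- Expect one row per data node (CLUSTER_NODE=2 and CLUSTER_NODE=3)
-- before issuing the next ADD DATAFILE:
SELECT FILE_NAME, EXTRA FROM INFORMATION_SCHEMA.FILES
WHERE TABLESPACE_NAME = 'ts_1' AND FILE_NAME LIKE '%data_6%';
```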