Description:
I set up a cluster of four machines, each with its own role:
one acts as the MGM node (192.168.0.29), two are storage nodes (192.168.0.27 and 192.168.0.28), and one is a MySQL node (192.168.0.26).
All of them run Fedora 4, and the cluster uses Version 5.0.7 (beta).
The config.ini file is:
[NDB_MGMD DEFAULT]
[MYSQLD DEFAULT]
[TCP DEFAULT]
[NDBD DEFAULT]
NoOfReplicas=2
DataDir=/var/lib/mysql-cluster
FileSystemPath=/var/lib/mysql-cluster
DataMemory=512M
IndexMemory=128M
NoOfFragmentLogFiles=300
MaxNoOfAttributes=10000
MaxNoOfTables=1024
MaxNoOfOrderedIndexes=1024
#MaxNoOfConcurrentOperations=250000
[NDB_MGMD]
hostname=192.168.0.29
DataDir=/var/lib/mysql-cluster
LogDestination=FILE:filename=cluster.log,maxsize=1000000,maxfiles=6
[NDBD]
hostname=192.168.0.27
[NDBD]
hostname=192.168.0.28
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
The my.cnf file is:
[MYSQLD] #Options for mysqld process:
ndbcluster #run NDB engine
ndb-connectstring=192.168.0.29 #location of MGM node
default-storage-engine=ndbcluster
default-character-set=utf8
max_connections=1024
[MYSQL_CLUSTER] #Options for ndbd process:
ndb-connectstring=192.168.0.29 #location of MGM node
The cluster starts up successfully and runs without problems.
But it crashed when I tested it with the MySQL test suite.
Here is the command I ran:
./test-create --fast --verbose --host='192.168.0.26' --user='test' --password='pass' --database=sq_test --log --tcpip
When the cluster crashed, I got this in cluster.log:
2005-07-06 10:38:47 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=5dbd0001ea02764a]
2005-07-06 10:47:44 [MgmSrvr] INFO -- Node 3: Data usage increased to 80%(13139 32K pages of total 16384)
2005-07-06 10:48:32 [MgmSrvr] INFO -- Node 3: Data usage increased to 90%(14811 32K pages of total 16384)
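The page counts in these two messages can be checked against the configuration. The following is a minimal Python sketch, assuming the "32K pages" in the log are 32 KiB each; the function name data_usage is only illustrative and is not part of any MySQL tool. Under that assumption, 16384 pages is 512 MiB, which matches DataMemory=512M.

```python
PAGE_SIZE_KIB = 32  # assumption: "32K pages" in cluster.log means 32 KiB per page


def data_usage(used_pages, total_pages, page_kib=PAGE_SIZE_KIB):
    """Return (used MiB, total MiB, percent used) for a data usage log line."""
    used_mib = used_pages * page_kib / 1024
    total_mib = total_pages * page_kib / 1024
    percent = 100.0 * used_pages / total_pages
    return used_mib, total_mib, percent


if __name__ == "__main__":
    # Page counts copied from the two log lines above.
    for used in (13139, 14811):
        used_mib, total_mib, pct = data_usage(used, 16384)
        print(f"{used_mib:.0f} MiB of {total_mib:.0f} MiB used ({pct:.1f}%)")
```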
I could not start the cluster after that.
I realized that the data memory was nearly full, so I modified config.ini:
DataMemory=768M
IndexMemory=200M
(Total physical memory is 1G.)
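A rough check of how much physical memory these settings leave on a data node is sketched below in Python. The helper names parse_mem and headroom_mib are hypothetical, and the sketch only subtracts DataMemory and IndexMemory from physical memory; ndbd also needs memory for its other buffers, and the operating system needs some as well, so the real headroom is smaller. Whether this is what prevents the restart is not established by the logs.

```python
def parse_mem(value):
    """Convert a config.ini size such as '768M' or '1G' to MiB."""
    units = {"K": 1 / 1024, "M": 1, "G": 1024}
    suffix = value[-1].upper()
    if suffix in units:
        return float(value[:-1]) * units[suffix]
    return float(value) / (1024 * 1024)  # no suffix: treat as bytes


def headroom_mib(data_memory, index_memory, physical):
    """MiB left after DataMemory and IndexMemory (a lower bound on ndbd's use)."""
    return parse_mem(physical) - parse_mem(data_memory) - parse_mem(index_memory)


if __name__ == "__main__":
    print(headroom_mib("512M", "128M", "1G"))  # original settings
    print(headroom_mib("768M", "200M", "1G"))  # settings after the change
```

With the new values only a few tens of MiB remain for everything else on the machine, which is worth ruling out before looking further.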
Then I restarted the cluster, but it did not work. I got this:
2005-07-06 12:46:59 [MgmSrvr] INFO -- NDB Cluster Management Server. Version 5.0.7 (beta)
2005-07-06 12:46:59 [MgmSrvr] INFO -- Id: 1, Command port: 1186
2005-07-06 12:49:09 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 reserved for ip 192.168.0.27, m_reserved_nodes 0000000000000006.
2005-07-06 12:49:09 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2005-07-06 12:49:10 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000002.
2005-07-06 12:49:43 [MgmSrvr] INFO -- Node 2: Start phase 1 completed
2005-07-06 12:50:43 [MgmSrvr] INFO -- Node 2: Start phase 2 completed (system restart)
2005-07-06 12:50:43 [MgmSrvr] INFO -- Node 2: Start phase 3 completed (system restart)
2005-07-06 12:50:43 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
In node 2's error log, ndb_2_error.log, I found this:
Date/Time: Wednesday 6 July 2005 - 12:50:43
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 11757) 0x0000000a
ProgramName: ndbd
ProcessID: 3482
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.13
Version 5.0.7 (beta)
***EOM***
and this on the other node:
Date/Time: Wednesday 6 July 2005 - 12:27:26
Type of error: error
Message: Job buffer congestion
Fault ID: 2334
Problem data: Job Buffer Full
Object of reference: APZJobBuffer.C
ProgramName: ndbd
ProcessID: 3193
TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.10
Version 5.0.7 (beta)
***EOM***
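To compare error records across nodes, the error log can be split into fields. This is a minimal Python sketch, assuming each record consists of "Key: value" lines and ends with a ***EOM*** line, as in the two records above; parse_error_log is a hypothetical helper, not an NDB utility. It only reads the error log and does not interpret the trace file named in TraceFile.

```python
SAMPLE = """Date/Time: Wednesday 6 July 2005 - 12:50:43
Type of error: error
Message: Internal program error (failed ndbrequire)
Fault ID: 2341
Problem data: DbdihMain.cpp
Object of reference: DBDIH (Line: 11757) 0x0000000a
ProgramName: ndbd
ProcessID: 3482
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.13
Version 5.0.7 (beta)
***EOM***
"""


def parse_error_log(text):
    """Split ndb_N_error.log text into a list of dicts, one per record."""
    records, current = [], {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":
            # End-of-record marker: store what was collected so far.
            if current:
                records.append(current)
            current = {}
        elif ": " in line:
            key, value = line.split(": ", 1)
            current[key] = value
        # Lines without "Key: value" form (e.g. the version line) are skipped.
    return records


if __name__ == "__main__":
    for rec in parse_error_log(SAMPLE):
        print(rec["Fault ID"], rec["Message"], rec["TraceFile"])
```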
I simply want to know what I should do now.
Can the data be recovered? And how can I get useful information from the trace log?
Any ideas?
Thanks.
Best regards
How to repeat:
First configure default-storage-engine=ndbcluster. Then, in the sql-bench directory, run this command:
./test-create --fast --verbose --host='192.168.0.26' --user='test' --password='pass' --database=sq_test --log --tcpip
I increased the memory and tried again, but got the same error, and the cluster cannot start. The errors are:
On the MGM node:
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 2: Node 3: API version 5.0.7
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 3: Node 2: API version 5.0.7
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 3: Start phase 1 completed
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 3: Start phase 2 completed (system restart)
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 2: Start phase 2 completed (system restart)
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 3: Start phase 3 completed (system restart)
2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 2: Start phase 3 completed (system restart)
2005-07-06 17:44:38 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2005-07-06 17:44:39 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
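When the cluster log is long, the disconnects can be pulled out with a small filter. The Python sketch below assumes every entry has the form "date time [source] LEVEL -- message", as in the lines above; the regular expression and the function names are illustrative only.

```python
import re

# date time [source] LEVEL -- message
LINE_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[(\w+)\] (\w+)\s+-- (.*)$"
)

SAMPLE = """2005-07-06 17:39:16 [MgmSrvr] INFO -- Node 2: Start phase 3 completed (system restart)
2005-07-06 17:44:38 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2005-07-06 17:44:39 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
"""


def parse_cluster_log(text):
    """Return (timestamp, source, level, message) tuples for matching lines."""
    entries = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            entries.append(match.groups())
    return entries


def alerts(text):
    """Keep only the entries logged at ALERT level."""
    return [entry for entry in parse_cluster_log(text) if entry[2] == "ALERT"]


if __name__ == "__main__":
    for timestamp, _source, _level, message in alerts(SAMPLE):
        print(timestamp, message)
```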
On the data node:
Date/Time: Wednesday 6 July 2005 - 17:31:16
Type of error: error
Message: Job buffer congestion
Fault ID: 2334
Problem data: Job Buffer Full
Object of reference: APZJobBuffer.C
ProgramName: ndbd
ProcessID: 5810
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.14
Version 5.0.7 (beta)
***EOM***
On the SQL node:
[root@mysql_sqld sql-bench]# ./test-create --verbose --host='192.168.0.26' --user='test' --password='pass' --database=sq_test --log --tcpip
Testing server 'MySQL 5.0.7 beta max log' at 2005-07-06 17:09:17
Testing the speed of creating and dropping tables
Testing with 10000 tables and 10000 loop count
Testing create of tables
Can't execute command 'create table bench_1024 (i int NOT NULL,d double,f float,s char(10),v varchar(100),primary key (i))'
Error: Can't create table './sq_test/bench_1024.frm' (errno: 904)
[root@mysql_sqld sql-bench]# perror --ndb 904;
OS error code 904: Out of fragment records (increase MaxNoOfOrderedIndexes): Permanent error: Insufficient space
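The test asks for 10000 tables, while config.ini sets MaxNoOfTables=1024, MaxNoOfOrderedIndexes=1024 and MaxNoOfAttributes=10000, and the failure happens at bench_1024. The Python sketch below compares the test's requirements with those limits. It is a lower bound under stated assumptions: one table object and one ordered index (for the primary key) per table, and five columns per table as in the CREATE TABLE statement above. The function name exceeded_limits is hypothetical.

```python
# Limits copied from the [NDBD DEFAULT] section of config.ini.
CONFIG = {
    "MaxNoOfTables": 1024,
    "MaxNoOfOrderedIndexes": 1024,
    "MaxNoOfAttributes": 10000,
}


def exceeded_limits(n_tables, columns_per_table, config=CONFIG):
    """Return {parameter: (needed, configured)} for every limit exceeded."""
    needed = {
        "MaxNoOfTables": n_tables,
        # Assumption: each primary key adds one ordered index.
        "MaxNoOfOrderedIndexes": n_tables,
        "MaxNoOfAttributes": n_tables * columns_per_table,
    }
    return {
        name: (needed[name], config[name])
        for name in needed
        if needed[name] > config[name]
    }


if __name__ == "__main__":
    # 10000 tables of 5 columns each (i, d, f, s, v), as in the test run.
    for name, (need, have) in exceeded_limits(10000, 5).items():
        print(f"{name}: need at least {need}, configured {have}")
```

By this estimate the test exceeds all three limits, not only MaxNoOfOrderedIndexes, so raising a single parameter would not be enough for the full run.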
When I increased MaxNoOfOrderedIndexes, the cluster could not start any more.