Bug #11789 cluster crashed in benchmark test
Submitted: 7 Jul 2005 1:12
Modified: 17 Sep 2005 8:28
Reporter: power wang
Status: No Feedback
Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.0.7
OS: Linux (Fedora Core 4)
Assigned to: Assigned Account
CPU Architecture: Any

[7 Jul 2005 1:12] power wang
Description:
I set up a cluster environment composed of four machines, each with its own role: 
one MGM node (192.168.0.29), two storage nodes (192.168.0.27/28), and one MySQL node (192.168.0.26). 
Each machine runs Fedora Core 4, and the cluster uses Version 5.0.7 (beta). 

The config.ini config file is here: 
[NDB_MGMD DEFAULT] 
[MYSQLD DEFAULT] 
[TCP DEFAULT] 

[NDBD DEFAULT] 
NoOfReplicas=2 
DataDir=/var/lib/mysql-cluster 
FileSystemPath=/var/lib/mysql-cluster 
DataMemory=512M 
IndexMemory=128M 
NoOfFragmentLogFiles=300 
MaxNoOfAttributes=10000 
MaxNoOfTables=1024 
MaxNoOfOrderedIndexes=1024 
#MaxNoOfConcurrentOperations=250000 

[NDB_MGMD] 
hostname=192.168.0.29 
DataDir=/var/lib/mysql-cluster 
LogDestination=FILE:filename=cluster.log,maxsize=1000000,maxfiles=6 

[NDBD] 
hostname=192.168.0.27 

[NDBD] 
hostname=192.168.0.28 

[MYSQLD] 
[MYSQLD] 
[MYSQLD] 
[MYSQLD] 
[MYSQLD] 

The my.cnf file is: 
[MYSQLD] #Options for mysqld process: 
ndbcluster #run NDB engine 
ndb-connectstring=192.168.0.29 #location of MGM node 
default-storage-engine=ndbcluster
default-character-set=utf8
max_connections=1024

[MYSQL_CLUSTER] #Options for ndbd process: 
ndb-connectstring=192.168.0.29 #location of MGM node 

The cluster starts up successfully and runs calmly. 
But it crashed when I ran a test against it using the MySQL benchmark suite. 
Here is the command I ran: 
./test-create --fast --verbose --host='192.168.0.26' --user='test' --password='pass' --database=sq_test --log --tcpip 
When the cluster crashed, I got this info in cluster.log: 
2005-07-06 10:38:47 [MgmSrvr] INFO -- Node 3: Started arbitrator node 1 [ticket=5dbd0001ea02764a] 
2005-07-06 10:47:44 [MgmSrvr] INFO -- Node 3: Data usage increased to 80%(13139 32K pages of total 16384) 
2005-07-06 10:48:32 [MgmSrvr] INFO -- Node 3: Data usage increased to 90%(14811 32K pages of total 16384) 
I can't start the cluster now. 
I realized that the data memory was nearly full, so I modified config.ini: 
DataMemory=768M 
IndexMemory=200M 
(The total physical memory is 1G.) 
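As a sanity check on these numbers (my own sketch, not part of the original report): NDB allocates DataMemory in 32 KB pages, so the page totals in cluster.log follow directly from the DataMemory setting in config.ini.

```python
# NDB DataMemory is allocated in 32 KB pages, so the page totals
# reported in cluster.log follow directly from config.ini.
PAGE_KB = 32

def data_memory_pages(data_memory_mb: int) -> int:
    """Total 32K pages for a DataMemory setting given in MB."""
    return data_memory_mb * 1024 // PAGE_KB

# DataMemory=512M -> 16384 pages, matching "total 16384" in the log;
# the reported 14811 used pages is ~90% of that total.
print(data_memory_pages(512))  # 16384
# The increased setting DataMemory=768M would give 24576 pages.
print(data_memory_pages(768))  # 24576
```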
Then I restarted the cluster, but it does not work. I got this info: 
2005-07-06 12:46:59 [MgmSrvr] INFO -- NDB Cluster Management Server. Version 5.0.7 (beta) 
2005-07-06 12:46:59 [MgmSrvr] INFO -- Id: 1, Command port: 1186 
2005-07-06 12:49:09 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 reserved for ip 192.168.0.27, m_reserved_nodes 0000000000000006. 
2005-07-06 12:49:09 [MgmSrvr] INFO -- Node 1: Node 2 Connected 
2005-07-06 12:49:10 [MgmSrvr] INFO -- Mgmt server state: nodeid 2 freed, m_reserved_nodes 0000000000000002. 
2005-07-06 12:49:43 [MgmSrvr] INFO -- Node 2: Start phase 1 completed 
2005-07-06 12:50:43 [MgmSrvr] INFO -- Node 2: Start phase 2 completed (system restart) 
2005-07-06 12:50:43 [MgmSrvr] INFO -- Node 2: Start phase 3 completed (system restart) 
2005-07-06 12:50:43 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected 
In node 2's log file ndb_2_error.log, I found this info: 
Date/Time: Wednesday 6 July 2005 - 12:50:43 
Type of error: error 
Message: Internal program error (failed ndbrequire) 
Fault ID: 2341 
Problem data: DbdihMain.cpp 
Object of reference: DBDIH (Line: 11757) 0x0000000a 
ProgramName: ndbd 
ProcessID: 3482 
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.13 
Version 5.0.7 (beta) 
***EOM*** 
and this info in the other node's log: 
Date/Time: Wednesday 6 July 2005 - 12:27:26 
Type of error: error 
Message: Job buffer congestion 
Fault ID: 2334 
Problem data: Job Buffer Full 
Object of reference: APZJobBuffer.C 
ProgramName: ndbd 
ProcessID: 3193 
TraceFile: /var/lib/mysql-cluster/ndb_3_trace.log.10 
Version 5.0.7 (beta) 
***EOM*** 

I simply want to know what I should do now. 
Can the data be retrieved? And how can I get useful info from the trace log? 
Any ideas? 
Thanks. 
Best regards

How to repeat:
In the sql-bench directory, run this command: ./test-create --fast --verbose --host='192.168.0.26' --user='test' --password='pass' --database=sq_test --log --tcpip. First, configure default-storage-engine=ndbcluster. I increased the memory and tried again, got the same error, and the cluster cannot start. The errors were:
In mgm node:
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 2: Node 3: API version 5.0.7
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 3: Node 2: API version 5.0.7
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 3: Start phase 1 completed
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 3: Start phase 2 completed (system restart)
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 2: Start phase 2 completed (system restart)
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 3: Start phase 3 completed (system restart)
2005-07-06 17:39:16 [MgmSrvr] INFO     -- Node 2: Start phase 3 completed (system restart)
2005-07-06 17:44:38 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2005-07-06 17:44:39 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected

In data node:
Date/Time: Wednesday 6 July 2005 - 17:31:16
Type of error: error
Message: Job buffer congestion
Fault ID: 2334
Problem data: Job Buffer Full
Object of reference: APZJobBuffer.C
ProgramName: ndbd
ProcessID: 5810
TraceFile: /var/lib/mysql-cluster/ndb_2_trace.log.14
Version 5.0.7 (beta)
***EOM***

In sql node:
[root@mysql_sqld sql-bench]# ./test-create --verbose --host='192.168.0.26' --user='test' --password='pass' --database=sq_test --log --tcpip
Testing server 'MySQL 5.0.7 beta max log' at 2005-07-06 17:09:17

Testing the speed of creating and dropping tables
Testing with 10000 tables and 10000 loop count

Testing create of tables
Can't execute command 'create table bench_1024 (i int NOT NULL,d double,f float,s char(10),v varchar(100),primary key (i))'
Error: Can't create table './sq_test/bench_1024.frm' (errno: 904)
[root@mysql_sqld sql-bench]# perror --ndb 904;
OS error code 904:  Out of fragment records (increase MaxNoOfOrderedIndexes): Permanent error: Insufficient space
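For what it's worth (my own reading, not from the original report): errno 904 lines up with the config above. test-create builds tables bench_1, bench_2, … each with a primary key, and an NDB primary key normally carries an ordered index unless declared USING HASH, so MaxNoOfOrderedIndexes=1024 would be exhausted right around table bench_1024. A sketch of the arithmetic, under those assumptions:

```python
# Hypothetical arithmetic: assume each CREATE TABLE consumes one
# ordered-index slot (one primary key per bench table, none USING HASH).
MAX_NO_OF_ORDERED_INDEXES = 1024  # from the reporter's config.ini

def first_failing_table(limit: int, used_by_system: int = 0) -> int:
    """1-based index of the first CREATE TABLE expected to fail."""
    return limit - used_by_system + 1

# With a limit of 1024 and no other consumers, creation fails around
# the 1025th table -- consistent with the error on bench_1024.
print(first_failing_table(MAX_NO_OF_ORDERED_INDEXES))  # 1025
```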

When I increased MaxNoOfOrderedIndexes, the cluster could not start any more.
[7 Jul 2005 1:16] power wang
cluster.log

Attachment: cluster.log (text/plain), 106.62 KiB.

[7 Jul 2005 1:16] power wang
ndb_2_error.log

Attachment: ndb_2_error.log (text/plain), 6.30 KiB.

[7 Jul 2005 1:21] Jonathan Miller
I would set this to a show stopper, but it does not give me the option.
[7 Jul 2005 1:25] power wang
ndb_2_trace.log.15

Attachment: ndb_2_trace.log.rar (application/octet-stream, text), 36.54 KiB.

[8 Jul 2005 5:28] power wang
First, make sure the test runs on the cluster engine by setting default-storage-engine=ndbcluster.
Then run the benchmark test in the sql-bench directory with one of these commands: 

./run-all-tests --fast --verbose --host='your host' --user='test'
--password='pass' --database=sq_test --log --tcpip 
or 
./test-create --fast --verbose --host='your host' --user='test'
--password='pass' --database=sq_test --log --tcpip
or 
./test-insert --fast --verbose --host='your host' --user='test'
--password='pass' --database=sq_test --log --tcpip

The cluster reports that data memory usage increased to 90%, then the test terminates.
After that I cannot start the cluster.
[12 Jul 2005 3:52] Jorge del Conde
I was able to reproduce this bug using 5.0 from bk and FC4 w/all updates & patches installed.
[11 Aug 2005 15:14] Martin Skold
This bug cannot be verified by the cluster team; please add a verification procedure.
[17 Aug 2005 8:28] Jonas Oreland
Hi,

I tried to repeat this... I'm not really sure what "this" is... but this is what I tried:
1) started initial cluster (2 nodes)
2) run test-create --fast
after a while it gave me out of datapages (881)
3) then I stopped cluster
4) added more memory and restarted cluster

Everything went fine.
What am I missing?

/Jonas
[17 Sep 2005 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".