Bug #40070: Repeated INSERT/DELETE of rows in NDB disk data tables causes ndbd crash
Submitted: 16 Oct 2008 8:06  Modified: 29 Jan 2009 14:11
Reporter: Kenji Hirohama
Status: Duplicate
Category: MySQL Cluster: Disk Data  Severity: S2 (Serious)
Version: 6.2.15  OS: Solaris (10)
Assigned to:  CPU Architecture: Any

[16 Oct 2008 8:06] Kenji Hirohama
Description:
In our testing environment, ndbd goes down after our application keeps running updates for several hours.
The error message from ndb_2_error.log is as follows:
=====
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message, please report a bug)
Error: 2303
Error data: Node 2 killed this node because GCP stop was detected
Error object: NDBCNTR (Line: 235) 0x0000000e
Program: ndbd
Pid: 16496
Trace: /export/home/mysql-cluster//ndb_2_trace.log.1
Version: mysql-5.1.23 ndb-6.2.15
***EOM***
=====

How to repeat:
We can't repeat the problem with our application, so I tried to repeat it with the following procedure.

1. Create the logfile group, tablespace, and table:

create database error_test;

use error_test;

create logfile group lg_hiro add undofile 'undo_hiro.dat' engine ndb;

create tablespace ts_hiro add datafile 'data_hiro.dat' use logfile group lg_hiro engine ndb;

DROP TABLE IF EXISTS `error_test`.`t1`;
CREATE TABLE  `error_test`.`t1` (
  `c1` bigint NOT NULL auto_increment,
  `c2` tinyint(4) NOT NULL default 99,
  `c3` char(32) NOT NULL default 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCC',
  `c4` char(32) NOT NULL default 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCC',
  `c5` char(39) NOT NULL default 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCC',
  PRIMARY KEY (`c1`),
  KEY `Index_2` (`c2`)
) /*!50100 TABLESPACE ts_hiro STORAGE DISK */ ENGINE=ndbcluster DEFAULT CHARSET=utf8;

2. Repeat the following INSERT and DELETE statements:

insert into t1(c2,c3,c4) select c2, c3, c4 from t1;
delete from t1 limit 10000;

3. During the INSERT/DELETE activity, check the free space in the tablespace:

select * from Table_Space;
(The definition of this view is given at the end of this report.)

4. After some time, ndbd crashes:

mysql> delete from t1 limit 100000;
ERROR 1296 (HY000): Got error 157 'Unknown error code' from NDBCLUSTER

5. During the INSERT/DELETE activity, I see the following error message, but I ignore it:

mysql> insert into t1(c2,c3,c4) select c2, c3, c4 from t1;
Query OK, 65536 rows affected (25.55 sec)
Records: 65536  Duplicates: 0  Warnings: 0

mysql> insert into t1(c2,c3,c4) select c2, c3, c4 from t1;
ERROR 1297 (HY000): Got temporary error 233 'Out of operation records in transaction coordinator (increase MaxNoOfConcurrentOperations)' from NDBCLUSTER

The configuration files are as follows:
=====
[root@Solaris1 mysql-cluster]# cat config.ini
[TCP DEFAULT]
SendBufferMemory=2M
ReceiveBufferMemory=2M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=/export/home/mysql-cluster/

[NDB_MGMD]
Hostname=172.20.100.28

[NDBD DEFAULT]

MemReportFrequency=5

NoOfReplicas=2
Datadir=/export/home/mysql-cluster/
#DataMemory=10240M
#IndexMemory=1280M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=100000

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
DiskCheckpointSpeedInRestart=100M
#FragmentLogFileSize=256M
#NoOfFragmentLogFiles=120
RedoBuffer=32M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

### Disk data related
DiskPageBufferMemory=64M
SharedGlobalMemory=256M
[NDBD]
Hostname=172.20.100.28

[NDBD]
Hostname=172.20.100.28

[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28

[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28
[MYSQLD]
Hostname=172.20.100.28

=====
[root@Solaris1 mysql-cluster]# cat /etc/my.cnf
[MYSQLD]
innodb_buffer_pool_size = 2G
socket=/tmp/mysql.sock
user=mysql
basedir=/usr/local/mysql
datadir=/export/home/mysql/data
#ndb-cluster-connection-pool=10
ndbcluster
ndb-connectstring="172.20.100.28"
ndb-force-send=1
ndb-use-exact-count=0
ndb-extra-logging=1
ndb-autoincrement-prefetch-sz=256
engine-condition-pushdown=1

#REPLICATION SPECIFIC - GENERAL
#server-id must be unique across all mysql servers participating in replication.
#server-id=X
#REPLICATION SPECIFIC - MASTER
#log-bin
#REPLICATION SPECIFIC - SLAVE
#slave-allow-batching
#skip-slave-start

#OTHER THINGS, BUFFERS ETC
key_buffer = 256M
max_allowed_packet = 16M
sort_buffer_size = 512K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
thread_cache_size=1024
myisam_sort_buffer_size = 8M
init_connect='set autocommit=1'
#memlock
sysdate_is_now
max-connections=1050
thread-cache-size=64
skip_thread_priority
query-cache-type = 0
query-cache-size = 0
table-open_cache=1024
table-cache=512
lower-case-table-names=0
[root@Solaris1 mysql-cluster]#
=====

CREATE VIEW Table_Space AS
SELECT FILE_NAME,
       (TOTAL_EXTENTS * EXTENT_SIZE) / (1024*1024) AS 'Total MB',
       (FREE_EXTENTS * EXTENT_SIZE) / (1024*1024) AS 'Free MB',
       EXTRA
FROM INFORMATION_SCHEMA.FILES
WHERE ENGINE = 'NDBCLUSTER'
  AND FILE_TYPE = 'DATAFILE';
[16 Oct 2008 8:08] Jonas Oreland
Can you try increasing the size of DiskPageBufferMemory?
[17 Oct 2008 0:32] Kenji Hirohama
Do you have any insights from the attached stack traces?
[24 Oct 2008 3:50] Kenji Hirohama
I tried DiskPageBufferMemory=2G.
Still got the same error at the same point.

Then I tried 8G.
The situation was the same.

I tried setting 10G, but ndbd couldn't start.
(My box has 32GB of physical memory.)

Perhaps I need more memory, but buying more memory is not realistic,
so do you have any other ideas?

Thanks,
[6 Nov 2008 5:30] MySQL Verification Team
I confirmed that the problem happens on 6.3.18 as well.
[6 Nov 2008 8:40] MySQL Verification Team
I repeated the problem using the following config.ini:

------------------------------------------------------
[MGM]
Id=1
DataDir=/var/lib/telco-6.3/mgm
Hostname=127.0.0.1

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=128M
IndexMemory=16M
LockPagesInMainMemory=1

MaxNoOfTables=256
MaxNoOfOrderedIndexes=512
MaxNoOfUniqueHashIndexes=256
MaxNoOfAttributes=8192
MaxNoOfConcurrentOperations=50000
FragmentLogFileSize=128M
NoOfFragmentLogFiles=4
RedoBuffer=64M
ODirect=1
#TimeBetweenGlobalCheckpoints=10000

### Disk data related 
DiskPageBufferMemory=64M
SharedGlobalMemory=128M

[NDBD]
Id=41
Hostname=127.0.0.1
DataDir=/var/lib/telco-6.3/ndbd1

[NDBD]
Id=42
Hostname=127.0.0.1
DataDir=/var/lib/telco-6.3/ndbd2

[NDBD]
Id=43
Hostname=127.0.0.1
DataDir=/var/lib/telco-6.3/ndbd3

[NDBD]
Id=44
Hostname=127.0.0.1
DataDir=/var/lib/telco-6.3/ndbd4

[MYSQLD]
Hostname=127.0.0.1

[MYSQLD]
Hostname=127.0.0.1
[6 Nov 2008 8:48] MySQL Verification Team
I repeated the problem using the following script:

-------------------------------------------
#!/bin/bash

MYSQL="mysql -h 127.0.0.1 -P 20001 "
n=0

$MYSQL -e "DROP DATABASE disk_test"
$MYSQL -e "CREATE DATABASE disk_test"
$MYSQL disk_test -e "CREATE LOGFILE GROUP lg_1 ADD UNDOFILE 'undo_1.dat' INITIAL_SIZE 128M UNDO_BUFFER_SIZE 16M ENGINE NDB"
$MYSQL disk_test -e "CREATE TABLESPACE ts_1 ADD DATAFILE 'data_1.dat' USE LOGFILE GROUP lg_1 INITIAL_SIZE 256M ENGINE NDB"
$MYSQL disk_test -e "CREATE TABLE t1 (c1 bigint NOT NULL auto_increment, c2 tinyint(4) NOT NULL default 99, c3 char(32) NOT NULL default 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCC', c4 char(32) NOT NULL default 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCC', c5 char(39) NOT NULL default 'AAAAAAAAAABBBBBBBBBBCCCCCCCCCC', PRIMARY KEY (c1), KEY Index_2 (c2)) /*!50100 TABLESPACE ts_1 STORAGE DISK */ ENGINE=ndbcluster DEFAULT CHARSET=utf8"

mysql_sampledata_gen 127.0.0.1 20001 mikiya okuno disk_test t1 10000

while [ 1 ]
do
	loop=0
	while [ $loop -lt 20 ]
	do
		$MYSQL disk_test -e "insert into t1(c2,c3,c4) select c2, c3, c4 from t1 limit 10000"
		loop=`expr $loop + 1`
	done
	loop=0
	while [ $loop -lt 20 ]
	do
		$MYSQL disk_test -e "delete from t1 limit 10000"
		loop=`expr $loop + 1`
	done
	echo $n
	n=`expr $n + 1`
	if [ -f /tmp/stopit ]
	then
		exit 1
	fi
done
[6 Nov 2008 8:54] MySQL Verification Team
The problem happens randomly, and it is always the master node that crashes. You'll see a crash within 100 loops.
[6 Nov 2008 10:19] MySQL Verification Team
I did additional tests:

* With smaller transactions (LIMIT 1000), the error did not happen.
* With smaller transactions (LIMIT 1000) and wider columns (varchar(512) instead of char(32)), the error happened.
* With a larger DiskPageBufferMemory (640M), the error still happened.

So a possible workaround could be to use smaller transactions.
[6 Nov 2008 13:59] MySQL Verification Team
I did tests using a 64MB undo buffer, and the problem didn't happen.
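
One way to specify a 64MB undo buffer is the UNDO_BUFFER_SIZE option of CREATE LOGFILE GROUP; a minimal sketch, reusing the logfile group name and file sizes from the test script above (the exact values are illustrative, not the verified test configuration):

-------------------------------------------
CREATE LOGFILE GROUP lg_1
  ADD UNDOFILE 'undo_1.dat'
  INITIAL_SIZE 128M
  UNDO_BUFFER_SIZE 64M  -- 64M instead of the 16M used in the original script
  ENGINE NDB;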
[7 Nov 2008 10:19] MySQL Verification Team
Possible workarounds are:

* increase TimeBetweenEpochsTimeout
* ensure that DiskPageBufferMemory and the undo buffer are large enough
* commit more often

If a large transaction is run and the disk page buffer and undo buffer are exhausted, the commit takes a very long time, and the TimeBetweenEpochsTimeout limit is then hit.
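
As a rough sketch of the first two items (the values below are illustrative, not settings verified in this report): TimeBetweenEpochsTimeout and DiskPageBufferMemory go in the [NDBD DEFAULT] section of config.ini, while the undo buffer size is set when the logfile group is created (see the CREATE LOGFILE GROUP sketch above).

-------------------------------------------
[NDBD DEFAULT]
# Allow more time for an epoch to commit before a GCP stop is declared
# and the node is killed (milliseconds; illustrative value).
TimeBetweenEpochsTimeout=32000
# Give disk data pages more room in memory (illustrative value).
DiskPageBufferMemory=256M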
[29 Jan 2009 14:11] Jonathan Miller
Duplicate of Bug #37227: http://bugs.mysql.com/bug.php?id=37227