MySQL Bugs: #66337: Cluster crashes Error 2341 (still exists in 7.2.7)

Bug #66337	Cluster crashes Error 2341 (still exists in 7.2.7)
Submitted:	12 Aug 2012 16:31	Modified:	18 Apr 2014 16:25
Reporter:	Stefan Auweiler	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	gpl-7.2.7	OS:	Solaris (5.10 Generic_147441-15 i86pc i386 i86pc)
Assigned to:		CPU Architecture:	Any
Tags:	cluster crash, error 2341

Description:
mysql-cluster-gpl-7.2.7-solaris10-x86_64

Hello, 
the Error 2341 is reported to be fixed in 7.2.7, but it has not... at least for me.

Had have it several time this week in 7.2.6.
After reading all information, decided to upgrade to 7.2.7 (because it was reported to be fixed there)

On Thursday, I even did a verry verry limited setup:

1 NodeGroup with one Datanode (NoOfReplicas = 1)
1 MGM Node
1 MySQL Nodes (...connected, configured about 200)

Verry plain schema (a RADIUS Server):

CREATE TABLE `sessions` (
`RadAcctId` bigint(21) unsigned NOT NULL AUTO_INCREMENT,
`AcctUniqueId` char(32) NOT NULL,
`AcctStatusType` tinyint(3) unsigned NOT NULL DEFAULT '1',
`AcctSessionId` char(32) NOT NULL,
`VolumeBucket` char(15) NOT NULL DEFAULT 'common',
`UserName` char(60) NOT NULL DEFAULT '',
`ContractInfo` char(250) NOT NULL,
`CallingStationId` bigint(20) unsigned NOT NULL DEFAULT '0',
`CalledStationId` char(30) NOT NULL DEFAULT '',
`DeviceTAC` char(8) NOT NULL DEFAULT '',
`mccmnc` char(8) NOT NULL,
`PacketSrcIpAddress` char(15) NOT NULL DEFAULT '',
`NASIPAddress` char(15) NOT NULL DEFAULT '',
`FramedIPAddress` char(15) NOT NULL DEFAULT '',
`AcctInputOctets` bigint(20) unsigned NOT NULL DEFAULT '0',
`AcctOutputOctets` bigint(20) unsigned NOT NULL DEFAULT '0',
`n_id` bigint(21) unsigned DEFAULT NULL,
`n_Volume` bigint(20) unsigned NOT NULL DEFAULT '0',
`n_Time` mediumint(8) unsigned NOT NULL,
`voldbchecked` smallint(6) unsigned NOT NULL DEFAULT '1',
`claimed_by` varchar(15) NOT NULL,
`claimed_at` int(11) unsigned NOT NULL DEFAULT '0',
`AcctStartTime` int(11) unsigned NOT NULL DEFAULT '0',
`AcctUpdateTime` int(11) unsigned NOT NULL DEFAULT '0',
`AcctStopTime` int(11) unsigned NOT NULL DEFAULT '0',
`AcctTerminateCause` char(32) NOT NULL DEFAULT '',
PRIMARY KEY (`CallingStationId`,`AcctUniqueId`),
UNIQUE KEY `RadAcctId` (`RadAcctId`),
KEY `CallingStationId` (`CallingStationId`),
KEY `voldbchecked` (`voldbchecked`),
KEY `VolumeBucket` (`VolumeBucket`),
KEY `AcctStatusType` (`AcctStatusType`),
KEY `AcctStartTime` (`AcctStartTime`),
KEY `claimed_by` (`claimed_by`)
) ENGINE=ndbcluster AUTO_INCREMENT=0 DEFAULT CHARSET=latin1;

The Load is one (1!) Insert/s plus 2-3 Updates/s, but I share Arlo Gilbert's observation for the high load situation: we loaded the fully configured cluster (10 Datanodes, 10 API Nodes with 1200 Inserts/s, about 9000 Updates/s, having 239 Million Sets after a few days, where we used up the complete configured Memory... without problems.

After a few hours with 1/s:
ndb_mgm> Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

If there is no load at all, the cluster has no problem.

As this will be the final configuration, we tried with and without logging (SET @@ndb_table_no_logging = 1;)

We are running mysql-cluster-gpl-7.1.14-solaris10-x86_64.tar.gz on our current producktion system, and wanted to upgrade to 7.2 for the comfort of having loadbalanced tracsaction coordination (seems to be single thraeded in 7.1)
7.1.14 is stable (from this point of view) even if the Cluster Team stated, that the 2341 Buck is there since 7.0 ...

Will there be a solution?
Any ideas?

To be mentioned:
- We are connecting mysqld local explicitely using the sock file
- we have no replication settings configured in mysqld ... even no replication id

Our Files:

###############################
My config.ini

[TCP DEFAULT]
SendBufferMemory=8M
ReceiveBufferMemory=8M

[NDB_MGMD DEFAULT]
PortNumber=11861
Datadir=/DB/mysql/sessions

[NDB_MGMD]
NodeId=1
Hostname=dbmgr1
LogDestination=FILE:filename=ndb_1_cluster.log,maxsize=10000000,maxfiles=6
ArbitrationRank=1

#[NDB_MGMD]
#NodeId=2
#Hostname=dbmgr2
#LogDestination=FILE:filename=ndb_2_cluster.log,maxsize=10000000,maxfiles=6
#ArbitrationRank=1

[NDBD DEFAULT]
NoOfReplicas=1
Datadir=/DB/mysql/sessions
FileSystemPathDD=/DB/mysql/sessions
BackupDataDir=/DB/mysql/backup/sessions
#FileSystemPathUndoFiles=/DB/mysql/sessions
#FileSystemPathDataFiles=/DB/mysql/sessions
DataMemory=40000M
IndexMemory=6000M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=100000
MaxNoOfConcurrentTransactions=16384

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
MaxNoOfTriggers=14336

### Params for REDO LOG
FragmentLogFileSize=256M
InitFragmentLogFiles=SPARSE
NoOfFragmentLogFiles=170
RedoBuffer=64M

TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

TimeBetweenEpochsTimeout=32000

### Params for LCP
DiskCheckpointSpeedInRestart=100M
DiskCheckpointSpeed=10M
TimeBetweenLocalCheckpoints=20

### Heartbeating
HeartbeatIntervalDbDb=1500
HeartbeatIntervalDbApi=1500

### Params for setting logging
MemReportFrequency=30
BackupReportFrequency=10
LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

### Params for BACKUP
BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M

### Params for ODIRECT
#Reports indicates that odirect=1 can cause io errors (os err code 5) on some systems. You must test.
ODirect=1

### Watchdog
TimeBetweenWatchdogCheckInitial=60000

### TransactionInactiveTimeout - should be enabled in Production
TransactionInactiveTimeout=60000
### REALTIME EXTENSIONS
#RealTimeScheduler=1
### REALTIME EXTENSIONS FOR 6.3 ONLY
#SchedulerExecutionTimer=80
#SchedulerSpinTimer=40

### DISK DATA
SharedGlobalMemory=20M
DiskPageBufferMemory=64M

### Multithreading
MaxNoOfExecutionThreads=8

### Increasing the LongMessageBuffer b/c of a bug (20090903)
LongMessageBuffer=32M

BatchSizePerLocalScan=512

[NDBD]
NodeId=3
Hostname=dbsession1

#[NDBD]
#NodeId=4
#Hostname=dbsession2

#[NDBD]
#NodeId=5
#Hostname=dbsession3

#[NDBD]
#NodeId=6
#Hostname=dbsession4

#[NDBD]
#NodeId=7
#Hostname=dbsession5

#[NDBD]
#NodeId=8
#Hostname=dbsession6

#[NDBD]
#NodeId=9
#Hostname=dbsession7

#[NDBD]
#NodeId=10
#Hostname=dbsession8

#[NDBD]
#NodeId=11
#Hostname=dbsession9

#[NDBD]
#NodeId=12
#Hostname=dbsession10

[MYSQLD DEFAULT]
BatchSize=512
#BatchByteSize=2048K
#MaxScanBatchSize=2048K

[MYSQLD]
NodeId=21
Hostname=dbapife1
[MYSQLD]
NodeId=22
Hostname=dbapife1

.... until

NodeId=220
Hostname=dbmgr2

###############################

My My.cnf Files

[MYSQLD]
user=mysql
basedir=/usr/local/mysqlbin
datadir=/DB/mysql/sessions
pid-file=mysqld.pid
socket=/var/lib/mysql/mysqlsessions.sock
port=33061
ndb-cluster-connection-pool=4
ndbcluster=1
ndb-connectstring="dbmgr1:11861,dbmgr2:11861"
ndb-force-send=1
ndb-use-exact-count=0
ndb-extra-logging=1
ndb-batch-size=31536000
ndb-autoincrement-prefetch-sz=1024
engine-condition-pushdown=1

default-storage-engine=myisam

#LOGS
log-error=/var/log/acctopus/mysql/sessions.error.log
#log
slow-query-log=1
slow-query-log-file=/var/log/acctopus/mysql/sessions.slow.log

key_buffer_size = 256M
max_allowed_packet = 16M
sort_buffer_size = 512K
read_buffer_size = 256K
read_rnd_buffer_size = 512K
#thread_cache_size=1024
myisam_sort_buffer_size = 8M
memlock=0
sysdate_is_now=1
max-connections=300
thread-cache-size=128
thread_concurrency = 16

query_cache_type = 0
query_cache_size = 0
table_open_cache=1024
lower-case-table-names=0

group_concat_max_len=10M

[MYSQL]
socket=/var/lib/mysql/mysqlsessions.sock
[client]
socket=/var/lib/mysql/mysqlsessions.sock

###############################

Will upload ndb_reporter output...

Thanks Stefan

How to repeat:
Nothing special. Start over, wait some hours ...

FTP uploaded erroro reporter files to this case

Thanks.
Stefan

Could this somehow be a duplicate of the regression bug#66104? Mostly a raw guess at this moment as they both affects the BACKUP block.....

Hello,

any news on this issue?
Thanks
Stefan

Stefan,

What output would you get from 

shell> ls -laR /DB/mysql/sessions

and

shell> df /DB/mysql/session

on your test system

/Gustaf

Gustaf,

as 7.2.7 is not reliable for us, we are back to 7.1.14 and 7.1.23, which works fine.
Same installation procedure, same Systems ...

I can only give you requested the ls from another 7.2.7 Cluster, which used to be a 4 node Cluster, has had the same issues as our session cluster and which currentlly runs unter 7.1.14 (for testing). As this cluster currently runs as a 2 node cluster, I'm able to get the ls from the currently stopped 'old' 7.2.7 nodes. This Cluster is part of the hole system; the configuration looks preety the same, than the sessions cluster... just less data nodes, but the same amount of API nodes...

I've attached a file.

The df comes from the 7.1.23 session cluster:
df /DB/mysql/sessions
/DB   (/dev/dsk/c0t3C8EA9F92C6856D5d0s6):1432563648 blocks 86974516 files

or

df -h /DB/mysql/sessions
Filesystem             size   used  avail capacity  Mounted on
/dev/dsk/c0t3C8EA9F92C6856D5d0s6
                       689G   6.3G   676G     1%    /DB

Do you have any idea yet, which requires me to reconfigure the system to 7.2.7?

Thanks
Stefan

Just saw, that it is still open...
we could fix it by setting 
ulimit -n 2000
prior to the start the ndbmtd

This lifted the value of max open files, which seemd default to 255 on Solaris 10 x86