MySQL Bugs: #30555: MaxNoOfOpenFiles Exceeded Then Node Shutsdown

Bug #30555	MaxNoOfOpenFiles Exceeded Then Node Shutsdown
Submitted:	22 Aug 2007 6:11	Modified:	19 Feb 2009 13:13
Reporter:	Bradley Falzon	Email Updates:
Status:	Won't fix	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	5.0.45	OS:	Linux (WhiteBox 4 2.6.9-5.ELsmp)
Assigned to:	Jonas Oreland	CPU Architecture:	Any

Description:
Single MySQL NDBD Nodes dies with the following error:

DataDir as specified in config.ini: /data/mysql-cluster/ndbd/:
Time: Wednesday 22 August 2007 - 14:49:36
Status: Permanent error, external action needed
Message: Max number of open files exceeded, please increase MaxNoOfOpenFiles (Resource configuration error)
Error: 2806
Error data: 
Error object:  Ndbfs::createAsyncFile
Program: ndbd
Pid: 21651
Trace: /data/mysql-cluster/ndbd/ndb_17_trace.log.9
Version: Version 5.0.45
***EOM***

How to repeat:
No unusual occurances at the time (average cluster usage) no extra extra logging.

Suggested fix:
Unknown

Uploaded trace file via ftp (ftp.mysql.com/pub/mysql/upload)

Filename: bug-30555.tar.gz

could you also provide the management nodes ndb_1_cluster.log
file so that we have a chance to check what was going on
at about the time when you got this error?

A different node crashed this morning, apart of the same nodegroup. See attached files. Due the the higher chance of occuring again, I have adjusted CLUSTERLOG STATISTICS from 7 to 15, hopefully will provide some extra details. Please advise if this is not required, so I can take back to 7.

Also, I'll try and provide more information at a later stage, however, comparing the two nodes CPU Usage to the other nodes in the other node groups - the other nodes have relatively normal CPU Usage, a constant 20% System usage. However, the two nodes that have so far crashed have CPU spiking to 60-80% System constantly. Monitoring `top` seems to show only ndbd using any kind of CPU.

Another node has crashed again, unfortunately, I temporarily switch CLUSTERLOG STATISTICS=7 early this morning after some maintenance, so wasnt able to catch anything new. Files attached.

Another node has crashed over the weekend with the same error:

Status: Permanent error, external action needed
Message: Max number of open files exceeded, please increase MaxNoOfOpenFiles (Re
source configuration error)
Error: 2806
Error data:
Error object:  Ndbfs::createAsyncFile

I have attached relevant log files, and trace files. The log files have also been created with CLUSTERLOG STATISTICS=15 as opposed to the previous 7.

Thanks

Node 17 has decided to crash again last night. This time, it crash right after starting a backup (/usr/local/mysql/bin/ndb_mgm -e 'start backup') at 23:40

Had another crash on a different host today.

I have included all the relevant crash files along with disk usage graphs.

Graph happened at exactly the same time a backup job runs which reads the local disk. 

This is similar to previous bugs which have already been patched ( http://bugs.mysql.com/bug.php?id=20904 ), and if you have a look at the log files, a checkpoint is about to be ran at the same time as the back ups on the host.

I have increased the Severity to Serious and this has been happening for some time now and is causing transactions to fail as the nodes are dropping out.

I have reconfigured the backup jobs (rsync) to use a bwlimit of 10kb/s, if the job takes 5 minutes or 15 minutes - it doesn't really matter at this stage.

DBACC   000259 000282 006207 
NDBFS   000170 

--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 18, r.sigId: -1451972488 gsn: 261 "FSOPENREQ" prio: 0
s.bn: 248 "DBACC", s.proc: 18, s.sigId: -1451972489 length: 7 trace: 0 #sec: 0 f
ragInf: 0
 UserReference: H'00f80012, userPointer: H'00000000
 FileNumber[1-4]: H'0000017d H'0000000f H'00000002 H'010003ff
 FileFlags: H'00000301 Open write only, Create new file, Truncate existing file
--------------- Signal ----------------

./ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp
 167 void
 168 Ndbfs::execFSOPENREQ(Signal* signal)
 169 {
 170*  jamEntry();
 171   const FsOpenReq * const fsOpenReq = (FsOpenReq *)&signal->theData[0];
 172   const BlockReference userRef = fsOpenReq->userReference;
 173*  AsyncFile* file = getIdleFile();
 174   ndbrequire(file != NULL);
 175   ndbrequire(signal->getLength() == FsOpenReq::SignalLength)
 176   file->theFileName.set( userRef, fsOpenReq->fileNumber);
 177   file->reportTo(&theFromThreads);
 178 
 179   Request* request = theRequestPool->get();
 180   request->action = Request::open;
 181   request->error = 0;
 182   request->par.open.flags = fsOpenReq->fileFlags;
 183   request->set(userRef, fsOpenReq->userPointer, newId() );
 184   request->file = file;
 185   request->theTrace = signal->getTrace();
 186 
 187   ndbrequire(forward(file, request));
 188 }

./ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp
 594 AsyncFile*
 595 Ndbfs::getIdleFile(){
 596   AsyncFile* file;
 597   if (theIdleFiles.size() > 0){
 598     file = theIdleFiles[0];
 599     theIdleFiles.erase(0);
 600   } else {
 601*    file = createAsyncFile();
 602   }
 603   return file;
 604 }

./ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp
 568 AsyncFile*
 569 Ndbfs::createAsyncFile(){
 570 
 571   // Check limit of open files
 572*  if (theFiles.size() ==  m_maxFiles) {
 573     // Print info about all open files
 574     for (unsigned i = 0; i < theFiles.size(); i++){
 575       AsyncFile* file = theFiles[i];
 576       ndbout_c("%2d (0x%x): %s", i, file, file->isOpen()?"OPEN":"CLOSED");
 577     }
 578*     ERROR_SET(fatal, NDBD_EXIT_AFS_MAXOPEN,""," Ndbfs::createAsyncFile");
 579   }

It could just be that the NDBD process is nudging over teh MaxNoOfOpenFiles=60 setting. For 'some' reason.

Bradley,
Please gather information on the number of open files held by ndbd processes.

These might help, watch for line wrapping problems.

Please monitor all hosts involved, then also supply ndb_error_reporter output in full. If these files grow large, please only zip them do not truncate.

#!/bin/bash
while true; do
for ndbdpid in `ps -ef | grep ndbd | grep -v grep  | awk '{ print $2 }' `; do echo "`date` - PID: ${ndbdpid} Open Files: `ls -l /proc/$ndbdpid/fd/ | wc -l`">> /tmp/openfiles.txt; echo " " >> /tmp/openfiles.txt; done;
sleep 300;
echo "">>/tmp/openfiles.txt
done;

#!/bin/bash
while true; do
for ndbdpid in `ps -ef | grep ndbd | grep -v grep  | awk '{ print $2 }' `; do echo "`date` - ${ndbdpid}">> /tmp/fd.txt; ls -l /proc/$ndbdpid/fd/ >> /tmp/fd.txt; done;
sleep 300;
echo "">>/tmp/fd.txt
done;

Just as an update to this job, not in relation to Adam Dixon's request.

We have currently solved this problem by moving the backups during a time of low load.

Basically, we were coming across this bug when ndbd was running at 100% load (it's a dual xeon system, so the 100% load was only referring to a single cpu (as at the time of writing ndbd is "single threaded") - complete system load of 55%) and at the same time running separate rsync cron jobs to send the clusters backup - made via ndb_mgm start backup - to a remote backup host.

I was able to solve this problem by rescheduling the rsync jobs before the times of 100% load.

I am planning on, no commitment to date, using Adam Dixon's suggestion of while(1) { exec ("lsof > /tmp/lsof.txt"); sleep 1 } on 1 node within a node group to help developers fix and resolve this issue as it may affect other customers.

The default value of this parameter has in ndb-6.x been changed to 0,
meaning no upper limit.

Other than this, this is a won't fix