Bug #30555 | MaxNoOfOpenFiles Exceeded, Then Node Shuts Down | |
---|---|---|---
Submitted: | 22 Aug 2007 6:11 | Modified: | 19 Feb 2009 13:13
Reporter: | Bradley Falzon | Email Updates: |
Status: | Won't fix | Impact on me: |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious)
Version: | 5.0.45 | OS: | Linux (WhiteBox 4 2.6.9-5.ELsmp)
Assigned to: | Jonas Oreland | CPU Architecture: | Any
[22 Aug 2007 6:11]
Bradley Falzon
[22 Aug 2007 6:19]
Bradley Falzon
Uploaded trace file via ftp (ftp.mysql.com/pub/mysql/upload) Filename: bug-30555.tar.gz
[22 Aug 2007 7:52]
Hartmut Holzgraefe
Could you also provide the management node's ndb_1_cluster.log file, so that we have a chance to check what was going on at about the time you got this error?
[24 Aug 2007 0:18]
Bradley Falzon
A different node crashed this morning, part of the same node group. See attached files. Due to the higher chance of it occurring again, I have adjusted CLUSTERLOG STATISTICS from 7 to 15, which hopefully will provide some extra detail. Please advise if this is not required, so I can take it back to 7. Also, I'll try to provide more information at a later stage. However, comparing the two crashed nodes' CPU usage to the nodes in the other node groups: the other nodes have relatively normal CPU usage, a constant 20% system usage, whereas the two nodes that have crashed so far have CPU constantly spiking to 60-80% system. Monitoring `top` seems to show only ndbd using any significant CPU.
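For reference, the log-level change described above can be made from the management client; a minimal sketch, assuming ndb_mgm is on the PATH and can reach the management node:

#!/bin/bash
# Raise the statistics log level on all nodes from 7 to 15
# (run the same command with =7 later to restore the default).
ndb_mgm -e 'ALL CLUSTERLOG STATISTICS=15'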
[4 Sep 2007 3:34]
Bradley Falzon
Another node has crashed again. Unfortunately, I temporarily switched CLUSTERLOG STATISTICS back to 7 early this morning after some maintenance, so I wasn't able to catch anything new. Files attached.
[10 Sep 2007 0:32]
Bradley Falzon
Another node has crashed over the weekend with the same error:

Status: Permanent error, external action needed
Message: Max number of open files exceeded, please increase MaxNoOfOpenFiles (Resource configuration error)
Error: 2806
Error data:
Error object: Ndbfs::createAsyncFile

I have attached the relevant log files and trace files. The log files were also created with CLUSTERLOG STATISTICS=15, as opposed to the previous 7. Thanks
[4 Oct 2007 0:17]
Bradley Falzon
Node 17 decided to crash again last night. This time, it crashed right after starting a backup (/usr/local/mysql/bin/ndb_mgm -e 'start backup') at 23:40.
[28 Nov 2007 22:16]
Bradley Falzon
Had another crash on a different host today. I have included all the relevant crash files along with disk usage graphs. The disk activity in the graphs spikes at exactly the same time a backup job runs, which reads the local disk. This is similar to previous bugs which have already been patched ( http://bugs.mysql.com/bug.php?id=20904 ), and if you have a look at the log files, a checkpoint is about to be run at the same time as the backups on the host. I have increased the severity to Serious, as this has been happening for some time now and is causing transactions to fail when the nodes drop out.
[28 Nov 2007 22:38]
Bradley Falzon
I have reconfigured the backup jobs (rsync) to use a bwlimit of 10 kb/s; whether the job takes 5 minutes or 15 minutes doesn't really matter at this stage.
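A hedged sketch of such a throttled rsync job (the paths and the host name backuphost are hypothetical; rsync's --bwlimit is in KB per second, so 10 matches the limit described above):

#!/bin/bash
# Throttled copy of the cluster backup directory to a remote host.
# --bwlimit=10 caps rsync at roughly 10 KB/s so the transfer cannot
# compete with ndbd for local disk I/O during checkpoints.
BACKUP_DIR=/var/lib/mysql-cluster/BACKUP
rsync -a --bwlimit=10 "$BACKUP_DIR/" backuphost:/backups/ndb/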
[22 Dec 2007 20:21]
Adam Dixon
DBACC 000259 000282 006207
NDBFS 000170

--------------- Signal ----------------
r.bn: 253 "NDBFS", r.proc: 18, r.sigId: -1451972488 gsn: 261 "FSOPENREQ" prio: 0
s.bn: 248 "DBACC", s.proc: 18, s.sigId: -1451972489 length: 7 trace: 0 #sec: 0 fragInf: 0
 UserReference: H'00f80012, userPointer: H'00000000
 FileNumber[1-4]: H'0000017d H'0000000f H'00000002 H'010003ff
 FileFlags: H'00000301 Open write only, Create new file, Truncate existing file
--------------- Signal ----------------

./ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp
167  void
168  Ndbfs::execFSOPENREQ(Signal* signal)
169  {
170*   jamEntry();
171    const FsOpenReq * const fsOpenReq = (FsOpenReq *)&signal->theData[0];
172    const BlockReference userRef = fsOpenReq->userReference;
173*   AsyncFile* file = getIdleFile();
174    ndbrequire(file != NULL);
175    ndbrequire(signal->getLength() == FsOpenReq::SignalLength)
176    file->theFileName.set( userRef, fsOpenReq->fileNumber);
177    file->reportTo(&theFromThreads);
178
179    Request* request = theRequestPool->get();
180    request->action = Request::open;
181    request->error = 0;
182    request->par.open.flags = fsOpenReq->fileFlags;
183    request->set(userRef, fsOpenReq->userPointer, newId() );
184    request->file = file;
185    request->theTrace = signal->getTrace();
186
187    ndbrequire(forward(file, request));
188  }

./ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp
594  AsyncFile*
595  Ndbfs::getIdleFile(){
596    AsyncFile* file;
597    if (theIdleFiles.size() > 0){
598      file = theIdleFiles[0];
599      theIdleFiles.erase(0);
600    } else {
601*     file = createAsyncFile();
602    }
603    return file;
604  }

./ndb/src/kernel/blocks/ndbfs/Ndbfs.cpp
568  AsyncFile*
569  Ndbfs::createAsyncFile(){
570
571    // Check limit of open files
572*   if (theFiles.size() == m_maxFiles) {
573      // Print info about all open files
574      for (unsigned i = 0; i < theFiles.size(); i++){
575        AsyncFile* file = theFiles[i];
576        ndbout_c("%2d (0x%x): %s", i, file, file->isOpen()?"OPEN":"CLOSED");
577      }
578*     ERROR_SET(fatal, NDBD_EXIT_AFS_MAXOPEN,""," Ndbfs::createAsyncFile");
579    }

It could just be that the NDBD process is nudging over the MaxNoOfOpenFiles=60 setting, for some reason.
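If the file pool really is being exhausted, the configured cap on each data node can be confirmed with ndb_config; a minimal sketch, assuming the management node host is mgmhost (hypothetical) and that the ndb_config in this release supports querying this parameter:

#!/bin/bash
# Print "nodeid: MaxNoOfOpenFiles" for every data node, one per line.
ndb_config --ndb-connectstring=mgmhost \
           --type=ndbd \
           --query=nodeid,MaxNoOfOpenFiles \
           --fields=': ' --rows='\n'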
[22 Dec 2007 20:24]
Adam Dixon
Bradley, please gather information on the number of open files held by the ndbd processes. The two scripts below might help; watch for line-wrapping problems. Please monitor all hosts involved, and also supply the ndb_error_reporter output in full. If these files grow large, please only zip them, do not truncate them.

#!/bin/bash
while true; do
  for ndbdpid in `ps -ef | grep ndbd | grep -v grep | awk '{ print $2 }'`; do
    echo "`date` - PID: ${ndbdpid} Open Files: `ls -l /proc/$ndbdpid/fd/ | wc -l`" >> /tmp/openfiles.txt
    echo " " >> /tmp/openfiles.txt
  done
  sleep 300
  echo "" >> /tmp/openfiles.txt
done

#!/bin/bash
while true; do
  for ndbdpid in `ps -ef | grep ndbd | grep -v grep | awk '{ print $2 }'`; do
    echo "`date` - ${ndbdpid}" >> /tmp/fd.txt
    ls -l /proc/$ndbdpid/fd/ >> /tmp/fd.txt
  done
  sleep 300
  echo "" >> /tmp/fd.txt
done
[23 Jan 2008 2:22]
Bradley Falzon
Just as an update to this report, not in relation to Adam Dixon's request: we have currently worked around this problem by moving the backups to a time of low load. Basically, we were hitting this bug when ndbd was running at 100% load (it's a dual Xeon system, so 100% refers to a single CPU, since at the time of writing ndbd is single-threaded; complete system load was 55%) while, at the same time, separate rsync cron jobs were sending the cluster's backups - made via ndb_mgm start backup - to a remote backup host. I was able to avoid the problem by rescheduling the rsync jobs away from the times of 100% load. I am planning (no commitment to a date yet) to use Adam Dixon's suggestion of while(1) { exec ("lsof > /tmp/lsof.txt"); sleep 1 } on one node within a node group, to help the developers resolve this issue, as it may affect other customers.
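A runnable form of that suggestion might look like the following (a sketch only; the availability of pgrep and lsof, and the one-second interval, are assumptions):

#!/bin/bash
# Snapshot the open files of all ndbd processes once per second,
# overwriting /tmp/lsof.txt on each pass as in the suggestion above.
while true; do
  pids=$(pgrep -d, ndbd)
  [ -n "$pids" ] && lsof -p "$pids" > /tmp/lsof.txt
  sleep 1
done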
[19 Feb 2009 13:13]
Jonas Oreland
The default value of this parameter has been changed in ndb-6.x to 0, meaning no upper limit. Other than that, this is a won't-fix.
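For anyone still on a version with the fixed default, a hedged config.ini sketch of the workaround (the section placement and the value 256 are assumptions; choose a cap that suits your node's file usage):

[ndbd default]
# 5.0.x enforces a hard cap (60 by default); raising it avoids the
# NDBD_EXIT_AFS_MAXOPEN shutdown at the cost of more file descriptors.
MaxNoOfOpenFiles=256
# On ndb-6.x the default is 0, which means no upper limit.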