Bug #22406 | Falcon: Clients hang when a no disk space error occurs in background threads | ||
---|---|---|---|
Submitted: | 15 Sep 2006 20:08 | Modified: | 26 May 2010 17:48 |
Reporter: | Peter Gulutzan | Email Updates: | |
Status: | Unsupported | Impact on me: | |
Category: | MySQL Server: Falcon storage engine | Severity: | S1 (Critical) |
Version: | mysql-6.0-falcon-team | OS: | Any |
Assigned to: | Olav Sandstå | CPU Architecture: | Any |
Tags: | BLOB, F_ERROR HANDLING |
[15 Sep 2006 20:08]
Peter Gulutzan
[17 Sep 2006 20:40]
MySQL Verification Team
Thank you for the bug report. | v | +------+ | 96 | +------+ 1 row in set (12 min 40.94 sec) ERROR 2013 (HY000): Lost connection to MySQL server during query mysql> Activity on /home/miguel/dbs/5.1f/var/test.ndb: 97987 fetches, 120 reads, 98059 writes Activity on /home/miguel/dbs/5.1f/var/test.ndb: 73491 fetches, 90 reads, 73545 writes Bugcheck: write error on page 2365118 (-1/4096/26) of "/home/miguel/dbs/5.1f/var/test.ndb": No space left on device (28) Memory Records
[18 Oct 2007 22:20]
Kevin Lewis
Hakan, can this be retested? Can we get a call stack if it still fails?
[20 Nov 2007 1:17]
Hakan Küçükyılmaz
Still fails with an assertion: (gdb) f 3 #3 0x0000000000834301 in IO::writePage (this=0x2aaaaf354458, bdb=0x2aaaaf1776c8, type=4) at IO.cpp:242 242 (const char*) fileName, strerror (errno), errno); (gdb) l 237 if (length != pageSize) 238 { 239 declareFatalError(); 240 FATAL ("write error on page %d (%d/%d/%d) of \"%s\": %s (%d)", 241 bdb->pageNumber, length, pageSize, fileId, 242 (const char*) fileName, strerror (errno), errno); 243 } 244 245 ++writes; 246 ++writesSinceSync; Program received signal SIGILL, Illegal instruction. [Switching to Thread 0x42085950 (LWP 9873)] 0x00002b758de88fcb in raise () from /lib/libpthread.so.0 (gdb) bt #0 0x00002b758de88fcb in raise () from /lib/libpthread.so.0 #1 0x000000000082341e in Error::debugBreak () at Error.cpp:92 #2 0x0000000000823514 in Error::error (string=<value optimized out>) at Error.cpp:69 #3 0x0000000000834301 in IO::writePage (this=0x2aaaaf354458, bdb=0x2aaaaf1776c8, type=4) at IO.cpp:242 #4 0x000000000080666f in Cache::writePage (this=0x2aaaaf30c0f8, bdb=0x2aaaaf1776c8, type=4) at Cache.cpp:593 #5 0x0000000000843654 in PageWriter::writer (this=0x2aaaaf29e8b8) at PageWriter.cpp:186 #6 0x00000000008436bf in PageWriter::writer (arg=0x2453) at PageWriter.cpp:160 #7 0x00000000007f8f12 in Thread::thread (this=0x2aaaaf317b30) at Thread.cpp:161 #8 0x00000000007f907f in Thread::thread (parameter=0x2453) at Thread.cpp:140 #9 0x00002b758de81317 in start_thread () from /lib/libpthread.so.0 #10 0x00002b758f373b1d in clone () from /lib/libc.so.6 #11 0x0000000000000000 in ?? ()
[21 Nov 2007 4:47]
Kevin Lewis
Jim, This assert happens when a page is written to the file and fails with a disk full error. Is there a better way to recognize that the disk is out of space before writing the last page? What about pre-allocation of page ranges? Since page writing is a background activity, it seems like the system will need to shut down so that the disk space can be increased. Then restarting the engine should allow a recovery to finish out the OS file. Is there a better way to alert the user that the disk is full?
[28 Nov 2007 16:07]
Kevin Lewis
Jim Starkey wrote; Falcon currently throws a fatal error when encountering any disk error. For a true disk error, this is probably the best solution. For the more mundane "out of space" error, we can do better. I think the way to handle "disk full" errors is the would-be writer to declare an pending IO error (which increments an error counter) then go into a re-try loop with a one second timeout. In the meantime, any and all update verbs (insert, update and delete) will bounce with a pending IO error, with the original error text. If -- and when -- more disk space becomes available, Falcon will continue normal operations. Otherwise, the I/O threads will hang and updates will continue to bounce. Falcon will (i.e. should) respond to a normal storage engine shutdown. Restart will fail in recovery, of course, unless something is down about the low disk space problem. The solution is simple and localized. The hard part will be testing the orderly shutdown of Falcon in error state.
[28 Nov 2007 18:28]
Jim Starkey
"Device full" is now handled by a retry loop with a one second timeout. In the meantime, all update verbs are bounced with an appropriate error. Falcon will recover spontaneously if disk space is freed up. Otherwise, the server can be shutdown and the database subsequently recovered when disk space becomes available.
[28 Nov 2007 20:22]
Kevin Lewis
Code reviewed
[28 Nov 2007 20:43]
Hakan Küçükyılmaz
Now crashing with: Program received signal SIGSEGV, Segmentation fault. [Switching to Thread 0x40882950 (LWP 19185)] Stream::putCharacter (this=0x0, c=-67 '�') at Stream.cpp:73 73 if (!segments || current->length >= currentLength) (gdb) #0 Stream::putCharacter (this=0x0, c=-67 '�') at Stream.cpp:73 #1 0x00000000008279f9 in EncodedDataStream::encodeBinaryBlob (this=0x2aaaaf2ad840, blobId=0) at EncodedDataStream.cpp:1288 #2 0x00000000007e9df2 in StorageInterface::encodeRecord (this=0x1283c98, buf=0x1283f31 "", updateFlag=false) at ha_falcon.cpp:2236 #3 0x00000000007e9fec in StorageInterface::write_row (this=0x1283c98, buff=0x1283f30 "�") at ha_falcon.cpp:933 #4 0x0000000000732a92 in handler::ha_write_row (this=0x1283c98, buf=0x1283f30 "�") at handler.cc:4608 #5 0x00000000006c7a43 in write_record (thd=0x1241458, table=0x1281638, info=0x4087fa30) at sql_insert.cc:1548 #6 0x00000000006cd1fb in mysql_insert (thd=0x1241458, table_list=0x127b6c0, fields=@0x128cfe0, values_list=@0x128d028, update_fields=@0x128d010, update_values=@0x128cff8, duplic=DUP_ERROR, ignore=false) at sql_insert.cc:803 #7 0x000000000065977e in mysql_execute_command (thd=0x1241458) at sql_parse.cc:2663 #8 0x00000000007aa509 in sp_instr_stmt::exec_core (this=0x127bc20, thd=0xffffffbd, nextp=0x4) at sp_head.cc:2785
[28 Nov 2007 21:07]
Hakan Küçükyılmaz
Put to S1 as it is crashing now.
[29 Nov 2007 19:17]
Kevin Lewis
This latest crash turned out to be related to a change by cpowers for Bug#32730, which was pushed immediately after Jim's change to handle 'no disk space' errors. See the explanation there.
[29 Nov 2007 22:44]
Hakan Küçükyılmaz
Fixed in mysql-6.0-falcon-team tree. When file system is full, the client just stops. Is this behavior ok?
[30 Nov 2007 1:20]
Kevin Lewis
Yes, Hakan, returning an out-of-disk-space error to all clients is the right thing to do. In this scheme, the background threads keep trying every second to get thisngs written. If the engine shuts down, then a recovery will happen. And if there is still not enough disk space, the recovery will not finish and will be redone the next time. Shutting down the progress of all current clients is necessary and appropriate.
[30 Nov 2007 10:57]
Hakan Küçükyılmaz
What I meant is that the mysql client just hangs. There is no error message, not on the client nor in the server logs. From my point of view, the current behavior is not optimal. There should be a way to figure out what happend, just letting the client hang is not that informative.
[1 Jan 2008 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[1 Jan 2008 18:08]
Peter Gulutzan
Earlier comments on this bug include: "... Falcon will (i.e. should) respond to a normal storage engine shutdown." "... In the meantime, all update verbs are bounced with an appropriate error." These things are not happening. I caused the out-of-disk-space condition so that the client hangs. This is confusing and unconventional but that's a matter for another day. I then tried to UPDATE on another connection and I got a 'hang' rather than a 'bounce with an error' I have no idea what a bounce is but I can see that there is no error message. I then tried mysqladmin shutdown and the server did not shut down. I know that shutdown can be slow with Falcon (Bug#32062), but I allowed 1/2 hour. The shutdown did not finish. So, although the crash is gone, the comments describing after-the-fix behaviour are apparently untrue, so I have set the status back to "verified".
[25 Nov 2008 17:16]
Kevin Lewis
Olav, Since you are doing some other work with reporting errors, please look at this. The current issue is that current clients hang when a no-disk-space error occurs in tha background. Instead, they should report that error.