MySQL Bugs: #22406: Falcon: Clients hang when a no disk space error occurs in background threads

Bug #22406	Falcon: Clients hang when a no disk space error occurs in background threads
Submitted:	15 Sep 2006 20:08	Modified:	26 May 2010 17:48
Reporter:	Peter Gulutzan	Email Updates:
Status:	Unsupported	Impact on me:	None
Category:	MySQL Server: Falcon storage engine	Severity:	S1 (Critical)
Version:	mysql-6.0-falcon-team	OS:	Any
Assigned to:	Olav Sandstå	CPU Architecture:	Any
Tags:	BLOB, F_ERROR HANDLING

Description:
I create a table with a LONGBLOB column.
I repeatedly INSERT big rows until there is no more space.
I expect an error message.
Instead, I get a crash.

ChangeSet@1.2292.

How to repeat:
delimiter //
create table tv (s1 longblob) engine=falcon//
set @@max_allowed_packet = 1000000000//
create procedure pv ()
  begin
  declare v int default 0;
  while v < 1000 do
    select v;
    insert into tv values (repeat('a',100000000));
    set v = v + 1;
  end while;
end// 
call pv()//

Eventually -- the exact time depends on how much space is available --
the client will display
ERROR 2013 (HY000): Lost connection to MySQL server during query

The final messages from the server were:

Stalled threads
  Thread 1f22320 (1091615072) sleep=0, grant=0, locks=0, who 0, javaThread 0
      prior=8f8f8f8f8f8f8f8f, next=8f8f8f8f8f8f8f8f, parent=0
    Pending IO::writePage state 0 (1) syncObject 1f3c8f0
  Thread 1f510c0 (1096083808) sleep=0, grant=1, locks=0, who 1, javaThread 0
      prior=1f45800, next=1f50380, parent=1f1f910
Stalled synchronization objects:
  SyncObject 1f3c8f0: state -1, monitor 0, waiters 1
    Exclusive thread 1f510c0 (1096083808), type 2;
    Waiting thread 1f22320 (1091615072), type 1; IO::writePage
------------------------------------
Stalled threads
  Thread 1f60b20 (1100286304) sleep=0, grant=0, locks=0, who 0, javaThread 0
      prior=1f50380, next=1fb4a90, parent=1f1f910
Stalled synchronization objects:
------------------------------------
Bugcheck: write error on page 1481717 (-1/4096/27) of "/usr/local/mysql/var/f18.ndb": No space left on device (28)
Memory
Records
mysqld got signal 4;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=8388572
read_buffer_size=131072
max_used_connections=3
max_connections=100
threads_connected=2
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 225788 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thank you for the bug report.

| v    |
+------+
|   96 | 
+------+
1 row in set (12 min 40.94 sec)

ERROR 2013 (HY000): Lost connection to MySQL server during query
mysql> 

Activity on /home/miguel/dbs/5.1f/var/test.ndb: 97987 fetches, 120 reads, 98059 writes
Activity on /home/miguel/dbs/5.1f/var/test.ndb: 73491 fetches, 90 reads, 73545 writes
Bugcheck: write error on page 2365118 (-1/4096/26) of "/home/miguel/dbs/5.1f/var/test.ndb": No space left on device (28)
Memory
Records

Hakan, can this be retested?  Can we get a call stack if it still fails?

Still fails with an assertion:

(gdb) f 3
#3  0x0000000000834301 in IO::writePage (this=0x2aaaaf354458, bdb=0x2aaaaf1776c8, type=4) at IO.cpp:242
242                                     (const char*) fileName, strerror (errno), errno);
(gdb) l
237             if (length != pageSize)
238                     {
239                     declareFatalError();
240                     FATAL ("write error on page %d (%d/%d/%d) of \"%s\": %s (%d)",
241                                     bdb->pageNumber, length, pageSize, fileId,
242                                     (const char*) fileName, strerror (errno), errno);
243                     }
244
245             ++writes;
246             ++writesSinceSync;

Program received signal SIGILL, Illegal instruction.
[Switching to Thread 0x42085950 (LWP 9873)]
0x00002b758de88fcb in raise () from /lib/libpthread.so.0
(gdb) bt
#0  0x00002b758de88fcb in raise () from /lib/libpthread.so.0
#1  0x000000000082341e in Error::debugBreak () at Error.cpp:92
#2  0x0000000000823514 in Error::error (string=<value optimized out>) at Error.cpp:69
#3  0x0000000000834301 in IO::writePage (this=0x2aaaaf354458, bdb=0x2aaaaf1776c8, type=4) at IO.cpp:242
#4  0x000000000080666f in Cache::writePage (this=0x2aaaaf30c0f8, bdb=0x2aaaaf1776c8, type=4) at Cache.cpp:593
#5  0x0000000000843654 in PageWriter::writer (this=0x2aaaaf29e8b8) at PageWriter.cpp:186
#6  0x00000000008436bf in PageWriter::writer (arg=0x2453) at PageWriter.cpp:160
#7  0x00000000007f8f12 in Thread::thread (this=0x2aaaaf317b30) at Thread.cpp:161
#8  0x00000000007f907f in Thread::thread (parameter=0x2453) at Thread.cpp:140
#9  0x00002b758de81317 in start_thread () from /lib/libpthread.so.0
#10 0x00002b758f373b1d in clone () from /lib/libc.so.6
#11 0x0000000000000000 in ?? ()

Jim, This assert happens when a page is written to the file and fails with a disk full error.  Is there a better way to recognize that the disk is out of space before writing the last page?  What about pre-allocation of page ranges?  Since page writing is a background activity, it seems like the system will need to shut down so that the disk space can be increased.  Then restarting the engine should allow a recovery to finish out the OS file.  Is there a better way to alert the user that the disk is full?

Jim Starkey wrote;
Falcon currently throws a fatal error when encountering any disk error.  
For a true disk error, this is probably the best solution.  For the more 
mundane "out of space" error, we can do better.

I think the way to handle "disk full" errors is the would-be writer to 
declare an pending IO error (which increments an error counter) then go 
into a re-try loop with a one second timeout.  In the meantime, any and 
all update verbs (insert, update and delete) will bounce with a pending 
IO error, with the original error text.  If -- and when -- more disk 
space becomes available, Falcon will continue normal operations. 
Otherwise, the I/O threads will hang and updates will continue to 
bounce.  Falcon will (i.e. should) respond to a normal storage engine 
shutdown.   Restart will fail in recovery, of course, unless something 
is down about the low disk space problem.

The solution is simple and localized.  The hard part will be testing the 
orderly shutdown of Falcon in error state.

"Device full" is now handled by a retry loop with a one second timeout.  In the meantime, all update verbs are bounced with an appropriate error.  Falcon will recover spontaneously if disk space is freed up.  Otherwise, the server can be shutdown and the database subsequently recovered when disk space becomes available.

Code reviewed

Now crashing with:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x40882950 (LWP 19185)]
Stream::putCharacter (this=0x0, c=-67 '�') at Stream.cpp:73
73              if (!segments || current->length >= currentLength)
(gdb) 

#0  Stream::putCharacter (this=0x0, c=-67 '�') at Stream.cpp:73
#1  0x00000000008279f9 in EncodedDataStream::encodeBinaryBlob (this=0x2aaaaf2ad840, blobId=0)
    at EncodedDataStream.cpp:1288
#2  0x00000000007e9df2 in StorageInterface::encodeRecord (this=0x1283c98, buf=0x1283f31 "", updateFlag=false)
    at ha_falcon.cpp:2236
#3  0x00000000007e9fec in StorageInterface::write_row (this=0x1283c98, buff=0x1283f30 "�") at ha_falcon.cpp:933
#4  0x0000000000732a92 in handler::ha_write_row (this=0x1283c98, buf=0x1283f30 "�") at handler.cc:4608
#5  0x00000000006c7a43 in write_record (thd=0x1241458, table=0x1281638, info=0x4087fa30) at sql_insert.cc:1548
#6  0x00000000006cd1fb in mysql_insert (thd=0x1241458, table_list=0x127b6c0, fields=@0x128cfe0,
    values_list=@0x128d028, update_fields=@0x128d010, update_values=@0x128cff8, duplic=DUP_ERROR, ignore=false)
    at sql_insert.cc:803

#7  0x000000000065977e in mysql_execute_command (thd=0x1241458) at sql_parse.cc:2663
#8  0x00000000007aa509 in sp_instr_stmt::exec_core (this=0x127bc20, thd=0xffffffbd, nextp=0x4)
    at sp_head.cc:2785

Put to S1 as it is crashing now.

This latest crash turned out to be related to a change by cpowers for Bug#32730, which was pushed immediately after Jim's change to handle 'no disk space' errors.  See the explanation there.

Fixed in mysql-6.0-falcon-team tree. When file system is full, the client just stops.

Is this behavior ok?

Yes, Hakan, returning an out-of-disk-space error to all clients is the right thing to do.  In this scheme, the background threads keep trying every second to get thisngs written.  If the engine shuts down, then a recovery will happen.  And if there is still not enough disk space, the recovery will not finish and will be redone the next time.  Shutting down the progress of all current clients is necessary and appropriate.

What I meant is that the mysql client just hangs. There is no error message, not on the client nor in the server logs. From my point of view, the current behavior is not optimal. There should be a way to figure out what happend, just letting the client hang is not that informative.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

Earlier comments on this bug include:
"... Falcon will (i.e. should) respond to a normal storage engine
shutdown."
"... In the meantime, all update verbs are bounced with an appropriate
error."

These things are not happening.

I caused the out-of-disk-space condition so that the client hangs.
This is confusing and unconventional but that's a matter for another day.

I then tried to UPDATE on another connection and I got a 'hang' rather
than a 'bounce with an error' I have no idea what a bounce is but I can
see that there is no error message.

I then tried mysqladmin shutdown and the server did not shut down.
I know that shutdown can be slow with Falcon (Bug#32062),
but I allowed 1/2 hour. The shutdown did not finish.

So, although the crash is gone, the comments describing after-the-fix
behaviour are apparently untrue, so I have set the status back to "verified".

Olav, Since you are doing some other work with reporting errors, please look at this.  The current issue is that current clients hang when a no-disk-space error occurs in tha background.  Instead, they should report that error.