Bug #57936 Assertion `! is_set()' failed after out of disk space writing binlog
Submitted: 2 Nov 2010 20:01 Modified: 14 Jan 2013 21:14
Reporter: Sven Sandberg Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.1+ OS:Any
Assigned to: Matthew Lord CPU Architecture:Any
Tags: binlog, Diagnostics_area::set_ok_status, disk full

[2 Nov 2010 20:01] Sven Sandberg
Description:
When the disk becomes full while writing the binlog, it waits for someone to free space:
 - In MYSQL_LOG::open(), it calls init_io_cache with the MY_WAIT_IF_FULL flag.
 - When an event is written that is bigger than the remaining disk space, it
   produces the following call stack:
     THD::binlog_query
     MYSQL_BIN_LOG::write
     Query_log_event::write
     my_b_safe_write
     _my_b_write
     my_write
     wait_for_free_space
     my_error
     my_message_sql
     Diagnostics_area::set_error_status
   Thus, it sets the error status in the Diagnostics_area.
 - Then, it returns to wait_for_free_space and sleeps for 60 seconds.
 - Then, it returns to my_write and tries to write again, etc.
 - If someone frees up disk space, the write will succeed and my_write
   will return success.
 - The statement continues to execute without problem.
 - At the end of the statement, when it calls set_ok_status(), an assert
   is triggered because the Diagnostics_area status is already set.

How to repeat:
==== BEGIN rpl_disk_full_assertion.test ====
--source include/master-slave.inc
--source include/have_binlog_format_statement.inc
--let $master_dir= `SELECT @@datadir`
--let $master_dir= $master_dir/binlogs
--let $master_file= $master_dir/bigfile
--let $long_string= `SELECT REPEAT('a', 500000)`
--eval SELECT REPEAT('a', 750000) INTO DUMPFILE '$master_file'
CREATE TABLE t1 (a TEXT(1000000));
--disable_query_log
--send
--eval INSERT INTO t1 VALUES ('$long_string')
--enable_query_log
--sleep 5
--remove_file $master_file
--reap
==== END ====

==== BEGIN rpl_disk_full_assertion-master.opt ====
--log-bin=$MYSQLTEST_VARDIR/mysqld.1/data/binlogs/master-bin
==== END ====

==== BEGIN rpl_disk_full_assertion-master.sh ====
dir=$MYSQLTEST_VARDIR/mysqld.1/data/binlogs
img=$dir.fs
echo dir=$dir img=$img
mkdir $dir
head -c 1000000 /dev/zero > $img
mkfs.ext2 -F $img
sudo mount -o loop $img $dir
==== END ====

1. copy these scripts to mysql-test/suite/rpl/t
2. run ./mtr rpl_disk_full_assertion
3. wait for > 60 seconds, watch it crash
4. run gdb on the core. It produces this:

#0  0x00007fc7d46efcfc in pthread_kill () from /lib/libpthread.so.0
#1  0x0000000000b4ac21 in my_write_core (sig=6) at stacktrace.c:331
#2  0x000000000069f7fb in handle_segfault (sig=6) at mysqld.cc:2609
#3  <signal handler called>
#4  0x00007fc7d3ab9ba5 in raise () from /lib/libc.so.6
#5  0x00007fc7d3abd6b0 in abort () from /lib/libc.so.6
#6  0x00007fc7d3ab2a71 in __assert_fail () from /lib/libc.so.6
#7  0x0000000000680985 in Diagnostics_area::set_ok_status (this=0x3094040, thd=0x3092988, affected_rows_arg=1, last_insert_id_arg=0, message_arg=0x0) at sql_class.cc:512
#8  0x00000000005a95ea in my_ok (thd=0x3092988, affected_rows=1, id=0, message=0x0) at sql_class.h:2405
#9  0x0000000000753cdc in mysql_insert (thd=0x3092988, table_list=0x307bed0, fields=..., values_list=..., update_fields=..., update_values=..., duplic=DUP_ERROR, ignore=false) at sql_insert.cc:967
#10 0x00000000006b4fe8 in mysql_execute_command (thd=0x3092988) at sql_parse.cc:3250
#11 0x00000000006bda99 in mysql_parse (thd=0x3092988, rawbuf=0x3148428 "INSERT INTO t1 VALUES ('", 'a' <repeats 176 times>..., length=500026, found_semicolon=0x7fc7d4976960) at sql_parse.cc:6068
#12 0x00000000006afbfa in dispatch_command (command=COM_QUERY, thd=0x3092988, packet=0x30cd3d9 "INSERT INTO t1 VALUES ('", 'a' <repeats 176 times>..., packet_length=500026) at sql_parse.cc:1261
#13 0x00000000006aeb90 in do_command (thd=0x3092988) at sql_parse.cc:889
#14 0x00000000006acdd0 in handle_one_connection (arg=0x3092988) at sql_connect.cc:1136
#15 0x00007fc7d46ea971 in start_thread () from /lib/libpthread.so.0
#16 0x00007fc7d3b6c91d in clone () from /lib/libc.so.6
#17 0x0000000000000000 in ?? ()

Add DBUG_ASSERT(0) to the end of Diagnostics_area::set_error_status to get a stack trace when it sets the error status for the first time.

Suggested fix:
I don't know. Perhaps one of:

 - Make wait_for_free_space clear the diagnostics area again.

 - Make my_write clear the diagnostics area if it first calls wait_for_free_space
   and later succeeds to write.
[2 Nov 2010 20:12] Sven Sandberg
After you have run the test script above, don't forget to unmount that file system.
[14 Jan 2013 21:07] Matthew Lord
I couldn't repeat the crash with 5.5.29.

I created the loop device of 100K on /tmp/binlogs and just altered my existing test MySQL 5.5 setup to use log-bin=/tmp/binlogs.

Once I started mysqld, then I filled up that device with:
  dd if=/dev/zero of=/tmp/binlogs/waste.bin bs=1024 count=999999

I then ran several inserts into a test.loadtest table. After a few of them, then I got the expected SQL layer hang and the disk full messages in the MySQL error log.

I then removed the /tmp/binlogs file and then the SQL query returned and the binary log was updated.

I got no assert.

I will now test this with 5.1.69.
[14 Jan 2013 21:14] Matthew Lord
I was not able to repeat the crash using 5.1.65-advanced, either.

I used the same 100K loop device on /tmp/binlogs.

I then started my existing 5.1 test instance, modifying:
  log-bin=/tmp/binlogs/foo

I then filled up the device with dd:
  dd if=/dev/zero of=/tmp/binlogs/waste.bin bs=1024 count=999999

I then created some dummy records in a test.test2 table.  After several cases then I got the expected messages in the error log and query hang.

I then deleted the waste.bin file and things picked up fine again.

I was not able to get the failed ASSERT and crash in recent 5.5 or 5.1 Enterprise releases.
[15 Jan 2013 9:31] Sven Sandberg
I confirm that this seems to have been fixed already. It seems that wait_for_free_space now logs a warning instead of raises an error, so that should avoid the problem.