Bug #66363 Large transaction writing rbr with sync_binlog hangs entire server on fsync
Submitted: 13 Aug 2012 21:33 Modified: 13 Mar 2013 22:27
Reporter: Trey Raymond Email Updates:
Status: Verified Impact on me:
Category: MySQL Server: Replication Severity: S2 (Serious)
Version: 5.1.58, 5.1.67 OS: Linux (RHEL 6.1)
Assigned to: CPU Architecture:Any
Triage: Needs Triage: D1 (Critical)

[13 Aug 2012 21:33] Trey Raymond
HP DL180 G5, 2 x Xeon L5420 2.50GHz, 15.7GB / 16GB 667MHz DDR2, 6 x 450GB-15K SAS (RAID 10, XFS)

This host is the master of a row-based replication system, with sync_binlog=1.  Earlier today, a user (who will be scolded!) ran "delete from table_name;" on an InnoDB table with approx. 150M records, ~18G of raw data (the table itself was of course larger).  I can see from monitoring that the statement ran for ~70 minutes and appears to have deleted every record successfully, though it never returned to the client.  At that point, when it would have been writing the committed transaction to the binary log and syncing it, the mysqld instance became unresponsive - even login attempts would just hang, with no error or even a connect_timeout rejection.

Host monitoring shows that disk writes peaked sharply right at commit time, then dropped but remained significant for approximately 80 minutes.  Nothing else significant was active on the host...the heartbeat event would have been running if the event scheduler was not hung, but nothing could connect.  One CPU core was at 100% for this full 80 minutes.  I assume this is the time spent writing the large row-based binary log out to disk...and that it was holding some sort of global mutex preventing any other use of the server at all.  While this is dangerous, one would think that after the write was done the thread would release the lock and usage would return to normal...but it did not.  I believe something about the large fsync (with sync_binlog) caused it to keep holding this global lock.

The server remained unresponsive for three hours after disk/CPU activity ceased.  It was reported to me, and I took a look - one giant 18G binary log (as max_binlog_size does not apply within a single transaction), which matched the size of the data being deleted, and a completely inaccessible server.  The delete statement had still not returned to the client.  I first issued a normal mysqld stop, which caused the Normal Shutdown line to be written to mysqld.err (no other info had been written to mysqld.err), but the process still did not die, and I had to use a last-ditch kill -9 to get rid of it.  After this, the server started normally.

When I started the slave of this host, it pulled the large binary log (which had successfully been written in its entirety, even if the fsync hung) and executed the full delete in ~90 minutes with no issue, then returned to normal.

How to repeat:
Run an enormous write transaction on a row-based master with sync_binlog=1.  I'm trying to find a threshold, or repeat it without sync_binlog to see if the server still hangs (on a mutex?) during the binary log write, and will update this with my results, but these things take time just to load up.

Suggested fix:
Avoid a server-wide lock during binary log writes.  Blocking other transactions that wrote data from committing would be acceptable, though there must be some way to improve this without losing the safety of sync_binlog - considering that even just writing the binary log more than doubled the time of the transaction.  But really, there's no reason that an fsync on a binary log should block a user from connecting, and this problem means that any large write transaction can cause a complete outage, hence the "serious" priority.
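Not part of the original report, but as a back-of-envelope check of the raw flush cost involved, the time to write and fsync a large sequential file can be observed with a quick shell test (the path and size here are arbitrary, scaled far down from the 18G binlog):

```shell
# Illustrative only: time a write-plus-fsync of a sequential file to get
# a feel for why flushing a multi-GB binlog can stall for so long.
time dd if=/dev/zero of=/tmp/binlog.test bs=1M count=16 conv=fsync
rm -f /tmp/binlog.test
```

Scaling the observed rate up to 18G gives a rough lower bound on the flush time; the report's point is that whatever that time is, it should not be spent holding a server-wide mutex.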
[16 Aug 2012 19:12] Trey Raymond
From further testing...it doesn't seem to depend on sync_binlog; the issue happens even when it's set to 0.  I ran a bunch of tests insert...select'ing 40mil rows from one table into another identical table.  Every time, the server behaved fine during the insert, then it finished inserting, briefly went into the 'freeing items' state, and then the host became unresponsive.  IO use shot up (expected, with a large flush) and a single core was heavily in iowait during the hang, but that's no reason the whole system should become unresponsive.  It must be acquiring some sort of global mutex.
[16 Aug 2012 19:31] Shane Bester
I guess the global mutex is LOCK_log, held while copying from tmpdir->binlog.
[13 Mar 2013 22:27] Sveta Smirnova
Thank you for the report.

Verified as described.

In my case I observed a delay rather than a permanent hang.

$ cat result.log 
while true; do mysql -uroot -S /tmp/mysql_ssmirnova5.sock -e "select now()" >> bug66363_1.log; sleep 1; done
 2131 now()
 2132 2013-03-13 19:10:54
 2133 now()
 2134 2013-03-13 19:12:24

 9958 2013-03-13 19:45:47
 9959 now()
 9960 2013-03-13 19:45:47
 9961 now()
 9962 2013-03-13 19:45:57
 9963 now()
 9964 2013-03-13 19:45:57
 9965 now()

while true; do mysql -uroot -S /tmp/mysql_ssmirnova5.sock -e "select now()" test >> bug66363_2.log; sleep 1; done

3872 2013-03-13 19:45:47
3873 now()
3874 2013-03-13 19:45:57
3875 now()

To repeat:

1. Start server with options:

./libexec/mysqld --defaults-file=support-files/my-small.cnf --basedir=. --datadir=./data --innodb-file-per-table --log-bin --binlog-format=row --socket=/tmp/mysql_ssmirnova5.sock --port=33051 --log-error --innodb_buffer_pool_size=8G &

2. Use table from bug #43968:

CREATE TABLE `test_log` (
  `SEQ` bigint(20) unsigned NOT NULL AUTO_INCREMENT,
  `TEST_ID` int(11) NOT NULL,
  `TEST_NAME` varchar(20) COLLATE utf8_bin NOT NULL,

Run the test case from bug #43968 until the table is approximately 4.5G.

3. Run in parallel session two jobs which SELECT NOW() and log results:

while true; do mysql -uroot -S /tmp/mysql_ssmirnova5.sock -e "select now()" >> bug66363_1.log; sleep 1; done &

while true; do mysql -uroot -S /tmp/mysql_ssmirnova5.sock -e "select now()" test >> bug66363_2.log; sleep 1; done

4. Run huge DELETE:

mysql> begin;
Query OK, 0 rows affected (0.00 sec)

mysql> delete from test_log;
Query OK, 32171867 rows affected (32 min 20.19 sec)

mysql> commit;
Query OK, 0 rows affected (16.78 sec)

5. Stop jobs which log SELECT NOW() output into files.

6. Create LEX file:

$ cat verify.l
%option noyywrap warn
%x IDATE IHOUR IMINUTE

%{
#include <stdio.h>
#include <stdlib.h>

	int oldtime = 0;
	int time = 0;
%}

%%

"2013-03-13 " { BEGIN IDATE; }
<IDATE>[0-9][0-9] { time += atoi(yytext)*3600; BEGIN IHOUR; }
<IHOUR>[0-9][0-9] { time += atoi(yytext)*60; BEGIN IMINUTE; }
<IMINUTE>[0-9][0-9] {
	time += atoi(yytext);
	if ( 0 != oldtime && 1 < (time - oldtime)) {
		fprintf(yyout, "Difference found: difference: - %8d sec, time - %8d, oldtime - %8d\n", time - oldtime, time, oldtime);
	}
	oldtime = time;
	time = 0;
	BEGIN INITIAL;
}
<IDATE,IHOUR,IMINUTE>. {}
"now()" { }
\n {}
. {}

%%

int main(int argc, char **argv)
{
	yylex();
	printf("Parsing finished\n");
	return 0;
}

7. Compile it:

$ flex verify.l 
$ gcc lex.yy.c -o verify 

8. Run it on log files:

$ ./verify <bug66363_1.log 
Difference found: difference: -        2 sec, time -    68045, oldtime -    68043
Difference found: difference: -       90 sec, time -    69144, oldtime -    69054
Difference found: difference: -        2 sec, time -    71145, oldtime -    71143
Difference found: difference: -       10 sec, time -    71157, oldtime -    71147
Difference found: difference: -        2 sec, time -    71245, oldtime -    71243
Parsing finished

$ ./verify <bug66363_2.log 
Difference found: difference: -        2 sec, time -    69203, oldtime -    69201
Difference found: difference: -       10 sec, time -    71157, oldtime -    71147
Difference found: difference: -        2 sec, time -    71403, oldtime -    71401
Parsing finished

9. Observe the issue.
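As a lighter-weight cross-check of the flex parser, the same gap detection can be sketched in awk (assuming the raw log files contain bare "now()" and timestamp lines as produced by the mysql -e loops above; the output format here is simplified, not identical to verify's):

```shell
# Hypothetical awk equivalent of verify.l: report gaps of more than one
# second between consecutive timestamps in a bug66363_*.log file.
awk -F'[ :]' '/^2013-03-13/ {
    t = $2*3600 + $3*60 + $4              # seconds since midnight
    if (old && t - old > 1)
        printf "Difference found: %d sec (time %d, oldtime %d)\n", t - old, t, old
    old = t
}' bug66363_1.log
```

For the 19:10:54 -> 19:12:24 pair in the first log excerpt this reports a 90-second gap, matching verify's output.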
[13 Mar 2013 22:31] Sveta Smirnova
A bigger test crashed the debug server for me:

mysqld: sql_class.cc:503: void Diagnostics_area::set_ok_status(THD*, ha_rows, ulonglong, const char*): Assertion `! is_set()' failed.
22:52:44 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 49930 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x2aacd00008e8
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 45bb5108 thread_stack 0x20000

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (1de8f1c8): commit
Connection ID (thread ID): 74668

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

For the bigger test I created a table like:

  `f1` blob

Then I filled it until the size of the *.ibd file was about 450MB.

Then I created 9 copies of this table (CREATE TABLE t1 LIKE t0; INSERT INTO t1 SELECT * FROM t0; and so on), then ran the following transaction:

$cat bug66363.sql
delete from t1;
delete from t2;
delete from t3;
delete from t4;
delete from t5;
delete from t6;
delete from t7;
delete from t8;
delete from t9;

Server crashed when commit was executed.