MySQL Bugs: #22082: Slave hangs(holds mutex) on "disk full"

Bug #22082	Slave hangs(holds mutex) on "disk full"
Submitted:	7 Sep 2006 14:04	Modified:	18 Mar 2009 15:14
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	Tried on 5.0.25, 5.0.51, 5.0.67	OS:	Any
Assigned to:	Zhenxing He	CPU Architecture:	Any

Description:
If the slave's disk is allowed to fill up,
it prints the following warning: "Disk is full writing"
 Waiting for someone to free space... Retry in 60 secs

1) During these 60s, "stop slave" and "show slave status" hang for the duration
2) When I let it sleep 60s, it printed the warning again... but I have now waited
   5 minutes without it printing the warning again, and I have a client hanging
   in "show slave status"
3) I was nice and released some disk space... still hanging

So I have to kill it...
Then when I restart it, I have gotten the following _every_ time I tried:
060907 15:50:09 [ERROR] Slave: Could not parse relay log event entry. The possible reasons are: the master's binary log is corrupted (you can check this by running 'mysqlbinlog' on the binary log), the slave's relay log is corrupted (you can check this by running 'mysqlbinlog' on the relay log), a network problem, or a bug in the master's or slave's MySQL code. If you want to check the master's binary log or slave's relay log, you will be able to know their names by issuing 'SHOW SLAVE STATUS' on this slave. Error_code: 0
060907 15:50:09 [ERROR] Error running query, slave SQL thread aborted. Fix the problem, and restart the slave SQL thread with "SLAVE START". We stopped at log 'perch-bin.000001' position 2010981

I.e., it seems to write "incomplete" data to disk before the disk gets full...
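
The symptom above is consistent with a half-written event being left at the end of the relay log. As a minimal sketch of how a log writer could avoid that, the Python below appends one length-prefixed record and cuts the file back to its previous size if the write fails. The record format, the function name `append_event` and the injectable `write` parameter are assumptions made for illustration; this is not how the MySQL server writes its relay log.

```python
import os
import struct


def append_event(fd, payload, write=os.write):
    """Append one length-prefixed event; undo a partial write on failure.

    Hypothetical helper: `write` is injectable only so that a failing
    write can be simulated.
    """
    start = os.lseek(fd, 0, os.SEEK_END)  # offset where this event begins
    record = struct.pack("<I", len(payload)) + payload
    written = 0
    try:
        while written < len(record):
            # A write call may store fewer bytes than requested.
            written += write(fd, record[written:])
    except OSError:
        # For example ENOSPC: cut the file back so that no
        # half-written event remains, then report the error.
        os.ftruncate(fd, start)
        raise
```

With this shape, a reader that later parses the file only ever sees complete events; the caller decides whether to retry the append after space has been freed.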

How to repeat:
dd if=/dev/zero of=smallfs.dat bs=1024k count=3
mkfs.ext2 smallfs.dat
mount -o loop smallfs.dat mysqld-var-dir

Start the slave with this "mysqld-var-dir" as its datadir.
Produce enough binlog on the master, and wait for the disk to get full on the slave.
Try issuing "show slave status".

Suggested fix:
1) Don't write incomplete data to disk
2) Either stop the slave (instead of "retrying")
   or make sure to release the mutex while waiting the 60s
3) Fix it so that it can wait more than 1*60s
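
Suggestion 2 (release the mutex while waiting) can be illustrated with a small Python sketch. The writer waits for disk space without holding the lock that status queries need, and takes the lock only for the short update afterwards. The class `RelayLogWriter`, its attributes and the two `threading.Event` objects are hypothetical names chosen for this illustration; they do not correspond to the server's actual data structures.

```python
import threading


class RelayLogWriter:
    """Hypothetical writer whose status can be read while it waits."""

    def __init__(self):
        self.lock = threading.Lock()          # guards `position`
        self.position = 0
        self.space_freed = threading.Event()  # set when disk space is available
        self.waiting = threading.Event()      # set once the writer starts waiting

    def write_event(self, size, disk_full):
        if disk_full:
            # Wait WITHOUT holding self.lock, so status queries stay responsive.
            self.waiting.set()
            self.space_freed.wait()
        with self.lock:
            # Hold the lock only for the short state update.
            self.position += size

    def show_status(self, timeout=1.0):
        # Returns the position, or None if the lock was not free in time.
        if not self.lock.acquire(timeout=timeout):
            return None
        try:
            return self.position
        finally:
            self.lock.release()
```

If `write_event` instead waited inside the `with self.lock:` block, `show_status` would time out for as long as the disk stayed full, which is the behaviour described in this report.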

Same behavior with current version and sync-binlog

Talked to Zhenxing about this bug:

SUMMARIZING PROBLEMS
====================
(Same numbering as Jonas had)

1. SHOW SLAVE STATUS can't be done when disk full
2. Sleeping is longer than 60s
3. Still hangs after sleep
4. If slave is killed, then relay log corrupted.

SOLUTION
========
1. It is expected behaviour that SHOW SLAVE STATUS
   hangs while the disk is full.  After disk space
   is freed, the command returns as expected.
2. Zhenxing has checked the code: the interval between reprints
   of the message becomes longer after a while.  This should be
   made clear in the error messages.
3. Zhenxing has confirmed that it is not hanging.  After the sleep,
   it does continue as it should (after the 60 second wait).
   We have not been able to reproduce the failure that
   Jonas describes (provided that he did wait the 60s).
4. In 6.0 there is a recovery mechanism that will delete any
   corrupted relay log.  In 5.0-5.1, there is no such feature
   implemented and slaves are not fully crash-safe.  If the
   server is killed, then the relay log may be corrupted.

ACTION
======
To close this bug, we will only improve the error message so
that the timing of the sleep messages is correctly specified:
- Add something like this for the first message:
  "Expect 60 seconds delay for server to continue after
  the disk space has been freed".
- The new message will be something like:
  "Retrying every 60 secs.  Message will be reprinted in 600 secs."

If someone can reproduce the problem with replication not
continuing after the disk has been freed and one has waited over
60s, then we will re-open the bug.  Zhenxing tested and can't
repeat that problem.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67432

2731 He Zhenxing	2009-02-25
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      
      When the disk is full, the server may wait for free space while
      writing the binlog, relay log or MyISAM tables. The server will
      continue after the user has freed some space. But the error
      message printed was not clear about how often the error
      message is printed, or that there is a delay between the user
      freeing space and the server continuing. This caused users
      to think that the server was hanging forever.
      
      This patch fixes the problem by making the error messages
      clearer. The error message is split into two parts:
      the first part is printed only once, and the second part
      is printed once every 10 retries.
      
      Message first part:
      Disk is full writing '<filename>' (Errcode: <errorno>). Waiting
      for someone to free space... (Expect 60 secs delay for server
      to continue after freeing disk space)
      
      Message second part:
      Retry in 60 secs, Message reprinted in 600 secs

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67433

2731 He Zhenxing	2009-02-25
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67437

2731 He Zhenxing	2009-02-25
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67945

2731 He Zhenxing	2009-03-02
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67954

2731 He Zhenxing	2009-03-02
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      
      When the disk is full, the server may wait for free space while
      writing the binlog, relay log or MyISAM tables. The server will
      continue after the user has freed some space. But the error
      message printed was not clear about how often the error
      message is printed, or that there is a delay between the user
      freeing space and the server continuing. This caused users
      to think that the server was hanging forever.
      
      This patch fixes the problem by making the error messages
      clearer. The error message is split into two parts:
      the first part is printed only once, and the second part
      is printed once every 10 retries.
      
      Message first part:
      Disk is full writing '<filename>' (Errcode: <errorno>). Waiting
      for someone to free space... (Expect up to 60 secs delay for 
      server to continue after freeing disk space)
      
      Message second part:
      Retry in 60 secs, Message reprinted in 600 secs

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67955

2731 He Zhenxing	2009-03-02
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68456

2766 He Zhenxing	2009-03-06
      BUG#22082 Slave hangs(holds mutex) on "disk full"
      

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68460

2836 He Zhenxing	2009-03-06 [merge]
      Merge BUG#22082 from 5.0-bugteam to 5.1-bugteam

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/68590

3089 He Zhenxing	2009-03-09 [merge]
      Auto merge BUG#22082 from 5.1-bugteam to 6.0-bugteam

pushed to 5.0/5.1/6.0-bugteam

Pushed into 5.0.79 (revid:joro@sun.com-20090309135922-a0di9ebkxoj4d4wv) (version source revid:zhenxing.he@sun.com-20090306093200-4u6mq0jcu8ubcmqf) (merge vers: 5.0.79) (pib:6)

Documented in the 5.0.79 changelog as follows:

        When its disk becomes full, a replication slave waits while
        writing the binary log, relay log or MyISAM tables, continuing
        after space has been made available. The error message provided
        in such cases was not clear about the frequency with which
        checking for free space is done (once every 60 seconds), and how
        long the server waits after space has been freed before
        continuing (also 60 seconds); this caused users to think that
        the server had hung.

        These issues have been addressed by making the error message
        clearer, and dividing it into two separate messages:

        1.  The error message Disk is full writing 'filename' (Errcode:
            error_code). Waiting for someone to free space... (Expect up 
            to 60 secs delay for server to continue after freeing disk 
            space) is printed only once.

        2.  The warning Retry in 60 secs, Message reprinted in 600 secs 
            is printed once for every 10 times that the check for 
            free space is made; that is, the check is performed once every 
            60 seconds, but the reminder that space needs to be freed is 
            printed only once every 10 minutes (600 seconds).
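
The timing described in the two items above can be sketched as a small retry loop. This Python version is only an illustration of the documented behaviour (one check every 60 seconds, first message once, reminder once per 10 checks); it is not the server's code. The function name, the `has_space`, `log` and `sleep` parameters and the two constants are assumptions, and the message text omits the Errcode part for brevity.

```python
import time

RETRY_INTERVAL_SECS = 60      # one check for free space every 60 seconds
REPRINT_EVERY_N_CHECKS = 10   # reminder once per 10 checks (600 seconds)


def wait_for_free_space(filename, has_space, log=print, sleep=time.sleep):
    """Block until has_space() returns True; return the number of sleeps.

    Hypothetical sketch: `has_space`, `log` and `sleep` are injected so
    the loop can be exercised without a full disk or real delays.
    """
    checks = 0
    while not has_space():
        if checks == 0:
            # First part: printed only once.
            log(f"Disk is full writing '{filename}'. "
                "Waiting for someone to free space... "
                f"(Expect up to {RETRY_INTERVAL_SECS} secs delay for "
                "server to continue after freeing disk space)")
        if checks % REPRINT_EVERY_N_CHECKS == 0:
            # Second part: printed on the 1st, 11th, 21st, ... check.
            log(f"Retry in {RETRY_INTERVAL_SECS} secs, Message reprinted in "
                f"{RETRY_INTERVAL_SECS * REPRINT_EVERY_N_CHECKS} secs")
        sleep(RETRY_INTERVAL_SECS)
        checks += 1
    return checks
```

Because space is only re-checked after each sleep, the writer can take up to one full interval to notice that space has been freed, which matches the "up to 60 secs delay" wording in the first message.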

Set status to NDI pending merges to 5.1 and 6.0 trees.

Pushed into 5.1.33 (revid:joro@sun.com-20090313111355-7bsi1hgkvrg8pdds) (version source revid:zhou.li@sun.com-20090311061050-ihp0g77znonq1tuq) (merge vers: 5.1.33) (pib:6)

Fix also noted in the 5.1.33 changelog; set back to NDI status pending merge to 6.0 tree.

Pushed into 6.0.11-alpha (revid:joro@sun.com-20090318122208-1b5kvg6zeb4hxwp9) (version source revid:matthias.leich@sun.com-20090310140952-gwtoq87wykhji3zi) (merge vers: 6.0.11-alpha) (pib:6)

Fix also documented in 6.0.11 changelog; closed.

Pushed into 5.1.34-ndb-6.2.18 (revid:jonas@mysql.com-20090508185236-p9b3as7qyauybefl) (version source revid:jonas@mysql.com-20090508100057-30ote4xggi4nq14v) (merge vers: 5.1.33-ndb-6.2.18) (pib:6)

Pushed into 5.1.34-ndb-6.3.25 (revid:jonas@mysql.com-20090509063138-1u3q3v09wnn2txyt) (version source revid:jonas@mysql.com-20090508175813-s6yele2z3oh6o99z) (merge vers: 5.1.33-ndb-6.3.25) (pib:6)

Pushed into 5.1.34-ndb-7.0.6 (revid:jonas@mysql.com-20090509154927-im9a7g846c6u1hzc) (version source revid:jonas@mysql.com-20090509073226-09bljakh9eppogec) (merge vers: 5.1.33-ndb-7.0.6) (pib:6)

Added the following entry to the MySQL 5.5.50, 5.6.31, 5.7.13, and 5.8.0 changelogs:

"When the disk for saving the relay log was full, clients might get no responses for SHOW SLAVE STATUS statements, making it difficult to access status of the slave server. It was caused by the slave's IO thread holding some locks while waiting for disk space, so that the SQL thread was blocked. With this fix, the IO thread no longer holds the locks while waiting for disk space, so results can be returned for SHOW SLAVE STATUS statements."

Please ignore the last remark by Daniel So--it was inserted by mistake. Sorry for that!