MySQL Bugs: #49202: Data loss and bogus errors when restoring from an incomplete BACKUP file

Bug #49202	Data loss and bogus errors when restoring from an incomplete BACKUP file
Submitted:	30 Nov 2009 11:08	Modified:	7 Jan 2010 1:03
Reporter:	Philip Stoev
Status:	Closed
Category:	MySQL Server: Backup	Severity:	S1 (Critical)
Version:	6.0-backup	OS:	Any
Assigned to:	Paul DuBois	CPU Architecture:	Any

Description:
When a BACKUP operation is KILL-ed, a backup file is still produced and remains on disk. If the file is small (that is, the header was only partially written), RESTORE readily detects that the file is corrupt. However, if the file is larger (that is, it contains some data), RESTORE fails with various error messages, such as:

ERROR 1680 (HY000): Can't shut down MyISAM restore driver(s)

ERROR 1671 (HY000): Error when reading metadata

ERROR 1655 (HY000): Can't read backup location '/tmp/gentest26535.tmp'

Got error 176 when reading from logfile

So, we have the following issues:

* BACKUP leaves orphan backup files on the filesystem;

* There are too many error messages for the same underlying issue. I am concerned that any integrity checks for the backup are not operating properly.

* Restore is a destructive operation. After the "Can't shut down MyISAM restore driver(s)" message, the database is in some undefined state, since the stuff that existed before the RESTORE was wiped out and at the same time RESTORE failed.

How to repeat:
Try to restore from the attached backup file

Suggested fix:
* remove any partial backup files, e.g. those caused by a killed backup, or by a backup that ended in error. No backup is better than a corrupted backup.

* make sure that all instances of incomplete or corrupted backups are detected via a robust mechanism, rather than relying on driver and/or OS errors to detect the corruption.
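To illustrate what a "robust mechanism" could look like, here is a minimal sketch in which the writer appends a fixed-size trailer (marker, payload length, checksum) as the very last step of a successful backup, and the reader checks that trailer before touching any existing data. Everything in it is an assumption made for illustration: the trailer layout, the names kMagic, seal_image and image_is_complete, and the FNV-1a checksum are hypothetical and are not the actual MySQL backup image format.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>

// Hypothetical trailer layout, NOT the real MySQL backup image format:
// 8-byte marker, 8-byte little-endian payload length, 4-byte checksum.
static const char kMagic[8] = {'B', 'K', 'P', '_', 'D', 'O', 'N', 'E'};
static const std::size_t kTrailerSize = 8 + 8 + 4;

// FNV-1a 32-bit checksum, used here only to keep the sketch short.
// A real image format would more likely use CRC32 or a stronger hash.
std::uint32_t fnv1a(const std::string &data)
{
  std::uint32_t h = 2166136261u;
  for (unsigned char c : data)
  {
    h ^= c;
    h *= 16777619u;
  }
  return h;
}

// Append the trailer; meant to be the last write of a successful backup,
// so a killed backup never carries a valid trailer.
std::string seal_image(const std::string &payload)
{
  std::string image = payload;
  image.append(kMagic, sizeof(kMagic));
  std::uint64_t len = payload.size();
  for (int i = 0; i < 8; ++i)
    image.push_back(static_cast<char>((len >> (8 * i)) & 0xFF));
  std::uint32_t sum = fnv1a(payload);
  for (int i = 0; i < 4; ++i)
    image.push_back(static_cast<char>((sum >> (8 * i)) & 0xFF));
  return image;
}

// Return true only if the image ends with a complete trailer whose
// length and checksum match the payload that precedes it.
bool image_is_complete(const std::string &image)
{
  if (image.size() < kTrailerSize)
    return false;
  std::size_t payload_len = image.size() - kTrailerSize;
  const unsigned char *t =
      reinterpret_cast<const unsigned char *>(image.data()) + payload_len;
  if (std::memcmp(t, kMagic, sizeof(kMagic)) != 0)
    return false;
  std::uint64_t len = 0;
  for (int i = 0; i < 8; ++i)
    len |= static_cast<std::uint64_t>(t[8 + i]) << (8 * i);
  std::uint32_t sum = 0;
  for (int i = 0; i < 4; ++i)
    sum |= static_cast<std::uint32_t>(t[16 + i]) << (8 * i);
  if (len != payload_len)
    return false;
  return sum == fnv1a(image.substr(0, payload_len));
}
```

With a check of this kind, RESTORE could call something like image_is_complete() before dropping any existing database and report a single "incomplete or corrupted image" error, instead of failing later with driver or OS errors.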

backup file to restore from

http://mysql-systemqa.s3.amazonaws.com/bug49202.backup.zip

Also note that the error log says:

Got error 176 when reading from logfile
091130 14:06:39 [ERROR] Restore: Can't shut down MyISAM restore driver(s)
091130 14:06:39 [Warning] Restore: Operation aborted - data might be corrupted

In my humble opinion "data might be corrupted" is not a warning that should be dumped in the error log, it must be an error message sent straight to the user.

Part of this bug report is similar to BUG#36931 (Data integrity verification of Backup file not possible). The MySQL Backup feature leaves the backup operation in an unreliable state, as RESTORE is a destructive operation.

I ran a test that shows an issue similar to the one reported in this bug. Restore failed because of a full disk, and the error message I got was:
ERROR 1699 (HY000): Error when reading summary section of backup image

There are 2 things to be noted here:
1. Restore failed and, as a result, all database contents were lost from the server.
2. The error message does not give enough information to understand why restore is failing.

It is essential that, when restore fails, it provides appropriate error messages to the user.

Here is what the mysql.backup_history table shows for a killed backup:

mysql> select * from backup_history\G
*************************** 1. row ***************************
          backup_id: 276
         process_id: 0
   binlog_start_pos: 0
        binlog_file:
       backup_state: error
          operation: backup
          error_num: 0
        num_objects: 2
        total_bytes: 3615
validity_point_time: 0000-00-00 00:00:00
         start_time: 2009-12-02 09:21:56
          stop_time: 2009-12-02 09:22:30
host_or_server_name: localhost
           username: root
        backup_file: backup
   backup_file_path: /tmp/
       user_comment:
            command: backup database test to '/tmp/backup'
            drivers: MyISAM
1 row in set (0.00 sec)

Even though the backup_state is "error", the error_num is zero, which is misleading.

Thinking about the issue of an interrupted BACKUP leaving "orphan" backup files (which should not happen), here is one hypothesis (rather far-fetched).

The code which removes unfinished backup images is present in the Backup_restore_ctx::close() method (kernel.cc:1307):

    if (!m_completed && m_state == PREPARED_FOR_BACKUP)
    {
      int ret= m_stream->remove(); // Reports errors.
      if (ret != BSTREAM_OK)
        fatal_error(ER_CANT_DELETE_FILE);
    }
    else
    {
      int ret= m_stream->close();  // Reports errors.

      if (ret != BSTREAM_OK)
        fatal_error(ER_BACKUP_CLOSE);
    }

m_stream->remove() should remove the file. Member m_state is set to PREPARED_FOR_BACKUP in Backup_restore_ctx::prepare_for_backup(), just after the output stream is opened. Member m_completed is FALSE until explicitly set to TRUE at the end of Backup_restore_ctx::do_backup(), when the complete image has been written.

Thus the only way I can see for an unfinished file to be left on disk is that m_stream->close() fails but the file stays on disk. This could be fixed with the following change to the above fragment:

    if (!m_completed && m_state == PREPARED_FOR_BACKUP)
    {
      int ret= m_stream->remove(); // Reports errors.
      if (ret != BSTREAM_OK)
        fatal_error(ER_CANT_DELETE_FILE);
    }
    else
    {
      int ret= m_stream->close();  // Reports errors.

      if (ret != BSTREAM_OK)
      {
        fatal_error(ER_BACKUP_CLOSE);
        m_stream->remove();	   // Ignore errors from remove().
      }
    }

Note: perhaps Stream::remove() must be updated so that it can be called on a stream which is in an error state.

Right now, I don't know how to verify if this change fixes anything...
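One way to exercise the proposed control flow in isolation might be a small stand-alone harness with a fake stream whose close() can be forced to fail. The sketch below is not the server code: FakeStream, close_context and the BSTREAM_ERROR value are hypothetical stand-ins, and errors are collected in a list instead of going through fatal_error(). It mirrors only the branching of the fragment above, so it says nothing about whether the real Stream::remove() works on a stream in an error state.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-ins; the real definitions live in the backup kernel.
enum { BSTREAM_OK = 0, BSTREAM_ERROR = 1 };

struct FakeStream
{
  bool close_fails = false;        // force close() to report an error
  bool file_on_disk = true;        // models the image file existing on disk
  std::vector<std::string> calls;  // records which methods were called

  int close()
  {
    calls.push_back("close");
    return close_fails ? BSTREAM_ERROR : BSTREAM_OK;
  }

  int remove()
  {
    calls.push_back("remove");
    file_on_disk = false;
    return BSTREAM_OK;
  }
};

// Mirrors the branching of the proposed fragment. Instead of calling
// fatal_error(), the error names are returned to the caller.
std::vector<std::string> close_context(FakeStream &s, bool completed,
                                       bool prepared_for_backup)
{
  std::vector<std::string> errors;
  if (!completed && prepared_for_backup)
  {
    if (s.remove() != BSTREAM_OK)
      errors.push_back("ER_CANT_DELETE_FILE");
  }
  else
  {
    if (s.close() != BSTREAM_OK)
    {
      errors.push_back("ER_BACKUP_CLOSE");
      s.remove();  // Ignore errors from remove().
    }
  }
  return errors;
}
```

Setting close_fails to true and calling close_context(s, true, true) takes the else branch, records ER_BACKUP_CLOSE, and leaves file_on_disk false, which is the behaviour the proposed change is meant to add.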

Note: Issue B has already been reported as BUG#34767. In the discussion it was decided that it should be fixed in a generic way. Related WLs are WL#4385 and WL#5167.

Per Philip, he also cannot repeat this bug now.
But there is still a rare possibility (e.g. if someone pulls the plug on the hardware) of an incomplete backup file being left behind.
After some discussion, the documentation team proposed the following limitations:
For BACKUP DATABASE:

If the operation fails, it returns an error. Any file created by the operation normally is removed. It is possible in rare cases that the incomplete image file will not be removed, in which case it should be removed manually. Using such an image file for RESTORE may render recovered databases unusable.

For RECOVER/RESTORE:

Be sure that the image file was created from a successful BACKUP DATABASE operation and has not been tampered with or modified. A RESTORE using a compromised image file may render recovered databases unusable.

Changing status to Documenting, since a documentation change is requested per the previous comment.

Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.