Bug #53717 Improve docs for lack of crash safety
Submitted: 17 May 2010 19:14 Modified: 21 May 2010 14:49
Reporter: Mark Callaghan
Status: Closed
Category: MySQL Server: Documentation Severity: S3 (Non-critical)
Version: 5.1 OS: Any
Assigned to: Paul DuBois CPU Architecture: Any
Tags: crash, replication, safety, slave

[17 May 2010 19:14] Mark Callaghan
Description:
I found one sentence in the online docs that describes the lack of crash safety.

It is at the end of http://dev.mysql.com/doc/refman/5.1/en/replication-features-shutdowns.html. The sentence is very easy to miss, and the warning is too vague. There are two problems: 1) a server reboot that loses state because the OS buffer cache was not forced to disk, and 2) a crash between the commit and the call to flush_relay_log_info.

"""
Unclean shutdowns might produce problems, especially if the disk cache was not flushed to disk before the system went down. Your system fault tolerance is greatly increased if you have a good uninterruptible power supply. 
"""

How to repeat:
Crash the slave after the commit and before the write to relay-log.info.

Suggested fix:
Describe the slave processing (commit to storage engine, write relay-log.info, optionally sync relay-log.info if 5.1 options are used). 

Describe the failures that can happen with this setup.
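
To make the failure concrete, here is a minimal toy model of the sequence described above (commit first, then record the position). It is only a sketch: ToySlave, SimulatedCrash, and the integer "transactions" are hypothetical names invented for illustration and do not correspond to actual server code. The saved_position field stands in for relay-log.info.

```python
class SimulatedCrash(Exception):
    """Raised to model the slave process dying mid-apply."""


class ToySlave:
    def __init__(self):
        self.table = {"counter": 0}   # stands in for storage-engine data
        self.saved_position = 0       # stands in for relay-log.info

    def apply_from(self, relay_log, crash_after_commit_of=None):
        # Resume from the position recorded in the (toy) info file.
        for index in range(self.saved_position, len(relay_log)):
            self.table["counter"] += relay_log[index]   # step 1: commit
            if index == crash_after_commit_of:
                raise SimulatedCrash(index)              # dies before step 2
            self.saved_position = index + 1              # step 2: record position


relay_log = [1, 1, 1]   # three transactions, each adds 1 to the counter

slave = ToySlave()
try:
    slave.apply_from(relay_log, crash_after_commit_of=1)
except SimulatedCrash:
    pass

# The second transaction is committed, but the recorded position still points at it.
after_crash = (slave.table["counter"], slave.saved_position)

slave.apply_from(relay_log)   # restart: the second transaction is applied again
after_restart = (slave.table["counter"], slave.saved_position)

print(after_crash)    # (2, 1)
print(after_restart)  # (4, 3)
```

A slave that had not crashed would end with a counter of 3. The toy slave ends with 4 because the transaction committed just before the crash is applied a second time after restart.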
[17 May 2010 21:16] MySQL Verification Team
Thank you for the bug report.
[21 May 2010 14:49] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly, and will be included in the next release of the relevant products.

Updated text at http://dev.mysql.com/doc/refman/5.5/en/replication-features-shutdowns.html:

"
Shutting down a slave cleanly is safe because it keeps track of where it left off. However, be careful that the slave does not have temporary tables open; see Section 16.4.1.19, “Replication and Temporary Tables”. Unclean shutdowns might produce problems, especially if the disk cache was not flushed to disk before the problem occurred:

For transactions, the slave commits and then updates relay-log.info. If a crash occurs between these two operations, relay log processing will have proceeded further than the information file indicates and the slave will re-execute the events from the last transaction in the relay log after it has been restarted.

A similar problem can occur if the slave updates relay-log.info but the server host crashes before the write has been flushed to disk. To minimize the chance of this occurring, set sync_relay_log_info=1 in the slave my.cnf file. The default value of sync_relay_log_info is 0, which does not cause writes to be forced to disk; the server relies on the operating system to flush the file from time to time.

The fault tolerance of your system for these types of problems is greatly increased if you have a good uninterruptible power supply.
"
(The 5.1 and earlier manuals are similar but do not mention sync_relay_log_info.)
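
For reference, the setting recommended in the updated text might look like this in the slave's my.cnf. This is a sketch, assuming the option goes in the usual [mysqld] group and that the server version is one whose manual documents sync_relay_log_info.

```ini
[mysqld]
# Sync relay-log.info to disk after every transaction
# rather than leaving the flush to the operating system.
sync_relay_log_info = 1
```

As the updated text says, this only minimizes the chance of the second problem. A crash between the commit and the relay-log.info update can still cause the last transaction to be re-executed.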