Bug #94912 O_DIRECT_NO_FSYNC possible write hole
Submitted: 5 Apr 2019 2:14
Modified: 6 Apr 2019 21:30
Reporter: Janet Campbell
Status: Verified
Category: MySQL Server: InnoDB storage engine
Severity: S3 (Non-critical)
Version: 8.0.14 and above
OS: Any
Assigned to:
CPU Architecture: Any

[5 Apr 2019 2:14] Janet Campbell
Description:
This is most concerning in 8.0.14 and above, where O_DIRECT_NO_FSYNC is a recommended setting for many users and is set if innodb_dedicated_server is enabled.

At several places in the innodb code there are signs of a fundamental misunderstanding about O_DIRECT:

storage/innobase/include/srv0srv.h:  SRV_UNIX_O_DIRECT,   /*!< invoke os_file_set_nocache() on data files. This implies using non-buffered IO but still using fsync, the reason for which is that some FS do not flush meta-data when unbuffered IO happens */

storage/innobase/fil/fil0fil.cc:  /* Skip flushing if the file size has not changed since last flush was done and the flush mode is O_DIRECT_NO_FSYNC */

O_DIRECT does not and has never guaranteed write durability, and the reason for needing fsync() with it has nothing to do with fs metadata.  It returns when writes are flushed to the device, but not to the platters.  A final O_FUA write or BLKFLSBUF (or | O_SYNC/O_DSYNC) is needed to ensure that the writes are durable.  This introduces a possible write hole:

1. Transaction logged to WAL

2. Checkpoint runs and sends O_DIRECT writes to datafiles, which stay in device cache

3. LSN is advanced in WAL and it is fsync()ed, marking the previous transaction as committed

4. System crashes.  The WAL indicates that the data was already flushed, yet nothing made the data writes commit durably.  Recovery cannot proceed.

This likely is not an issue if the WAL and the datafiles share one disk, as fsync() in recent Linux versions generally forces a flush by ensuring at least one dirty, O_SYNC block will be written, causing a commit of device cache.
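
The four-step sequence above can be modelled with a toy simulation. This is only a sketch under stated assumptions: `CachedDevice` and `run_checkpoint` are hypothetical names, the "device" is a pair of dictionaries rather than a real disk, and nothing here is InnoDB code. It only shows why the ordering matters when the WAL and the data files sit on separate devices.

```python
class CachedDevice:
    """Toy model of a disk with a volatile write cache (not a real driver)."""

    def __init__(self):
        self.cache = {}  # writes acknowledged by the device but still volatile
        self.media = {}  # writes that survive a power loss

    def write(self, key, value):
        # Models an O_DIRECT write: it returns once the device accepts the data.
        self.cache[key] = value

    def flush(self):
        # Models a device cache flush (e.g. the one an fsync() can trigger).
        self.media.update(self.cache)
        self.cache.clear()

    def crash(self):
        # Power loss: whatever is still in the cache is gone.
        self.cache.clear()


def run_checkpoint(flush_data_device):
    log_dev, data_dev = CachedDevice(), CachedDevice()

    # 1. Transaction logged to the WAL.
    log_dev.write("redo:100", "page 7 := 'new'")
    log_dev.flush()

    # 2. Checkpoint sends the page to the data file.
    data_dev.write("page 7", "new")
    if flush_data_device:
        data_dev.flush()

    # 3. The checkpoint LSN is advanced in the WAL and the WAL is fsync()ed.
    log_dev.write("checkpoint_lsn", 100)
    log_dev.flush()

    # 4. System crashes.
    log_dev.crash()
    data_dev.crash()

    return log_dev.media.get("checkpoint_lsn"), data_dev.media.get("page 7")


print(run_checkpoint(flush_data_device=False))  # (100, None)
print(run_checkpoint(flush_data_device=True))   # (100, 'new')
```

In the first run the log claims the checkpoint reached LSN 100 while the page it covers never reached the media, which is the write hole described above.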

How to repeat:
Place WAL and datafiles on separate filesystems on separate disks.  blktrace the datafile device.  With O_DIRECT_NO_FSYNC, MongoDB will send writes for the datafiles during checkpoint but will do nothing to flush them durably.  With journaling filesystems, you may have to place the journal on a separate device to see this clearly.

Suggested fix:
Do a final fsync() or BLKFLSBUF or anything at all that creates a cache flush on the datafile device, *before* the LSN is incremented in the WAL.  It's not necessary to flush out more data, only to cause the device to commit what has already been sent.
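
A minimal sketch of that ordering, using plain file descriptors. The assumptions are significant: `checkpoint`, `read_checkpoint_lsn`, `PAGE_SIZE` and the file names are hypothetical, the files are opened without O_DIRECT (which would need aligned buffers), and whether `os.fsync()` results in a device cache flush depends on the kernel, filesystem and mount options. The point is only the order of operations: data writes, then a flush on the data file, then the LSN update.

```python
import os
import struct
import tempfile

PAGE_SIZE = 4096  # hypothetical page size for this sketch


def checkpoint(data_fd, log_fd, pages, lsn):
    """Write dirty pages, flush them, and only then record the LSN."""
    for page_no, payload in pages.items():
        os.pwrite(data_fd, payload.ljust(PAGE_SIZE, b"\0"), page_no * PAGE_SIZE)

    # Flush the data file before the log is allowed to claim the pages are safe.
    os.fsync(data_fd)

    # Only now record the checkpoint LSN and make the log itself durable.
    os.pwrite(log_fd, struct.pack("<Q", lsn), 0)
    os.fsync(log_fd)


def read_checkpoint_lsn(log_fd):
    return struct.unpack("<Q", os.pread(log_fd, 8, 0))[0]


with tempfile.TemporaryDirectory() as tmp:
    data_fd = os.open(os.path.join(tmp, "data.ibd"), os.O_RDWR | os.O_CREAT, 0o600)
    log_fd = os.open(os.path.join(tmp, "redo.log"), os.O_RDWR | os.O_CREAT, 0o600)
    try:
        checkpoint(data_fd, log_fd, {0: b"header", 2: b"row data"}, lsn=100)
        recorded_lsn = read_checkpoint_lsn(log_fd)
        page_two = os.pread(data_fd, 8, 2 * PAGE_SIZE)
    finally:
        os.close(data_fd)
        os.close(log_fd)
```

Swapping the two halves of `checkpoint` (log first, data flush second, or no data flush at all) reproduces the window described in the report.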
[5 Apr 2019 5:44] Janet Campbell
Sorry, I meant MySQL rather than Mongo - it's been a long day!

Anyway, the short version is that O_DIRECT always needs to be followed by: fsync(), a sync write, or a cache flush, somewhere on the same disk.  Otherwise data never has to be committed to the platter even though the write is successful.
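
For the "sync write" alternative, a sketch of opening the file with O_DSYNC so that each write returns only after the kernel reports the data as stable. `durable_write` is a hypothetical helper, O_DIRECT is left out to avoid alignment requirements, and `os.O_DSYNC` is assumed to be available (it is on Linux). How the flag translates into device commands is up to the kernel and filesystem.

```python
import os
import tempfile


def durable_write(path, payload, offset=0):
    """Write through a descriptor opened with O_DSYNC; returns bytes written."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_DSYNC, 0o600)
    try:
        return os.pwrite(fd, payload, offset)
    finally:
        os.close(fd)


with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "page.bin")
    written = durable_write(target, b"x" * 512)
    size_on_disk = os.path.getsize(target)
```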
[5 Apr 2019 13:13] MySQL Verification Team
Hi,

Thank you for your bug report.

I think that your analysis is quite correct, for which reason I am accepting your report as a valid one.

Verified as reported.
[5 Apr 2019 16:10] Sunny Bains
I don't think there is any confusion around  how O_DIRECT works. We understand that an fsync is required. This option was a feature request from Facebook and it is not suitable for all FS and all setups as you describe.

1. http://mysqlha.blogspot.com/2013/03/mysql-56-no-odirectnofsync-for-you.html
2. Different file systems have different behavior w.r.t their meta-data and fsync().
3. What a disk does to a sync request is anybody's guess (manufacturers have been known to lie).

I think we should clarify this in the documentation.
[6 Apr 2019 20:31] Mike Griffin
It seems to me that innodb_dedicated_server should then imply O_DIRECT
[6 Apr 2019 21:30] Janet Campbell
> I don't think there is any confusion around  how O_DIRECT works. We understand that an fsync is required.

I had some concerns about that after reading the source material for when O_DIRECT_NO_FSYNC was added:

https://bugs.mysql.com/bug.php?id=45892

> "InnoDB calls fsync after writes to the datafile when innodb_flush_method=O_DIRECT. ***These are not needed.***" (emphasis mine)

http://mysqlha.blogspot.com/2009/06/buffered-versus-direct-io-for-innodb.html

> "Data files are opened with O_DIRECT when innodb_flush_method is set to O_DIRECT. *** fsync is still used in this case, but it doesn't need to be.***"

https://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-7.html

> "Performance; InnoDB: A new setting O_DIRECT_NO_FSYNC was added to the innodb_flush_method configuration option. This setting is similar to O_DIRECT, but omits the subsequent fsync() call. *** Suitable for some filesystems but not others.*** (Bug #11754304, Bug #45892)"

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-14.html

> "The fsync() system call is still skipped after each write operation.
>
> *** With the changes described above, O_DIRECT_NO_FSYNC mode can now be safely used on EXT4 and XFS file systems. (Bug #27309336) ***"

------------------

So, I put it to you: do you agree that there is a potential for data loss with this option on any filesystem operating on storage with cache, no matter how faithfully it obeys cache flushes?  That flushing the fs metadata can protect the filesystem, but that without a flush being sent, you have no guarantee that the actual O_DIRECT data writes have been made durable?

If this is the case and you understand that the writes are not durable, do you think "can now be safely used" in the 8.0.14 release notes is correct to say about an option that could lose transactions?

"3. What a disk does to a sync request is anybody's guess (manufacturers have been known to lie)."

Yes but you're *not sending a sync request* to the disk at all.  You know that the write will not be committed as soon as it reaches the device and you're choosing to plow on ahead and hope it gets committed rather than send a flush to ensure that it does.  If the scope of this option is "performance enhancement, you may lose data in a crash", shouldn't that be documented somewhere?  Anywhere?  And maybe not become a commonly recommended option except with a warning?

And, well, here's a thought - a device cache flush is more lightweight than an fsync(), in general.  Why not an option like O_DIRECT_FLUSH that just sent a cache flush at the end rather than an fsync?  It would be higher performing than your O_DIRECT option, and unlike O_DIRECT_NO_FSYNC it would be safe.

I've got no stake in this, I'm just a researcher in somewhat exotic storage systems, and when I see unsafe behavior I try to let people know.

Thanks,

-Janet
[7 Apr 2019 0:08] Sunny Bains
One minor point, InnoDB's WAL (redo) is not a logical transaction log. In InnoDB the undo log is the logical transaction log and UNDO log files are treated as data files.

1. WAL files  always use buffered IO on Linux. Therefore we will never advance the checkpoint or flush LSN without first forcing an fsync(). This is the reason we don't fdatasync() on the WAL files. This covers the case when redo and data files are on the same file system.

2. I agreed that this will cause problems where the redo and data are on separate devices.

3. File systems do behave differently, even between releases. I'm not aware of any canonical document for XFS and EXT4 (the ones we deal with the most). Even these come with a disclaimer, e.g. https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics (see the "Allocating writes" section).

4. "Why not an option like O_DIRECT_FLUSH that just sent a cache flush at the end rather than an fsync?" - This is something we need to look at, sounds like a step forward.
[7 Apr 2019 2:44] Sunny Bains
For #2 I meant same device not same file system.
[7 Apr 2019 3:04] Sunny Bains
I would like to thank you for looking at our code and suggesting a good solution to the problem. This was not something we had considered yet.
[7 Apr 2019 6:40] Sunny Bains
One problem with using BLKFLSBUF is the requirement of CAP_SYS_ADMIN. The second problem is that it also flushes the OS page cache. Since InnoDB uses buffered IO for the WAL this will be problematic.

What we will investigate is what action to take if any data file is on a separate device from the WAL and O_DIRECT_NO_FSYNC is set. I'm guessing we will probably do an fsync for such files.
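
One way such a check could look, as a rough sketch only: compare the `st_dev` values of a data file and the redo log and fall back to fsync when they differ. `needs_explicit_fsync` and the file names are hypothetical and this is not how InnoDB decides anything. Note that `st_dev` identifies the filesystem's device as seen by the kernel, which is only a proxy for the physical disk; two partitions of one disk report different values, so the check errs on the side of an extra fsync in that case.

```python
import os
import tempfile


def needs_explicit_fsync(data_path, redo_path):
    """Return True when the two files report different st_dev values."""
    return os.stat(data_path).st_dev != os.stat(redo_path).st_dev


with tempfile.TemporaryDirectory() as tmp:
    data_path = os.path.join(tmp, "data.ibd")
    redo_path = os.path.join(tmp, "redo.log")
    for path in (data_path, redo_path):
        with open(path, "wb"):
            pass  # create empty placeholder files
    same_dir_result = needs_explicit_fsync(data_path, redo_path)
```

Two files in the same directory share a device, so `same_dir_result` is `False` here.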
[8 Apr 2019 12:54] MySQL Verification Team
Sunny,

Thank you very much for your comments and answers.

Janet,

Thank you for your contribution.
[15 Apr 2019 16:24] Daniel Price
Posted by developer:
 
The O_DIRECT_NO_FSYNC documentation was revised. The following information was added:

"On storage devices with cache, data loss is possible if data files and
redo log files reside on different storage devices, and a crash occurs
before data file writes are flushed from the device cache. If you use or
intend to use different storage devices for redo logs and data files, use
O_DIRECT instead."

The changelog entry for Bug #27309336 was also revised in the 8.0.14 release notes.

Changes should appear online soon. 

Thank you for the bug report.