Bug #94912 | O_DIRECT_NO_FSYNC possible write hole | | |
---|---|---|---|
Submitted: | 5 Apr 2019 2:14 | Modified: | 6 Apr 2019 21:30 |
Reporter: | Janet Campbell | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S3 (Non-critical) |
Version: | 8.0.14 and above | OS: | Any |
Assigned to: | | CPU Architecture: | Any |
[5 Apr 2019 2:14]
Janet Campbell
[5 Apr 2019 5:44]
Janet Campbell
Sorry, I meant MySQL rather than Mongo - it's been a long day! Anyway, the short version is that O_DIRECT always needs to be followed by: fsync(), a sync write, or a cache flush, somewhere on the same disk. Otherwise data never has to be committed to the platter even though the write is successful.
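A minimal sketch of the pattern described in this comment, not taken from the thread: write with O_DIRECT, then call fsync() so the kernel asks the device to flush its write cache. The function name `durable_direct_write` and the 4096-byte alignment are assumptions for illustration; real devices may need a different alignment, and some file systems reject O_DIRECT, in which case the sketch falls back to a buffered write.

```python
import mmap
import os

BLOCK = 4096  # assumed alignment; the real requirement depends on the device


def durable_direct_write(path: str, payload: bytes) -> int:
    """Write payload (padded to BLOCK) and fsync it. Returns bytes written."""
    # O_DIRECT needs an aligned buffer and length; an anonymous mmap is
    # page-aligned, so it is used here as the I/O buffer.
    size = max(BLOCK, -(-len(payload) // BLOCK) * BLOCK)
    base = os.O_WRONLY | os.O_CREAT | os.O_TRUNC
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform

    with mmap.mmap(-1, size) as buf:
        buf[: len(payload)] = payload
        # Try O_DIRECT first, then plain buffered I/O if it is rejected.
        for flags in (base | direct, base):
            try:
                fd = os.open(path, flags, 0o644)
            except OSError:
                continue
            try:
                written = os.write(fd, buf)
                # O_DIRECT bypasses the page cache, not the device cache.
                # The fsync() is the step that requests a cache flush.
                os.fsync(fd)
                return written
            except OSError:
                continue
            finally:
                os.close(fd)
    raise OSError("write failed with and without O_DIRECT")
```

Dropping the `os.fsync(fd)` line gives the behaviour this report is about: the write returns successfully, but nothing has asked the device to make it durable.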
[5 Apr 2019 13:13]
MySQL Verification Team
Hi, Thank you for your bug report. I think that your analysis is quite correct, for which reason I am accepting your report as a valid one. Verified as reported.
[5 Apr 2019 16:10]
Sunny Bains
I don't think there is any confusion around how O_DIRECT works. We understand that an fsync is required. This option was a feature request from Facebook and it is not suitable for all FS and all setups as you describe.

1. http://mysqlha.blogspot.com/2013/03/mysql-56-no-odirectnofsync-for-you.html
2. Different file systems have different behavior w.r.t. their metadata and fsync().
3. What a disk does to a sync request is anybody's guess (manufacturers have been known to lie).

I think we should clarify this in the documentation.
[6 Apr 2019 20:31]
Mike Griffin
It seems to me that innodb_dedicated_server should then imply O_DIRECT.
[6 Apr 2019 21:30]
Janet Campbell
> I don't think there is any confusion around how O_DIRECT works. We understand that an fsync is required.

I had some concerns about that after reading the source material for when O_DIRECT_NO_FSYNC was added:

https://bugs.mysql.com/bug.php?id=45892
> "InnoDB calls fsync after writes to the datafile when innodb_flush_method=O_DIRECT. ***These are not needed.***" (emphasis mine)

http://mysqlha.blogspot.com/2009/06/buffered-versus-direct-io-for-innodb.html
> "Data files are opened with O_DIRECT when innodb_flush_method is set to O_DIRECT. ***fsync is still used in this case, but it doesn't need to be.***"

https://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-7.html
> "Performance; InnoDB: A new setting O_DIRECT_NO_FSYNC was added to the innodb_flush_method configuration option. This setting is similar to O_DIRECT, but omits the subsequent fsync() call. ***Suitable for some filesystems but not others.*** (Bug #11754304, Bug #45892)"

https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-14.html
> "The fsync() system call is still skipped after each write operation. ***With the changes described above, O_DIRECT_NO_FSYNC mode can now be safely used on EXT4 and XFS file systems. (Bug #27309336)***"

------------------

So, I put it to you: do you agree that there is a potential for data loss with this option on any filesystem operating on storage with cache, no matter how faithfully it obeys cache flushes? How flushing the fs metadata can protect the filesystem, but without a flush being sent, you have no guarantee that the actual O_DIRECT data writes have been made durable?

If this is the case and you understand that the writes are not durable, do you think "can now be safely used" in the 8.0.14 release notes is correct to say about an option that could lose transactions?

"3. What a disk does to a sync request is anybody's guess (manufacturers have been known to lie)."

Yes but you're *not sending a sync request* to the disk at all. You know that the write will not be committed as soon as it reaches the device and you're choosing to plow on ahead and hope it gets committed rather than send a flush to ensure that it does.

If the scope of this option is "performance enhancement, you may lose data in a crash", shouldn't that be documented somewhere? Anywhere? And maybe not become a commonly recommended option except with a warning?

And, well, here's a thought - a device cache flush is more lightweight than an fsync(), in general. Why not an option like O_DIRECT_FLUSH that just sent a cache flush at the end rather than an fsync? It would be higher performing than your O_DIRECT option, and unlike O_DIRECT_NO_FSYNC it would be safe.

I've got no stake in this, I'm just a researcher in somewhat exotic storage systems, and when I see unsafe behavior I try to let people know.

Thanks,
-Janet
[7 Apr 2019 0:08]
Sunny Bains
One minor point, InnoDB's WAL (redo) is not a logical transaction log. In InnoDB the undo log is the logical transaction log and UNDO log files are treated as data files.

1. WAL files always use buffered IO on Linux. Therefore we will never advance the checkpoint or flush LSN without first forcing an fsync(). This is the reason we don't fdatasync() on the WAL files. This covers the case when redo and data files are on the same file system.
2. I agree that this will cause problems where the redo and data are on separate devices.
3. File systems do behave differently, even between releases too. I'm not aware of any canonical document for XFS and EXT4 (the ones we deal with the most). e.g., These too come with a disclaimer https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO%27s_Semantics (see "Allocating writes" section).
4. "Why not an option like O_DIRECT_FLUSH that just sent a cache flush at the end rather than an fsync?" - This is something we need to look at, sounds like a step forward.
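The invariant in point 1 can be pictured with a toy model. This is not InnoDB code; the class `ToyRedoLog` and its attributes are hypothetical, and LSNs are simplified to byte counts. Records are appended with buffered writes, and the flushed LSN only advances after fsync() has returned.

```python
import os


class ToyRedoLog:
    """Toy write-ahead log: flushed_lsn never moves ahead of an fsync()."""

    def __init__(self, path: str) -> None:
        # Buffered (non-O_DIRECT) append-only file, as a stand-in for a redo log.
        self._fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o644)
        self.written_lsn = 0  # bytes handed to the OS
        self.flushed_lsn = 0  # bytes known to be covered by an fsync()

    def append(self, record: bytes) -> int:
        # The write lands in the page cache; it is not durable yet.
        self.written_lsn += os.write(self._fd, record)
        return self.written_lsn

    def flush(self) -> int:
        # Capture the target first, fsync, and only then advance flushed_lsn.
        target = self.written_lsn
        os.fsync(self._fd)
        self.flushed_lsn = target
        return self.flushed_lsn

    def close(self) -> None:
        os.close(self._fd)
```

In this model a checkpoint would be allowed to advance only up to `flushed_lsn`. The argument in the thread is that such an fsync() also triggers a device cache flush, which helps data files only when they share a device with the log.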
[7 Apr 2019 2:44]
Sunny Bains
For #2 I meant same device not same file system.
[7 Apr 2019 3:04]
Sunny Bains
I would like to thank you for looking at our code and suggesting a good solution to the problem. This was not something we had considered yet.
[7 Apr 2019 6:40]
Sunny Bains
One problem with using BLKFLSBUF is the requirement of CAP_SYS_ADMIN. The second problem is that it also flushes the OS page cache. Since InnoDB uses buffered IO for the WAL this will be problematic. What we will investigate is what action to take if any data file is on a separate device from the WAL and O_DIRECT_NO_FSYNC is set. I'm guessing we will probably do an fsync for such files.
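The check proposed here could look roughly like the sketch below. It is only an illustration, not InnoDB's implementation: the function name `needs_fsync` is hypothetical, it only distinguishes O_DIRECT_NO_FSYNC from everything else, and it uses `st_dev` as a stand-in for "same device". That stand-in is imperfect: two partitions on one disk report different `st_dev` values (costing an extra fsync), while a volume spanning several disks reports a single value.

```python
import os


def needs_fsync(flush_method: str, data_path: str, redo_path: str) -> bool:
    """Decide whether a data file write should be followed by fsync()."""
    if flush_method != "O_DIRECT_NO_FSYNC":
        # In this sketch every other method keeps its fsync().
        return True
    # Assumption: st_dev approximates the device holding each file.
    # If the data file is not on the redo log's device, the redo log's
    # fsync() cannot flush the data file's device cache, so fsync anyway.
    return os.stat(data_path).st_dev != os.stat(redo_path).st_dev
```

For example, `needs_fsync("O_DIRECT_NO_FSYNC", data_file, redo_file)` returns `False` only when both paths report the same `st_dev`.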
[8 Apr 2019 12:54]
MySQL Verification Team
Sunny, Thank you very much for your comments and answers. Janet, Thank you for your contribution.
[15 Apr 2019 16:24]
Daniel Price
Posted by developer: The O_DIRECT_NO_FSYNC documentation was revised. The following information was added: "On storage devices with cache, data loss is possible if data files and redo log files reside on different storage devices, and a crash occurs before data file writes are flushed from the device cache. If you use or intend to use different storage devices for redo logs and data files, use O_DIRECT instead." The changelog entry for Bug #27309336 was also revised in the 8.0.14 release notes. Changes should appear online soon. Thank you for the bug report.