Bug #31876 InnoDB commit performance slow on Windows XP
Submitted: 26 Oct 2007 12:09 Modified: 25 Feb 2014 19:17
Reporter: Kristian Koehntopp Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S4 (Feature request)
Version:5.0.46 OS:Windows (Windows XP)
Assigned to: CPU Architecture:Any

[26 Oct 2007 12:09] Kristian Koehntopp
Description:
During a consulting gig, a customer benchmarked MySQL on Windows XP against MS SQL server and IBM DB/2 on the very same hardware (low powered POS system with little memory and single, slow disk). The benchmark is 500 inserts/commit pairs. While DB/2 and MS SQL process the benchmark in 6 seconds, MySQL takes 12 second when ACID compliance is enabled.

How to repeat:
Run the attached 500_inserts_mysql_with_create.txt, observe system behaviour with ProcMon and FileMon from http://www.sysinternals.com.

When using innodb_flush_log_on_trx_commit = 1, we get

94      14:43:22        mysqld-nt.exe:2736      WRITE   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0    SUCCESS Offset: 2258432 Length: 1024
95      14:43:22        mysqld-nt.exe:2736      FLUSH   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0    SUCCESS
96      14:43:22        mysqld-nt.exe:2736      WRITE   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0    SUCCESS Offset: 2258944 Length: 512
97      14:43:22        mysqld-nt.exe:2736      FLUSH   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0    SUCCESS

from FileMon.

When using innodb_flush_log_on_trx_commit = 2, we get

101     14:35:55        mysqld-nt.exe:548       WRITE   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0    SUCCESS Offset: 1638912 Length: 1024
102     14:35:55        mysqld-nt.exe:548       WRITE   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0    SUCCESS Offset: 1639424 Length: 512

from FileMon.

Further investigation using ProcMon finds that the logfile is opened with

Sequence:       12795
Date & Time:    25.10.2007 11:01:33
Event Class:    File System
Operation:      CreateFile
Result: NAME COLLISION
Path:   C:\Programme\MySQL\MySQL Server 5.0\data\ib_logfile0
TID:    2688
Duration:       0.0000131
Desired Access: Generic Read/Write
Disposition:    Create
Options:        No Buffering, Synchronous IO Non-Alert, Non-Directory File
Attributes:     n/a
ShareMode:      Read
AllocationSize: 0

Compare to DB/2:

DB/2 writes to their Redo Log:

408     15:05:10        db2syscs.exe:2020       WRITE   D:\Daten\Filis\DB2log\S0007531.LOG      SUCCESS Offset: 970752 Length: 4096
409     15:05:10        db2syscs.exe:2020       WRITE   D:\Daten\Filis\DB2log\S0007531.LOG      SUCCESS Offset: 970752 Length: 4096

Notice how FLUSH operations are absent.

Further investigation of MS SQL and DB/2 find that they use CreateFile with additional flags FLAG_FILE_WRITE_THROUGH and FLAG_FILE_RANDOM_ACCESS. MS Knowlegebase entry http://support.microsoft.com/kb/332023 suggests that FLAG_FILE_WRITE_THROUGH is necessary and makes the FLUSH calls redundant.

Suggested fix:
Iggy improved on a patch idea of mine and suggests

===== innobase/os/os0file.c 1.121 vs edited =====
--- 1.121/innobase/os/os0file.c 2007-08-15 18:54:19 -04:00
+++ edited/innobase/os/os0file.c        2007-10-25 17:34:01 -04:00
@@ -1218,9 +1218,13 @@ try_again:
                        /* Do not use unbuffered i/o to log files
because
                        value 2 denotes that we do not flush the log at
every
                        commit, but only once per second */
+                       attributes |= (FILE_FLAG_WRITE_THROUGH
+                              | FILE_FLAG_RANDOM_ACCESS);
                } else if (srv_win_file_flush_method ==

SRV_WIN_IO_UNBUFFERED) {
-                       attributes = attributes |
FILE_FLAG_NO_BUFFERING;
+                       attributes |= (FILE_FLAG_NO_BUFFERING
+                              | FILE_FLAG_WRITE_THROUGH
+                              | FILE_FLAG_RANDOM_ACCESS);
                }
 #endif
        } else if (purpose == OS_FILE_NORMAL) {
@@ -1232,7 +1236,9 @@ try_again:
                        commit, but only once per second */
                } else if (srv_win_file_flush_method ==

SRV_WIN_IO_UNBUFFERED) {
-                       attributes = attributes |
FILE_FLAG_NO_BUFFERING;
+                       attributes |= (FILE_FLAG_NO_BUFFERING
+                              | FILE_FLAG_WRITE_THROUGH
+                              | FILE_FLAG_RANDOM_ACCESS);
                }
 #endif
        } else {

I tested this version of and I got this CreateFile event in ProcMon

Sequence:       17424
Date & Time:    10/25/2007 5:18:03 PM
Event Class:    File System
Operation:      CreateFile
Result: SUCCESS
Path:   C:\Program Files\MySQL\MySQL Server 5.0\data\ib_logfile0
TID:    3788
Duration:       0.0001308
Desired Access: Generic Read/Write
Disposition:    Open
Options:        Write Through, Synchronous IO Non-Alert, Non-Directory File,
Random Access
Attributes:     n/a
ShareMode:      Read
AllocationSize: n/a
OpenResult:     Opened

Kent provided a test build with this and I was able to get the benchmark time down to about 2.5 sec from 12 sec on my test system. The HDD LED pattern suggests ongoing individual file writes to the HDD.

Please

- investigate if the combination of this patch + innodb_flush_log_on_trx_commit = 2 is ACID
- if so, push this patch into the current 5.0 release and onward
- apply similar reasoning to the implementation of sync_binlog on Windows (no flush, just open the binlog with FILE_FLAG_WRITE_THROUGH)
[26 Oct 2007 12:11] Kristian Koehntopp
Benchmark: 500 insert/commit pairs

Attachment: 500_inserts_mysql_with_create.txt (text/plain), 118.25 KiB.

[26 Oct 2007 12:12] Kristian Koehntopp
FileMon log at innodb_flush_log_at_trx_commit = 1

Attachment: binlog_off_trx_1.LOG (text/plain), 132.34 KiB.

[26 Oct 2007 12:12] Kristian Koehntopp
FileMon log at innodb_flush_log_at_trx_commit = 2

Attachment: binlog_off_trx_2.LOG (text/plain), 92.50 KiB.

[26 Oct 2007 12:13] Kristian Koehntopp
FileMon log of DB/2 doing the very same benchmark

Attachment: db2_prozesse_only.LOG (text/plain), 187.89 KiB.

[26 Oct 2007 12:14] Kristian Koehntopp
innodb_flush_method = normal/unbuffered differences

Attachment: create_file_calls.txt (text/plain), 1.30 KiB.

[26 Oct 2007 12:17] Kristian Koehntopp
The attached benchmark file is being executed on Windows XP and SuSE Linux Professional 10.2 on the same system (Laptop with 125seek/sec = 8ms access time 7200 rpm HDD).

Benchmark results, Times in seconds
===================================

Unpatched server 5.0.46

a/b = Windows/Linux
      Windows with Defaults           (Write Cache disabled)
      Linux after hdparm -W0 /dev/sda (Write Cache disabled)

    binlog
    off         on
1   14.0/12.0   32.0/21.0
2    4.0/ 0.6    4.0/ 0.9
0    4.0/ 0.6    3.0/ 0.6
^
+- innodb_flush_log_on_trx_commit = ?

After patch: binlog off, trx = 2 -> 2 sec.
[31 Oct 2007 21:35] Roland Volkmann
The referenced MS KB Article tells us:

"... Write Through command, which is only available for SCSI disk devices, is implemented by issuing a WRITE command to the disk with the ForceUnitAccess (FUA) bit set. [...] Support for the FUA bit is optional, and only some SCSI and Fiber Channel (FC) devices, typically individual drives (as opposed to RAID arrays), implement this functionality."

So if this method will be implemented, it must be selectable as an alternative by a parameter, e.g. in my.ini
[1 Nov 2007 16:43] Heikki Tuuri
Kristian,

thank you for this contribution.

Your benchmark on your laptop shows that InnoDB can only do about 40 commits/second with innodb_flush_log_at_trx_commit=1.

I have seen a similar phenomenon on my Windows XP computers.

The question is how we can detect if the computer supports the write through mechanism. A my.cnf option innodb_flush_method=write_through or something might be warranted. Then it is up to the user to ascertain that it really is ACID.

Though, I have not been that worried about the 40 commits/second speed, because anyone who wants a fast commit should be using a battery-backed disk controlled cache with write back. Then we can get 10,000 commits per second. Also in this case, the user has to test that the battaery backing really works. It does not work on many Linux/hardware combinations.

Regards,

Heikki
[2 Nov 2007 8:43] Kristian Koehntopp
Heikki, the customer I was with is running this on POS cash registers, potentially thousands of them. The machines are underpowered, have little memory and quite slow disks - price per machine is the primary optimization goal. The customer was testing with MySQL, MS SQL and IBM DB/2.

What I did was basically analyze the default write behaviour of MS SQL and DB/2, and duplicate that with a patched InnoDB. I did not test correctness of that behaviour any more than with a cursory glance at the disk LED (and it looked good, despite the fact that this was using a ST320014A or ST320015A disk).

With the patch, InnoDB behaves the same as MS SQL and DB/2, write strategy and speed are compareable. Without the patch, using FileFlushBuffers, it may or may not be more correct, but is seven times slower than with the patch and four times slower that the competition.

So I am with your suggestion of innodb_flush_method = write_through, and leave it to the customer to ensure correctness.

This is also what DB/2 did in the past - they had been using FlushFileBuffers, then made this configureable and now it is the default. At least that is what the customers resident DB/2 guru told me.
[2 Nov 2007 13:45] Heikki Tuuri
Kristian,

ok, thank you for the very thorough analysis :).

Assigning the feature request to myself.

Regards,

Heikki
[7 Nov 2007 6:59] Tanja Baumann
Hello!

I'm very happy that you will implement innodb_flush_method = write_through now but from the point of view from the customer I am not happy at all  with "leave it to the customer to ensure correctness"! 
For us this bugfix is very important because we need this better performance. But of course we want to be sure that mysql works correctly with this fix. We dont't have the possibility to ensure correctness. So please investigate a little bit WHY this innodb_flush_method = write_through is bringing better performance although there are NON-SCSI-disks used. There must be an explanation what exactly is happening using the innodb_flush_method = write_through and non-scsi-disks.

Thanks and regards,

Tanja Pritzlaff
[9 Nov 2007 3:51] James Day
"investigate if the combination of this patch + innodb_flush_log_on_trx_commit = 2 is ACID"

It's not inherently ACID. Per Microsoft's KB article: "Do not implement this fix unless your computer and any connected hard disks are protected against accidental or inadvertent power loss through hardware design that may include such features as redundant battery-backed power supplies. ... Do not implement this fix unless you understand and accept the level of risk that may be involved, and are confident that this risk has been mitigated through appropriate hardware power protection"

That is: it fails on durability, by design. Unless you have battery backed up drive caching and have tested that to be sure it works. The option would be fine for a POS terminal that has built in battery backup so it can operate during a power outage, provided the terminal sent low battery warnings so the OS can shut down the database when required. Such a terminal could presumably use innodb_flush_log_at_trx_commit=2, or 0.

Your test was on a notebook with a 7200 RPM hard drive. That means it can do no more than 120 revolutions and writes to the end of a file that is in a specific spot per second. Any number of commits higher than that without commit merging means that the writes are not durable, because they must be getting cached in the drive hardware. Except that your notebook has a battery, and as long as the battery power lasts, that makes it a durable configuration.

Write-through could be faster than flush because write-through would only flush the individual writes, not the whole drive buffer. But it still can't do more writes per second than the number of revolutions per second of the drive allows. So it is worth implementing. Just be aware that your tests are almost certainly with hardware that is not durable unless it has battery backup, because they seem to be exceeding the physical capability of the drive.
[9 Nov 2007 8:57] Guilhem Bichot
Hi James,
I think there was a misunderstanding. Kristian was wondering if his patch (to the InnoDB code files), + innodb_flush_log_at_trx_commit=2, is ACID. The Microsoft KB article explains how a certain fix (which is not Kristian's) is not ACID. That certain fix is turning on the "power protected" setting in Windows, provided that setting is available, which is made possible by installing some software.
As for Kristian's question, in theory: if disk supports it (which is not always true, requires SCSI at least), a write-through write does reach the disk's platters immediately. innodb_flush_log_at_trx_commit=2 means "the log buffer is written out to the file at each commit, but the flush to disk operation is not performed on it"; and a write reaches the platters immediately thanks to write-through.
So, in theory, provided the disk supports write-through and there are no bugs, it looks ACID.
[9 Nov 2007 9:14] Tanja Baumann
Hi,

Yes that is true and the further question is: what is with disks not supporting write_through? Because in our test there were non-scsi-disks but although there were non-scsi-disks they get this better performance expected from the fix. So the question is: is this also acid, if the disks are non-scsi, and although in documentation we can read that only for scsi-disks the writ_through is working?

And an other question: if we would disable the disk-write-cache, would it be then acid?

regards,

Tanja
[9 Nov 2007 17:53] Kristian Koehntopp
Apparently, with the fix as suggest by Heikki you have two options:

- Same behaviour re ACID and speed as in DB/2 and MS SQL. This may or not be ACID depending on properties of hardware, driver and OS. MySQL will be as ACID or non-ACID as the other products.

- Traditional behavior with Flush, this is slower, but supposed to be ACID on all hardware.

Also, apparently Windows disables the write-cache by default, Linux does not (hdparm -W0 ... required on the tested hardware).
[13 Nov 2007 5:28] James Day
Guilhem,

I probably wrote more than I should about what happens if a drive doesn't support it.:)

I agree that changing innodb_flush_log_at_trx_commit=2 so that each transaction commit is written directly to disk surface makes 2 become ACID if the writes really get to the disk surface. That makes 2 the same as 1, with the same limit to committing no more often than the end of the log file can move under the disk heads. That completely breaks innodb_flush_log_at_trx_commit=2 on hardware that supports FLAG_FILE_WRITE_THROUGH since the whole point of 2 is not to wait for the disk to write to the surface.

Kristian's test seems to be showing that his disk is not obeying FLAG_FILE_WRITE_THROUGH saying that it should write to the disk surface but is instead doing hardware buffered commits at a rate faster than the absolute hardware maximum of disk RPM/60 per second for real committed physical writes to the end of the same file. Using hardware buffered writes is an obvious performance gain but it's not ACID.

This doesn't really seem to belong to either innodb_flush_log_at_trx_commit=1 or 2. His drive works properly for 1 (slow as expected) but is broken with this in his test with 2 (faster than RPM, so must be buffering) so it looks as though using this with 1 would break systems that work properly now. And using it with 2 would break 2 for systems that support FLAG_FILE_WRITE_THROUGH, slowing them to about the same rate as 1 because they will really do the writes to the disk surface.

It looks like a good idea but it doesn't seem to be something that's sensible to do with any of the existing innodb_flush_log_at_trx_commit settings. Maybe add a 4 setting or a different setting to say use FLAG_FILE_WRITE_THROUGH instead of syncs. Adding 4 looks like the safest idea since that won't break anything.

Tanja, it's expected that it will be faster for drives that do not support FLAG_FILE_WRITE_THROUGH. Since they don't support it they will continue to use their hardware buffer to speed up the writes. That's not ACID but is faster than the real ACID flushing with 1, while probably still slower than with the original 2 without FLAG_FILE_WRITE_THROUGH, because of the extra overhead of the extra writes to the disk buffer.

Kristian, with a drive that supports FLAG_FILE_WRITE_THROUGH it should be a little faster than innodb_flush_log_at_trx_commit=1. It should be faster for both the log file (doesn't flush data for other files from the drive buffer) and the other files (don't get their data flushed). Looks like a general win for systems that support it.
[16 Nov 2007 12:50] Tanja Baumann
Hello,

thank you for the answers!

Can you already say WHEN the fix will be probably available? (Cause we need it urgendly)

regards
Tanja
[26 Nov 2007 7:56] Tanja Baumann
Hello!

Could you please tell us the approximate time when the fix will be available (or which release) because we have to plan our tests with this fix.

Thank you and best regards
Tanja
[16 Jan 2008 12:17] Heikki Tuuri
Tanja,

we may implement this is a future version of InnoDB.

If you need this immediately, you can edit InnoDB source code and compile mysqld using MS Visual C++.

I am changing Severity to Feature Request.

Regards,

Heikki
[26 Aug 2008 14:40] Brian Morearty
Hi,

I'm not familiar with the voting procedures for enhancements (or whether there even is one), but I would *love* to see this enhancement made in a soon-to-come version of InnoDB. I'd like to allow write-caching on my dev machine (Windows) to allow better developer productivity.

Thanks.

Brian
[15 Feb 2010 20:38] Kristian Köhntopp
Is this request still being worked on?