MySQL Bugs: #69002: InnoDB reads transaction logs on writes

Bug #69002	InnoDB reads transaction logs on writes
Submitted:	18 Apr 2013 21:03	Modified:	29 Oct 2018 18:06
Reporter:	Domas Mituzas	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S3 (Non-critical)
Version:	5.6.12	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
Whenever InnoDB pwrite()s a transaction log entry, as it is aligned at 512-byte marks and not OS page boundaries, it has a relatively high chance to trigger a page read, if it has been evicted from OS cache. 

With large transaction logs that means that either InnoDB will hit this a lot, or amount of memory equal to innodb transaction log will be wasted. 

How to repeat:
1. sysctl -w vm.drop_caches=1
2. start tracing trace I/O on txlog partitions
3. write data into InnoDB
4. look at a trace

Example trace:

252,0    4     1819     0.871602860 29493  A   R 13164472 + 8 <- (252,1) 13162424
252,1    4     1820     0.871603681 29493  Q   R 13164472 + 8 [mysqld]
252,1    4     1821     0.871604761 29493  G   R 13164472 + 8 [mysqld]
252,1    4     1823     0.871605983 29493  I   R 13164472 + 8 [mysqld]
252,1    4     1825     0.871607892 29493  D   R 13164472 + 8 [mysqld]
252,1    4     1834     0.875700866     0  C   R 13164472 + 8 [0]
252,0    7      892     0.880682300 29493  A   R 13164512 + 8 <- (252,1) 13162464
252,1    7      893     0.880683463 29493  Q   R 13164512 + 8 [mysqld]
252,1    7      894     0.880685330 29493  G   R 13164512 + 8 [mysqld]
252,1    7      895     0.880686423 29493  I   R 13164512 + 8 [mysqld]
252,1    7      896     0.880696057 29493  D   R 13164512 + 8 [mysqld]
252,1    4     1839     0.884555295     0  C   R 13164512 + 8 [0]

Suggested fix:
There are multiple options:

1. ditch any kind of commit grouping and use O_DIRECT (O_ALL_ODIRECT flush option exists in various branches/trees/forks)
2. use 4k writes and waste lots of space in transaction logs (Percona have this implemented)
3. pad right side of log writes that are doing partial page write with zeroes, later overwrite the zeroes

option 3 is most efficient, as it will work with all kinds of storage, sync and buffered modes, etc.

Domas,

First of all, various OS's can have various page sizes, right or wrong ??? Hence, as this size is rarely configurable, you would vote for the configurable log entry sizes. Right or wrong ???

Next, with a specialized buffer for those logs, InnoDB would require less reads and would not require OS cache for it.

Next, well set controller and / or hard disk caches would also lower a number of required reads.

Regarding your proposed solutions, your favorite option, option 3, would require more disk reads, unless that page would be in memory until it's 4K are filled up. 

Tell me which of my presumptions are right and which are wrong here ....

Sinisa,

1. my knowledge is limited to very few systems, but on all of them there is a way to read OS page size setting from the OS.
2. InnoDB uses specialized buffer with innodb_flush_log_at_trx_commit=0, once you use 1 and 2 that buffer is not used, and cannot be used to offset this problem. The problem is about page-ins on partial page writes.
3. You are waving your hands. Well set controller usually has read-ahead disabled (why would you want read-ahead on write-only workload, in case of transaction log?), well set disks also have read-ahead disabled. The penalty of read-ahead is high, and we have measured it in various cases. Do note, that read-ahead properties deteriorate under heavy parallel workload, the amount of work sent by random access will make read-ahead opportunities impossible. Again, do tell me which raid/disk configuration would execute smart read-ahead in this pattern, I'm eager to hear.
4. I don't see how option 3 results in more disk reads, it results in less, it can be verified empirically. You may be correct only in case where we're talking about memory-mapped interfaces, where each write is kind of partial. InnoDB does not use mmap() for transaction logs, InnoDB uses pwrite(), and if you do page-sized pwrite to a page that is not in memory, there is no need for page-in. That can be seen, if you trace how I/O subsystems execute buffered InnoDB log I/O (flush=0). So, I have no idea, how you came up with your conclusion, but you did not involve any systems knowledge or experience in shaping it.

So, I think that your presumptions 1, 2, 3 and 4 are wrong. Have a good day!

Domas, nothing to surprise you in this reply, just letting you know that we know it's an issue and have during the 5.6 timeline tried various things that might or might not work better, but didn't find something that worked better without also working worse in other cases. It's something that's still on the agenda for 5.7. I don't know if the pad with zeros part your suggestion 3 was tried. I think that 1 and 2 were, along with at least one variation on 3 that did seem to be causing read-modify-write activity in tests.

For others, Domas is right about the current effects: when setting innodb_log_file_size at least on linux you are also telling the OS to try to allocate that much RAM times the number of log files and have to account for that in your RAM allocation budget. Or less desirably, you might end up with physical page reads instead, if the log files are big enough for them to have been flushed from the OS buffer.

this was addressed in 5.7.4 (with no reference to this bug):

"InnoDB: A new system variable, innodb_log_write_ahead_size, allows you to configure the write-ahead block size for redo logs to a value that matches the operating system or file system cache block size in order to avoid “read-on-write” for redo log writes. Read-on-write occurs when redo log blocks are not entirely cached to the operating system or file system due to a mismatch between write-ahead block size for redo logs and operating system or file system cache block size. Avoiding read-on-write improves throughput stability for redo log writes. (Bug #17571371)"

seems to be addressed