Bug #86865 InnoDB does unnecessary work when extending a tablespace
Submitted: 29 Jun 2017 8:09
Modified: 30 Jun 2017 15:25
Reporter: Alexey Kopytov
Status: Verified
Category: MySQL Server: InnoDB storage engine
Severity: S3 (Non-critical)
Version: 5.7, 8.0
OS: Any
Assigned to:
CPU Architecture: Any

[29 Jun 2017 8:09] Alexey Kopytov
I'm looking at a benchmark with a few writers (INSERTs constantly
increasing the tablespace) and many readers (SELECTs preferring the most
recently inserted data).

The major bottleneck for this kind of workload appears to be tablespace
extension in InnoDB. What happens is that when an INSERT needs to do a
page split, it calls btr_cur_pessimistic_insert() with the page X-locked
and the index tree SX-locked. btr_cur_pessimistic_insert() calls
fsp_reserve_free_extents() which eventually boils down to the following
code in fil_space_extend():

#if !defined(NO_FALLOCATE) && defined(UNIV_LINUX)
		int     ret = posix_fallocate(node->handle.m_file, node_start, len);
		/* We already pass the valid offset and len in, if EINVAL
		is returned, it could only mean that the file system doesn't
		support fallocate(), currently one known case is
		ext3 FS with O_DIRECT. We ignore EINVAL here so that the
		error message won't flood. */
		if (ret != 0 && ret != EINVAL) {
			err = DB_IO_ERROR;
#endif /* NO_FALLOCATE || !UNIV_LINUX */

		if (!node->atomic_write || err == DB_IO_ERROR) {

			bool	read_only_mode;

			read_only_mode = (space->purpose != FIL_TYPE_TEMPORARY
					  ? false : srv_read_only_mode);

			err = fil_write_zeros(
				node, page_size, node_start,
				static_cast<ulint>(len), read_only_mode);

So, assuming we are on a non-FusionIO device, InnoDB calls
posix_fallocate(), but then calls fil_write_zeros() on the allocated
file space, even if posix_fallocate() is available and succeeded.

This takes time on a busy disk, which in turn may block both readers of
the page being split (a frequent situation, since readers prefer
recently inserted data) and writers (due to the SX-lock on the index
tree) for significant amounts of time.

I don't understand the purpose of fil_write_zeros() on a successful
posix_fallocate(). Regardless of the way posix_fallocate() is
implemented, the newly allocated space will be filled with zeros
automatically. On a reasonably recent Linux system, that will be done in
a more efficient way than physically writing zeros.
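
To illustrate the point, here is a small standalone C++ sketch (not InnoDB code) that extends a scratch file with posix_fallocate() and checks that the new region reads back as zeros. The function name and the /tmp path template are made up for this example, and it assumes a POSIX system where /tmp is writable.

```cpp
#include <fcntl.h>
#include <sys/stat.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cstdlib>
#include <vector>

// Hypothetical helper, for illustration only.
// Extends an empty scratch file to `len` bytes with posix_fallocate() and
// reports whether the whole new region reads back as zeros.
// Returns 1 if all bytes are zero, 0 if a non-zero byte was found,
// and -1 on any error (including a failed posix_fallocate()).
int fallocated_region_reads_as_zeros(off_t len) {
    assert(len > 0);  // posix_fallocate() rejects a zero length with EINVAL

    char path[] = "/tmp/fallocate_demo_XXXXXX";
    int fd = mkstemp(path);
    if (fd < 0) return -1;
    unlink(path);  // the file goes away once fd is closed

    int result = -1;
    // posix_fallocate() returns the error number directly; it does not set errno.
    if (posix_fallocate(fd, 0, len) == 0) {
        struct stat st;
        std::vector<char> buf(static_cast<size_t>(len));
        if (fstat(fd, &st) == 0 && st.st_size == len &&
            pread(fd, buf.data(), buf.size(), 0) == static_cast<ssize_t>(len)) {
            bool all_zero = std::all_of(buf.begin(), buf.end(),
                                        [](char c) { return c == 0; });
            result = all_zero ? 1 : 0;
        }
    }
    close(fd);
    return result;
}
```

Calling fallocated_region_reads_as_zeros(64 * 1024) is enough to try it on a given file system. It only shows what a reader of the file observes; it says nothing about how the file system implements the allocation.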

I have a patch that skips fil_write_zeros() if posix_fallocate()
succeeded, which improves throughput in this kind of benchmark
severalfold.

That code looks identical in both 5.7 and 8.0.

How to repeat:
It might be possible to simulate this workload with sysbench Lua, but I
don't have time to implement it right now. I may post it later, but
currently code inspection is the only way to verify.

Suggested fix:
Don't call fil_write_zeros() if posix_fallocate() succeeded.
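
The intended control flow can be sketched roughly as follows. This is a standalone illustration, not the submitted patch: extend_file(), write_zeros(), ExtendResult and the allow_fallocate switch are hypothetical names, and unlike fil_space_extend() the sketch treats every posix_fallocate() failure (EINVAL included) as a reason to fall back to writing zeros.

```cpp
#include <fcntl.h>
#include <unistd.h>

#include <algorithm>
#include <cassert>
#include <cerrno>
#include <vector>

// Hypothetical result type, for illustration only.
enum class ExtendResult { kFallocated, kZeroWritten, kIoError };

// Fallback path: physically write `len` zero bytes starting at `start`.
static bool write_zeros(int fd, off_t start, off_t len) {
    std::vector<char> zeros(64 * 1024, 0);
    while (len > 0) {
        size_t chunk = static_cast<size_t>(
            std::min<off_t>(len, static_cast<off_t>(zeros.size())));
        ssize_t n = pwrite(fd, zeros.data(), chunk, start);
        if (n < 0 && errno == EINTR) continue;  // interrupted, retry
        if (n <= 0) return false;               // real I/O error
        start += n;
        len -= n;
    }
    return true;
}

// Extends the file by `len` bytes at offset `start`.
// Zeros are written explicitly only when posix_fallocate() is disabled
// or did not succeed.
ExtendResult extend_file(int fd, off_t start, off_t len,
                         bool allow_fallocate = true) {
    assert(start >= 0 && len > 0);
    if (allow_fallocate && posix_fallocate(fd, start, len) == 0) {
        return ExtendResult::kFallocated;  // no explicit zero write needed
    }
    return write_zeros(fd, start, len) ? ExtendResult::kZeroWritten
                                       : ExtendResult::kIoError;
}
```

With this shape, the expensive write happens only on file systems where posix_fallocate() is unavailable or fails, which is the case the existing fallback is needed for anyway.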
[29 Jun 2017 14:48] OCA Admin
Contribution submitted via Github - Bug #86865: InnoDB does unnecessary work when extending a tablespace 
(*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/152#issuecomment-311943498): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_128093743.txt (text/plain), 1.33 KiB.

[30 Jun 2017 15:25] Sinisa Milivojevic
Hi Kaamos, my friend,

My opinion is that this is truly a waste of resources, so I am verifying this bug.

Thank you very much for your patch and the entire contribution.

The only small caveat that I have is that there may be some library version out there which does not do automatic zerofill on posix_fallocate(). Hence, some additional checks might be required in our configuration scripts.

Thanks again !!!!!!
[11 Apr 2018 7:30] zhai weixiang
Hi Alexey,
I think it makes sense to skip filling with zeros. Just one case to consider: crash recovery will only treat an all-zero page as a clean one. If the page was not initialized by filling it with zeros, recovery may consider it corrupted and refuse to start up. This can be worked around during crash recovery by checking whether the first redo log record for the page is MLOG_INIT_FILE_PAGE.
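
A rough sketch of that recovery-time check, with made-up names throughout: RedoType, page_is_all_zero() and page_usable_for_recovery() are hypothetical and do not mirror InnoDB's actual structures. RedoType::kInitFilePage stands in for MLOG_INIT_FILE_PAGE, and checksum_ok stands in for whatever page validation recovery already performs.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for redo record types; not InnoDB's definitions.
enum class RedoType { kNone, kInitFilePage, kOther };

// True if every byte of the page is zero.
bool page_is_all_zero(const std::vector<unsigned char>& page) {
    return std::all_of(page.begin(), page.end(),
                       [](unsigned char b) { return b == 0; });
}

// Decides whether recovery may proceed with this page.
// A page that fails validation is still accepted if it is all zeros, or if
// the first redo record for it re-initializes the page, because in that
// case the on-disk contents are overwritten anyway.
bool page_usable_for_recovery(const std::vector<unsigned char>& page,
                              bool checksum_ok, RedoType first_redo) {
    assert(!page.empty());
    if (checksum_ok) return true;
    if (page_is_all_zero(page)) return true;
    return first_redo == RedoType::kInitFilePage;
}
```

The only addition over the existing "all-zero means clean" rule is the last line: leftover bytes in a never-initialized page are tolerated when the redo log is about to initialize that page.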