Bug #86865 InnoDB does unnecessary work when extending a tablespace
Submitted: 29 Jun 2017 8:09 Modified: 19 Oct 2020 14:17
Reporter: Alexey Kopytov Email Updates:
Status: Closed Impact on me:
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.7, 8.0 OS:Any
Assigned to: CPU Architecture:Any

[29 Jun 2017 8:09] Alexey Kopytov
I'm looking at a benchmark with a few writers (INSERTs constantly
increasing the tablespace) and many readers (SELECTs preferring the most
recently inserted data).

The major bottleneck for this kind of workload appears to be tablespace
extension in InnoDB. What happens is that when an INSERT needs to do a
page split, it calls btr_cur_pessimistic_insert() with the page X-locked
and the index tree SX-locked. btr_cur_pessimistic_insert() calls
fsp_reserve_free_extents() which eventually boils down to the following
code in fil_space_extend():

#if !defined(NO_FALLOCATE) && defined(UNIV_LINUX)
		int     ret = posix_fallocate(node->handle.m_file, node_start, len);
		/* We already pass the valid offset and len in, if EINVAL
		is returned, it could only mean that the file system doesn't
		support fallocate(), currently one known case is
		ext3 FS with O_DIRECT. We ignore EINVAL here so that the
		error message won't flood. */
		if (ret != 0 && ret != EINVAL) {
			err = DB_IO_ERROR;
#endif /* NO_FALLOCATE || !UNIV_LINUX */

		if (!node->atomic_write || err == DB_IO_ERROR) {

			bool	read_only_mode;

			read_only_mode = (space->purpose != FIL_TYPE_TEMPORARY
					  ? false : srv_read_only_mode);

			err = fil_write_zeros(
				node, page_size, node_start,
				static_cast<ulint>(len), read_only_mode);

So, assuming we are on a non-FusionIO device, InnoDB calls
posix_fallocate(), but then calls fil_write_zeros() on the allocated
file space, even if posix_fallocate() is available and succeeded.

Which takes time on a busy disk, which in turn may block readers of the
page being split (which is a frequent situation, since readers prefer
recently inserted data) and writers (due to an SX-lock on the index
tree) for significant amounts of time.

I don't understand the purpose of fil_write_zeros() on a successful
posix_fallocate(). Regardless of the way posix_fallocate() is
implemented, the newly allocated space will be filled with zeros
automatically. On a reasonably recent Linux system, that will be done in
a more efficient way than physically writing zeroes.

I have a patch disabling fil_write_zeros() if posix_fallocate()
succeeded, which improves throughput in this kind of benchmark by
multiple times.

That code looks identical in both 5.7 and 8.0.

How to repeat:
It might be possible to simulate this workload with sysbench Lua, but I
don't have time to implement it right now. I may post it later, but
currently code inspection is the only way to verify.

Suggested fix:
Don't call fil_write_zeros() if posix_fallocate() succeeded.
[29 Jun 2017 14:48] OCA Admin
Contribution submitted via Github - Bug #86865: InnoDB does unnecessary work when extending a tablespace 
(*) Contribution by Alexey Kopytov (Github akopytov, mysql-server/pull/152#issuecomment-311943498): I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: git_patch_128093743.txt (text/plain), 1.33 KiB.

[30 Jun 2017 15:25] MySQL Verification Team
Hi Kaamos, my friend,

My opinion is that this is truly waste of resources, so I am verifying this bug.

Thank you very much for your patch and the entire contribution.

The only small caveat that I have is that there may be some library version, out there , which does not do automatic zerofill on posix_fallocate(). Hence, some additional checks might be required with our configuration sctipts.

Thanks again !!!!!!
[11 Apr 2018 7:30] zhai weixiang
hi, Alexey 
i think it makes sense to skip filling zero. Just one case: the crash recovery will only treate full zero page as a clean one, otherwise if the page is not initialized by filling zero, it may consider it as corrupted and refuse to startup. But this can be workaround during crash recovery: check if the first redo log of the page is MLOG_INIT_FILE_PAGE
[19 Oct 2020 14:17] Alexey Kopytov
This appears to be fixed in 8.0.22. From the release notes:

"The new innodb_extend_and_initialize variable permits configuring how InnoDB allocates space to file-per-table and general tablespaces on Linux. By default, when an operation requires additional space in a tablespace, InnoDB allocates pages to the tablespace and physically writes NULLs to those pages. This behavior affects performance if new pages are allocated frequently. As of MySQL 8.0.22, you can disable innodb_extend_and_initialize on Linux systems to avoid physically writing NULLs to newly allocated tablespace pages. When innodb_extend_and_initialize is disabled, space is allocated using posix_fallocate() calls, which reserve space without physically writing NULLs.

A posix_fallocate() operation is not atomic, which makes it possible for a failure to occur between allocating space to a tablespace file and updating the file metadata. Such a failure can leave newly allocated pages in an uninitialized state, resulting in a failure when InnoDB attempts to access those pages. To prevent this scenario, InnoDB writes a redo log record before allocating a new tablespace page. If a page allocation operation is interrupted, the operation is replayed from the redo log record during recovery."
[20 Oct 2020 12:18] MySQL Verification Team
Hi Alexey,

Yes, you are correct.

Thanks a lot ........