Bug #70950 log_sys->mutex is so hot in pure write cases
Submitted: 19 Nov 2013 7:03 Modified: 20 Nov 2013 4:44
Reporter: Lixun Peng (OCA)
Status: Verified
Category: MySQL Server: InnoDB storage engine Severity: S5 (Performance)
Version: 5.7+ OS: Any
Assigned to: Assigned Account CPU Architecture:Any
Tags: log_sys->mutex, redo

[19 Nov 2013 7:03] Lixun Peng
Currently all of log_sys is protected by log_sys->mutex, so this one mutex covers too much.

When running pure write workloads (only inserts/updates), "show engine innodb status" shows heavy contention on log_sys->mutex.

This is especially so since we changed the redo log file I/O type to O_DIRECT, because we use a PCI-E SSD for the log files.

So I think log_sys->mutex should be split into several mutexes.

How to repeat:
Run any pure write script with sysbench on a server with very fast I/O hardware.

Suggested fix:
We use O_DIRECT for the redo log files because all data files and log files are on a PCI-E SSD, where direct I/O is faster.

The main idea is to split mtr_commit into 3 steps:
1, Allocate space in log_sys->buf.
2, Copy mtr->log to log_sys->buf.
3, Write log_sys->buf to disk.

Three locks protect these steps: log_sys->mutex for allocation, log_sys->copy_mutex[i] for copying, and log_sys->write_mutex for writing.
copy_mutex is an array with srv_n_log_copy_mutexes items. Each log_sys->copy_mutex[i] protects one range of log_sys->buf.
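To illustrate how a buffer range could map onto copy_mutex[i~j], here is a minimal C++ sketch. It assumes the buffer is divided into equal-sized stripes, one per copy mutex; the report does not say how the ranges are laid out, and kLogBufSize, kNCopyMutexes, kStripe and copy_mutex_range() are hypothetical names used only for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Assumed sizes; the report only names srv_n_log_copy_mutexes.
constexpr std::size_t kLogBufSize = 1 << 20;    // stands in for the size of log_sys->buf
constexpr std::size_t kNCopyMutexes = 16;       // stands in for srv_n_log_copy_mutexes
constexpr std::size_t kStripe = kLogBufSize / kNCopyMutexes;  // bytes per copy mutex

// Returns the inclusive index range [i, j] of the copy mutexes that cover
// bytes [start, start + len) of the log buffer.
// Requires len > 0 and start + len <= kLogBufSize.
inline std::pair<std::size_t, std::size_t>
copy_mutex_range(std::size_t start, std::size_t len) {
    assert(len > 0 && start + len <= kLogBufSize);
    return {start / kStripe, (start + len - 1) / kStripe};
}
```

With this layout a small record usually needs one copy mutex, and a record that crosses a stripe boundary needs two or more.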

In the Allocation & Copying step: (log_reserve_and_write_fast, log_write_low)
1, Get the length of mtr->log.
2, Hold log_sys->mutex.
3, Get the log_sys->buf_free and log_sys->lsn values.
4, Acquire the log_sys->copy_mutex[i~j] that protect log_sys->buf_free ~ log_sys->buf_free+length.
5, Update log_sys->buf_free+=length and log_sys->lsn+=length.
6, Release log_sys->mutex.
7, memcpy mtr->log to log_sys->buf.
8, Release the log_sys->copy_mutex[i~j].
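The eight steps above can be sketched in C++ roughly as follows. This is not the attached patch: LogSys is a simplified stand-in for log_sys, the sizes and the equal-stripe layout of copy_mutex are assumptions, and reserve_and_copy() is a hypothetical name. A record that does not fit is simply rejected here, where real code would flush or move the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

constexpr std::size_t kBufSize = 1 << 16;  // assumed log buffer size
constexpr std::size_t kNCopy = 8;          // stands in for srv_n_log_copy_mutexes
constexpr std::size_t kStripe = kBufSize / kNCopy;

// Simplified stand-in for log_sys; field names follow the report.
struct LogSys {
    std::mutex mutex;               // protects buf_free and lsn (allocation)
    std::mutex copy_mutex[kNCopy];  // each protects one stripe of buf
    std::vector<unsigned char> buf = std::vector<unsigned char>(kBufSize);
    std::size_t buf_free = 0;
    std::uint64_t lsn = 0;
};

// Returns the start offset of the copied record, or SIZE_MAX if the record
// is empty or does not fit in the remaining buffer space.
std::size_t reserve_and_copy(LogSys& log, const void* rec, std::size_t len) {
    assert(rec != nullptr);
    if (len == 0) return SIZE_MAX;                        // step 1: length is known
    std::size_t start = 0, first = 0, last = 0;
    {
        std::lock_guard<std::mutex> alloc(log.mutex);     // step 2
        if (log.buf_free + len > kBufSize) return SIZE_MAX;
        start = log.buf_free;                             // step 3
        first = start / kStripe;
        last = (start + len - 1) / kStripe;
        for (std::size_t i = first; i <= last; ++i)       // step 4
            log.copy_mutex[i].lock();
        log.buf_free += len;                              // step 5
        log.lsn += len;
    }                                                     // step 6
    std::memcpy(log.buf.data() + start, rec, len);        // step 7
    for (std::size_t i = first; i <= last; ++i)           // step 8
        log.copy_mutex[i].unlock();
    return start;
}
```

The memcpy runs outside log_sys->mutex, so other threads can reserve space while a copy is in flight. A thread that needs a stripe still being copied into waits in step 4 while holding log_sys->mutex, so the stripe count bounds how much copying can overlap.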

In the Writing step: (log_write_up_to)
1, Check whether somebody else has already flushed the log for this trx; if yes, return.
2, Check whether somebody else is flushing the log; if yes, wait on log_sys->no_flush_event.
3, Hold log_sys->mutex, reset log_sys->no_flush_event.
4, Hold log_sys->write_mutex.
5, Get the range to flush: log_sys->buf_next_to_write (area_start) ~ log_sys->buf_free (area_end).
6, Update log_sys->write_lsn and other related variables.
7, Acquire the log_sys->copy_mutex[i~j] that protect log_sys->buf from area_start to area_end, to make sure this buffer range has been copied.
8, Release log_sys->mutex.
9, Call log_group_write_buf() to write log_sys->buf to file.
10, Release log_sys->copy_mutex[i~j].
11, Release log_sys->write_mutex.
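A rough C++ sketch of the writing step follows, under several assumptions. LogSys is a simplified stand-in for log_sys, the `file` vector stands in for the redo log file that log_group_write_buf() writes to, and the copy mutexes are assumed to cover equal-sized stripes. The no_flush_event wait is left out: writers here simply serialize on write_mutex, and the "already written" check is done after taking write_mutex so that it only sees completed writes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr std::size_t kBufSize = 1 << 16;  // assumed log buffer size
constexpr std::size_t kNCopy = 8;          // stands in for srv_n_log_copy_mutexes
constexpr std::size_t kStripe = kBufSize / kNCopy;

// Simplified stand-in for log_sys; field names follow the report.
struct LogSys {
    std::mutex mutex;                   // allocation state
    std::mutex write_mutex;             // serializes writers
    std::mutex copy_mutex[kNCopy];      // each protects one stripe of buf
    std::vector<unsigned char> buf = std::vector<unsigned char>(kBufSize);
    std::size_t buf_free = 0;
    std::size_t buf_next_to_write = 0;
    std::uint64_t lsn = 0;
    std::uint64_t write_lsn = 0;
    std::vector<unsigned char> file;    // hypothetical stand-in for the log file
};

void log_write_up_to(LogSys& log, std::uint64_t lsn) {
    std::unique_lock<std::mutex> alloc(log.mutex);        // hold log_sys->mutex
    std::lock_guard<std::mutex> writer(log.write_mutex);  // hold write_mutex
    if (log.write_lsn >= lsn) return;                     // someone already wrote it
    assert(log.buf_next_to_write <= log.buf_free);
    const std::size_t area_start = log.buf_next_to_write; // range to flush
    const std::size_t area_end = log.buf_free;
    if (area_start == area_end) return;                   // nothing buffered
    log.write_lsn = log.lsn;                              // update bookkeeping
    log.buf_next_to_write = area_end;
    const std::size_t first = area_start / kStripe;
    const std::size_t last = (area_end - 1) / kStripe;
    for (std::size_t i = first; i <= last; ++i)           // wait for in-flight copies
        log.copy_mutex[i].lock();
    alloc.unlock();                                       // release log_sys->mutex
    log.file.insert(log.file.end(),                       // "write to file"
                    log.buf.data() + area_start,
                    log.buf.data() + area_end);
    for (std::size_t i = first; i <= last; ++i)           // release copy mutexes
        log.copy_mutex[i].unlock();
}                                                         // write_mutex released here
```

Because log_sys->mutex is released before the write, new records can be reserved into other stripes while the I/O is in progress.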

Moving log buffer: (In log_sys_check_flush_completion)
1, Hold log_sys->mutex.
2, Hold all log_sys->copy_mutex.
3, memmove log_sys->buf.
4, Release all log_sys->copy_mutex.
5, Release log_sys->mutex.
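The buffer move can be sketched the same way. This C++ example is a simplification: it moves the whole unwritten tail (buf_next_to_write ~ buf_free) to the start of the buffer, without the block alignment that real code would need, and LogSys and move_log_buffer() are hypothetical names with assumed sizes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

constexpr std::size_t kBufSize = 1 << 16;  // assumed log buffer size
constexpr std::size_t kNCopy = 8;          // stands in for srv_n_log_copy_mutexes

// Simplified stand-in for log_sys; field names follow the report.
struct LogSys {
    std::mutex mutex;
    std::mutex copy_mutex[kNCopy];
    std::vector<unsigned char> buf = std::vector<unsigned char>(kBufSize);
    std::size_t buf_free = 0;
    std::size_t buf_next_to_write = 0;
};

void move_log_buffer(LogSys& log) {
    std::lock_guard<std::mutex> alloc(log.mutex);       // step 1
    for (auto& m : log.copy_mutex) m.lock();            // step 2: no copy in flight
    assert(log.buf_next_to_write <= log.buf_free);
    const std::size_t start = log.buf_next_to_write;
    const std::size_t len = log.buf_free - start;
    std::memmove(log.buf.data(), log.buf.data() + start, len);  // step 3
    log.buf_free = len;
    log.buf_next_to_write = 0;
    for (auto& m : log.copy_mutex) m.unlock();          // step 4
}                                                       // step 5: mutex released
```

Taking every copy mutex makes the move wait for all in-flight copies, and for a writer that still holds copy mutexes for its flush range.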

The attached patch is the simplest version; we have verified its performance and transaction behavior in some production cases. NOT ALL STEPS are implemented in this version.

I have another version that implements the above completely, but it is not stable yet: in some cases InnoDB cannot recover correctly, because the redo log file will contain a log sequence like mtr1.block1, mtr2.block1, mtr1.block2, mtr2.block2.... I found that InnoDB does not support this format, though Oracle does. I need to either change the sequence to mtr1.block1, mtr1.block2, mtr2.block1, mtr2.block2..., or modify the recovery code.

When it is ready for testing, I will upload that version.

However, it would be very good if log_sys->buf could be implemented as a ring array of blocks, like log_sys->buf[i], where buf[i] is a pointer to a log buffer block; then moving the log buffer content would be unnecessary. Each mtr->log block would be written to a separate block in the log buffer, which is more effective on high-speed hardware.
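A minimal C++ sketch of such a block ring, assuming fixed-size blocks, one record per block, and a single write_mutex that serializes writers. All names here (LogBlock, LogRing, ring_append, ring_write_oldest) and both sizes are hypothetical, and `out` stands in for the log file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <mutex>
#include <vector>

constexpr std::size_t kBlockSize = 512;  // assumed block size
constexpr std::size_t kNBlocks = 8;      // assumed ring length

struct LogBlock {
    std::mutex copy_mutex;               // held while the block is copied or written
    std::size_t used = 0;
    unsigned char data[kBlockSize];
};

struct LogRing {
    std::mutex mutex;                    // protects head, tail and count
    std::mutex write_mutex;              // serializes writers
    std::unique_ptr<LogBlock> buf[kNBlocks];  // buf[i] points to a log buffer block
    std::size_t head = 0;                // next block to hand out
    std::size_t tail = 0;                // oldest block not yet written
    std::size_t count = 0;               // blocks handed out and not yet written
    LogRing() { for (auto& b : buf) b.reset(new LogBlock); }
};

// Reserve one block and copy rec into it. Returns false if the ring is
// full or the record is larger than a block.
bool ring_append(LogRing& ring, const void* rec, std::size_t len) {
    assert(rec != nullptr);
    if (len > kBlockSize) return false;
    LogBlock* blk = nullptr;
    {
        std::lock_guard<std::mutex> g(ring.mutex);
        if (ring.count == kNBlocks) return false;
        blk = ring.buf[ring.head].get();
        blk->copy_mutex.lock();
        ring.head = (ring.head + 1) % kNBlocks;
        ++ring.count;
    }
    std::memcpy(blk->data, rec, len);    // copy outside the ring mutex
    blk->used = len;
    blk->copy_mutex.unlock();
    return true;
}

// Append the oldest block to out and free its slot. Returns false if empty.
bool ring_write_oldest(LogRing& ring, std::vector<unsigned char>& out) {
    std::lock_guard<std::mutex> writer(ring.write_mutex);
    LogBlock* blk = nullptr;
    {
        std::lock_guard<std::mutex> g(ring.mutex);
        if (ring.count == 0) return false;
        blk = ring.buf[ring.tail].get();
        blk->copy_mutex.lock();          // waits for an in-flight copy
    }
    out.insert(out.end(), blk->data, blk->data + blk->used);
    blk->used = 0;
    blk->copy_mutex.unlock();
    std::lock_guard<std::mutex> g(ring.mutex);
    ring.tail = (ring.tail + 1) % kNBlocks;  // slot is reusable, no memmove
    --ring.count;
    return true;
}
```

Freed slots are reused by advancing head and tail, so nothing has to be shifted inside the buffer after a write.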
[19 Nov 2013 7:05] Lixun Peng
patch on latest mysql-5.5 trunk

Attachment: 2locks_forMySQL.diff (application/octet-stream, text), 15.64 KiB.

[19 Nov 2013 7:05] Lixun Peng
patch on latest mysql-5.5 trunk

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: 2locks_forMySQL.diff (application/octet-stream, text), 15.64 KiB.