Description:
Currently log_sys is protected by a single mutex, log_sys->mutex, and the scope of this mutex is too large.
When we run pure write workloads (inserts/updates only), "show engine innodb status" shows serious contention on log_sys->mutex.
The contention is especially visible after we changed the redo log file I/O type to O_DIRECT, which we did because our log files are on a PCI-E SSD.
So I think log_sys->mutex should be split into several mutexes.
How to repeat:
Run any pure write script with sysbench on a server with very fast I/O hardware.
Suggested fix:
We use O_DIRECT for the redo log files because all our data files and log files are on a PCI-E SSD, where direct I/O is faster.
The main idea is to split mtr_commit into 3 steps:
1, Allocating space in log_sys->buf.
2, Copying mtr->log to log_sys->buf.
3, Writing log_sys->buf to disk.
These steps are protected by 3 kinds of locks: log_sys->mutex for allocation, log_sys->copy_mutex[i] for copying, and log_sys->write_mutex for writing.
copy_mutex is an array with srv_n_log_copy_mutexes items. Each log_sys->copy_mutex[i] protects one range of log_sys->buf.
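A minimal sketch of how a byte range of the log buffer could map to copy mutex indexes, assuming the buffer is divided into equal slices. The constants and the helper copy_mutex_range() are hypothetical names for illustration and are not taken from the patch.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Assumed values for illustration; the report only names srv_n_log_copy_mutexes.
constexpr std::size_t kLogBufSize = 8 * 1024 * 1024;  // assumed size of log_sys->buf
constexpr std::size_t kCopyMutexes = 16;              // stands in for srv_n_log_copy_mutexes
constexpr std::size_t kSlice = kLogBufSize / kCopyMutexes;

// Return the inclusive index range [i, j] of the copy mutexes that cover
// the byte range [start, start + len) of the log buffer.
std::pair<std::size_t, std::size_t> copy_mutex_range(std::size_t start,
                                                     std::size_t len) {
    assert(len > 0 && start + len <= kLogBufSize);
    std::size_t i = start / kSlice;
    std::size_t j = (start + len - 1) / kSlice;
    return {i, j};
}
```

A range that stays inside one slice needs a single copy mutex; a range that crosses a slice boundary needs every mutex from i to j.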
In the Allocation & Copying steps: (log_reserve_and_write_fast, log_write_low)
1, Get the length of mtr->log.
2, Hold log_sys->mutex.
3, Get the log_sys->buf_free and log_sys->lsn values.
4, Acquire the log_sys->copy_mutex[i~j] that protect log_sys->buf_free ~ log_sys->buf_free+length.
5, Update log_sys->buf_free+=length and log_sys->lsn+=length.
6, Release the log_sys->mutex.
7, memcpy mtr->log to log_sys->buf.
8, Release the log_sys->copy_mutex[i~j].
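The allocation and copying steps above can be sketched as follows. This is a simplified illustration, not the attached patch: the LogSys struct, its sizes, and the function log_reserve_and_copy() are assumptions, and the sketch ignores buffer wrap-around and the buffer-full case.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <mutex>
#include <vector>

// Assumed sizes for illustration only.
constexpr std::size_t kLogBufSize = 1 << 16;
constexpr std::size_t kCopyMutexes = 8;  // stands in for srv_n_log_copy_mutexes
constexpr std::size_t kSlice = kLogBufSize / kCopyMutexes;

// Hypothetical, reduced stand-in for log_sys.
struct LogSys {
    std::mutex mutex;                                 // protects buf_free and lsn
    std::array<std::mutex, kCopyMutexes> copy_mutex;  // one per slice of buf
    std::vector<unsigned char> buf = std::vector<unsigned char>(kLogBufSize);
    std::size_t buf_free = 0;
    std::uint64_t lsn = 0;
};

// Reserve len bytes and copy mtr_log into them; returns the start LSN.
std::uint64_t log_reserve_and_copy(LogSys& log, const unsigned char* mtr_log,
                                   std::size_t len) {
    assert(len > 0);                                  // step 1: length is known
    std::size_t start, i, j;
    std::uint64_t start_lsn;
    {
        std::lock_guard<std::mutex> guard(log.mutex); // step 2
        assert(log.buf_free + len <= kLogBufSize);    // sketch: no flush-on-full
        start = log.buf_free;                         // step 3
        start_lsn = log.lsn;
        i = start / kSlice;
        j = (start + len - 1) / kSlice;
        for (std::size_t k = i; k <= j; ++k) {        // step 4, ascending order
            log.copy_mutex[k].lock();
        }
        log.buf_free += len;                          // step 5
        log.lsn += len;
    }                                                 // step 6: mutex released
    std::memcpy(log.buf.data() + start, mtr_log, len);  // step 7
    for (std::size_t k = i; k <= j; ++k) {            // step 8
        log.copy_mutex[k].unlock();
    }
    return start_lsn;
}
```

The memcpy runs outside log_sys->mutex, so other threads can reserve space while this copy is still in progress. Taking the copy mutexes in ascending order while log_sys->mutex is held keeps the lock order consistent between threads.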
In the Writing step: (log_write_up_to)
1, Check whether somebody else has already flushed the log for this trx; if yes, return.
2, Check whether somebody else is flushing the log; if yes, wait on log_sys->no_flush_event.
3, Hold log_sys->mutex, reset log_sys->no_flush_event.
4, Hold log_sys->write_mutex.
5, Get the range to flush, log_sys->buf_next_to_write (area_start) ~ log_sys->buf_free (area_end).
6, Update log_sys->write_lsn and other related variables.
7, Acquire the log_sys->copy_mutex[i~j] that protect log_sys->buf from area_start to area_end, to make sure this buffer range has been copied.
8, Release log_sys->mutex.
9, Call log_group_write_buf() to write log_sys->buf to file.
10, Release log_sys->copy_mutex[i~j].
11, Release log_sys->write_mutex.
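A simplified sketch of the writing step, under several assumptions: the LogSys struct is a hypothetical stand-in for log_sys, a std::condition_variable plus a flag replaces log_sys->no_flush_event, both "somebody else" checks are done under log_sys->mutex, and an in-memory vector stands in for the log file written by log_group_write_buf().

```cpp
#include <array>
#include <condition_variable>
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

constexpr std::size_t kLogBufSize = 1 << 16;  // assumed size
constexpr std::size_t kCopyMutexes = 8;       // stands in for srv_n_log_copy_mutexes
constexpr std::size_t kSlice = kLogBufSize / kCopyMutexes;

struct LogSys {  // hypothetical, reduced stand-in for log_sys
    std::mutex mutex;
    std::mutex write_mutex;
    std::array<std::mutex, kCopyMutexes> copy_mutex;
    std::condition_variable no_flush;         // stands in for no_flush_event
    bool flush_in_progress = false;
    std::vector<unsigned char> buf = std::vector<unsigned char>(kLogBufSize);
    std::size_t buf_free = 0;
    std::size_t buf_next_to_write = 0;
    std::uint64_t lsn = 0;
    std::uint64_t write_lsn = 0;
    std::vector<unsigned char> file;          // stand-in for the redo log file
};

// Returns true if this call wrote something, false if nothing was needed.
bool log_write_up_to(LogSys& log, std::uint64_t lsn) {
    std::unique_lock<std::mutex> sys(log.mutex);
    // Wait while somebody else is flushing, then check whether that flush
    // (or an earlier one) already covered the requested LSN.
    log.no_flush.wait(sys, [&] { return !log.flush_in_progress; });
    if (log.write_lsn >= lsn) return false;
    std::size_t area_start = log.buf_next_to_write;
    std::size_t area_end = log.buf_free;
    if (area_start == area_end) return false;  // nothing buffered
    log.flush_in_progress = true;              // "reset no_flush_event"
    log.write_mutex.lock();
    log.write_lsn = log.lsn;                   // update write bookkeeping
    log.buf_next_to_write = area_end;
    // Taking the copy mutexes waits for any memcpy still running in the range.
    std::size_t i = area_start / kSlice;
    std::size_t j = (area_end - 1) / kSlice;
    for (std::size_t k = i; k <= j; ++k) log.copy_mutex[k].lock();
    sys.unlock();                              // I/O happens outside log.mutex
    log.file.insert(log.file.end(), log.buf.begin() + area_start,
                    log.buf.begin() + area_end);
    for (std::size_t k = i; k <= j; ++k) log.copy_mutex[k].unlock();
    log.write_mutex.unlock();
    sys.lock();
    log.flush_in_progress = false;             // "set no_flush_event"
    sys.unlock();
    log.no_flush.notify_all();
    return true;
}
```

The point of the ordering is that log_sys->mutex is released before the file write, so allocation for later mtrs is not blocked by I/O unless it touches a slice that is currently being written.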
Moving log buffer: (In log_sys_check_flush_completion)
1, Hold log_sys->mutex.
2, Hold all log_sys->copy_mutex.
3, memmove log_sys->buf.
4, Release all log_sys->copy_mutex.
5, Release log_sys->mutex.
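A minimal sketch of the buffer move, assuming the unwritten tail [buf_next_to_write, buf_free) is moved to the start of the buffer. The LogSys struct and log_move_buffer() are hypothetical names for illustration; the real code also has to respect log block alignment, which is ignored here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <mutex>
#include <vector>

constexpr std::size_t kLogBufSize = 1 << 16;  // assumed size
constexpr std::size_t kCopyMutexes = 8;       // stands in for srv_n_log_copy_mutexes

struct LogSys {  // hypothetical, reduced stand-in for log_sys
    std::mutex mutex;
    std::array<std::mutex, kCopyMutexes> copy_mutex;
    std::vector<unsigned char> buf = std::vector<unsigned char>(kLogBufSize);
    std::size_t buf_free = 0;
    std::size_t buf_next_to_write = 0;
};

void log_move_buffer(LogSys& log) {
    std::lock_guard<std::mutex> sys(log.mutex);        // step 1
    for (auto& m : log.copy_mutex) m.lock();           // step 2: no copy can run
    assert(log.buf_next_to_write <= log.buf_free);
    std::size_t pending = log.buf_free - log.buf_next_to_write;
    std::memmove(log.buf.data(),                       // step 3
                 log.buf.data() + log.buf_next_to_write, pending);
    log.buf_free = pending;
    log.buf_next_to_write = 0;
    for (auto& m : log.copy_mutex) m.unlock();         // step 4
}                                                      // step 5: mutex released
```

Holding every copy mutex is what makes the memmove safe, but it also stalls all copiers for the duration of the move.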
The patch in the attached file is the simplest version; we have verified its performance and transaction behavior in some production cases. NOT ALL STEPS are implemented in this version.
I have another version that implements the above completely, but it is not stable yet: in some cases InnoDB cannot recover correctly. This is because the redo log file then contains a log sequence like this: mtr1.block1, mtr2.block1, mtr1.block2, mtr2.block2.... I found that InnoDB doesn't support this format, but Oracle can. I need to change it to a sequence like mtr1.block1, mtr1.block2, mtr2.block1, mtr2.block2..., or else modify the recovery code.
When it is ready to test, I will upload that version.
However, it would be very good if log_sys->buf could be implemented as a ring array of blocks, like log_sys->buf[i], where buf[i] is a pointer to a log buffer block; then moving the log buffer content would be unnecessary. Each mtr->log block would also be written to a separate block in the log buffer, which is more efficient on high-speed hardware.
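A rough sketch of the ring-of-blocks idea, only to show why no memmove is needed. Everything here (LogBlock, LogRing, ring_reserve(), ring_mark_written(), and the sizes) is a hypothetical illustration and not part of any patch.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <mutex>
#include <vector>

constexpr std::size_t kBlockSize = 512;  // assumed block size
constexpr std::size_t kRingBlocks = 4;   // assumed ring length

struct LogBlock {
    std::mutex mutex;                    // protects data and used
    std::array<unsigned char, kBlockSize> data{};
    std::size_t used = 0;
};

struct LogRing {
    std::mutex mutex;                             // protects the two counters
    std::vector<std::unique_ptr<LogBlock>> buf;   // buf[i] points to one block
    std::uint64_t next_alloc = 0;                 // block number handed out next
    std::uint64_t next_write = 0;                 // oldest block not yet on disk

    LogRing() {
        for (std::size_t i = 0; i < kRingBlocks; ++i) {
            buf.push_back(std::make_unique<LogBlock>());
        }
    }
};

// Reserve the next block for one mtr->log block.
// Returns nullptr when every block is still waiting to be written.
LogBlock* ring_reserve(LogRing& ring, std::uint64_t* block_no) {
    std::lock_guard<std::mutex> guard(ring.mutex);
    if (ring.next_alloc - ring.next_write == kRingBlocks) return nullptr;
    *block_no = ring.next_alloc++;
    return ring.buf[*block_no % kRingBlocks].get();
}

// Mark the oldest block as written so that its slot can be reused.
// The slot is recycled in place; no buffer content is moved.
void ring_mark_written(LogRing& ring) {
    std::lock_guard<std::mutex> guard(ring.mutex);
    if (ring.next_write < ring.next_alloc) {
        LogBlock* blk = ring.buf[ring.next_write % kRingBlocks].get();
        std::lock_guard<std::mutex> block_guard(blk->mutex);
        blk->used = 0;
        ++ring.next_write;
    }
}
```

The ring mutex is held only to advance a counter; filling a block would happen under that block's own mutex, so different mtrs could copy into different blocks in parallel.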