Bug #78647 POTENTIAL RUNNING OUT OF SPACE IN REDO - INVALID MARGIN CALCULATIONS
Submitted: 30 Sep 2015 13:09 Modified: 7 Jan 2016 13:47
Reporter: Pawel Olchawa Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S3 (Non-critical)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any

[30 Sep 2015 13:09] Pawel Olchawa
Description:
If we had a lot of tablespaces we could easily run out of space in the Redo when performing checkpoints.

That could result in many true errors "The age of the last checkpoint is...".

Assume we are just about to write a long entry to Redo. We use log_margin_checkpoint_age(len) to ensure that there will be space for the entry in the Redo.

But we also have to ensure that after we write the entry to the Redo, there will be space for a MLOG_CHECKPOINT and MLOG_FILE_NAMEs. Otherwise we would be unable to write a checkpoint then, and we would have no way to reclaim space in Redo without overwriting important part of Redo.

The problem is that log_margin_checkpoint_age() is not accurate when calculating the required space. It doesn't reserve space for future MLOG_CHECKPOINT + MLOG_FILE_NAMEs.

It checks if log_sys->lsn - log_sys->last_checkpoint_lsn + margin > log_sys->log_group_capacity,
where margin = len*2.

There are important issues to be noticed here:

1) Inside the loop in log_margin_checkpoint_age there is:

                log_mutex_exit();

                if (!flushed_enough) {
                        os_thread_sleep(100000);
                }

                // Paweł: assume many threads calling log_margin_checkpoint_age meet here!

                log_checkpoint(true, false);

                log_mutex_enter();

Then inside log_checkpoint() there is:

        log_mutex_exit();

        log_write_up_to(flush_lsn, true);

        // Paweł: assume that all those threads sleep here!
        // So none of them has finished the checkpoint, so last_checkpoint_lsn stays unchanged,
        // and they all perform fil_names_clear above because of that.

        log_mutex_enter();

So multiple threads calling to log_margin_checkpoint_age could meet just in the middle of making checkpoint:
a) before any of them has finished the job,
b) but after each of them called fil_names_clear() producing a lot of MLOG_CHECKPOINTs !

There is "if" which was supposed to protect us, before we call to fil_names_clear():

        if (!write_always
            && oldest_lsn
            == log_sys->last_checkpoint_lsn + SIZE_OF_MLOG_CHECKPOINT) {
                /* Do nothing, because nothing was logged (other than
                a MLOG_CHECKPOINT marker) since the previous checkpoint. */
                log_mutex_exit();
                return(true);
        }

but please notice here, that last_checkpoint_lsn stays unchanged until at least one of those checkpoints is finished!

2) Assume len is small and we are writing to the end of redo block. We would increment log->lsn additionally by LOG_BLOCK_TRL_SIZE and LOG_BLOCK_HDR_SIZE which is 16 additional bytes in total. Then margin=len*2 could be smaller than required  len + LOG_BLOCK_TRL_SIZE + LOG_BLOCK_HDR_SIZE.

What's more after we called log_margin_checkpoint_age(len) inside mtr, we are possibly incrementing len by 1.

3) We have to reserve space in the Redo for possible future MLOG_FILE_NAMEs. The number of them is limited by current number of tablespaces (+1, because we are possibly adding a new one, while we are trying to add some redo in the middle of that). Each of them is limited by the maximum length of file name.

We should add that space to log->max_checkpoint_age whenever we use it!

The same applies to log->max_checkpoint_age_async! Otherwise we would end up in a situation where we don't force the check of margins (and do nothing in log_free_check).

How to repeat:
We should add a sleep in log_margin_checkpoint_age before:

                log_checkpoint(true, false);

                log_mutex_enter();

and simulate a heavy rw workload.

We should then see errors: "The age of the last checkpoint is..."

Suggested fix:
Inside calc_max_ages() there is:
    free = LOG_CHECKPOINT_FREE_PER_THREAD * (10 + srv_thread_concurrency)
                + LOG_CHECKPOINT_EXTRA_FREE;

and:
    margin = smallest_capacity - free;
    //...
    log_sys->max_checkpoint_age = margin;

It provides the margin for multiple threads trying to write a checkpoint, before any of them has finished, meeting in the middle!

Inside log_margin_checkpoint_age() we should compare to log_sys->max_checkpoint_age instead of log_group_capacity, so we should change:

   while (log_sys->lsn - log_sys->last_checkpoint_lsn + margin
               > log_sys->log_group_capacity) {

to:

   while (log_sys->lsn - log_sys->last_checkpoint_lsn + margin
               > log_sys->max_checkpoint_age) {

Additionally for the issue 2):

I would recommend to add LOG_BLOCK_TRL_SIZE + LOG_BLOCK_HDR_SIZE + 1 to variable "free" inside calc_max_ages(), together with a comment describing what is that for.

Then we would have less magic hidden behind the LOG_CHECKPOINT_EXTRA_FREE.

We should definitely find solution for issue 3). Otherwise we would have corrupted Redo very often if we had a lot of tablespaces!
[15 Oct 2015 12:34] Marko Mäkelä
Posted by developer:
 
The function log_margin_checkpoint_age() was introduced in WL#7868.
[7 Jan 2016 13:47] Daniel Price
Posted by developer:
 
Fixed as of the upcoming 5.7.11, 5.8.0 release, and here's the changelog entry:

To avoid a potential hang and redo log overwrite, the
innodb_log_file_size minimum value was increased from 1MB to 4MB, and the
length calculation in log_margin_checkpoint_age() was revised.

innodb_log_file_size documentation was also updated for 5.7 and 5.8:
http://dev.mysql.com/doc/refman/5.7/en/innodb-parameters.html#sysvar_innodb_log_file_size