MySQL Bugs: #77214: InnoDB data lost if fast_shutdown=1 and trx

Bug #77214	InnoDB data lost if fast_shutdown=1 and trx_commit=0
Submitted:	1 Jun 2015 18:42	Modified:	26 Apr 2017 14:45
Reporter:	Rene' Cannao'	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	5.6	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	innodb

Description:
A combination of innodb_flush_log_at_trx_commit!=1 and innodb_fast_shutdown=1 can cause data lost as log buffer is not copied into redo log.

If innodb_flush_log_at_trx_commit!=1 , redo log is written by the background master thread. 
log_buffer_sync_in_background() is the function that writes the log buffer to the log file and if 'flush' is set (innodb_flush_log_at_trx_commit==2) it forces a flush of the log file as well.
log_buffer_sync_in_background() performs the write calling log_write_up_to() .

log_buffer_sync_in_background() is called by the background master thread only through srv_sync_log_buffer_in_background() .

While mysqld and InnoDB are up and running, srv_sync_log_buffer_in_background() is called by the background master thread either in srv_master_do_active_tasks() or srv_master_do_idle_tasks().
When InnoDB is shutting down (srv_shutdown_state>0), srv_sync_log_buffer_in_background() is skipped (not executed) by both srv_master_do_active_tasks() and srv_master_do_idle_tasks() .
Although, srv_sync_log_buffer_in_background() is called by InnoDB background master thread from srv_master_do_shutdown_tasks() only if srv_fast_shutdown==0 (innodb_fast_shutdown=0) .

To summarize so far, InnoDB background master thread copies the redo buffer to redo log only if InnoDB is not shutting down or during shutdown if innodb_fast_shutdown=0 .

Furthermore, when InnoDB is being closed as a storage engine, innobase_end() calls innobase_shutdown_for_mysql(), that calls logs_empty_and_mark_files_at_shutdown() .
Only if srv_fast_shutdown==2 (innodb_fast_shutdown=2) , logs_empty_and_mark_files_at_shutdown() calls log_buffer_flush_to_disk() that calls log_write_up_to(), writing the redo buffer to redo log file.

To summarize everything, if innodb_flush_log_at_trx_commit!=1 :
- the background master thread normally copies redo buffer to redo log;
- during shutdown, the background master thread copies redo buffer to redo log if innodb_fast_shutdown=0 ;
- while InnoDB is being unregistered as storage engine, redo buffer is copied to redo log only if innodb_fast_shutdown=2 ;
- if innodb_fast_shutdown=1 , InnoDB doesn't ensure writing redo buffer to redo log : this can cause data lost.

How to repeat:
Set innodb_flush_log_at_trx_commit!=1 and innodb_fast_shutdown=1 on a busy slave and restart mysqld .

Suggested fix:
Ensure that redo buffer is copied to redo log during shutdown.

A possible fix is to edit srv_master_do_shutdown_tasks() moving srv_sync_log_buffer_in_background() just before "if (srv_fast_shutdown == 1)"

This bug is present in MySQL 5.6 , and probably on older versions too.

Editing the title from trx_commit!=1 to trx_commit=0

After more digging into the source code, only a combination of innodb_fast_shutdown=1 and innodb_flush_log_at_trx_commit=0 triggers data lost during a normal graceful shutdown.
innodb_fast_shutdown=1 and innodb_flush_log_at_trx_commit=2 seems to be safe, because the content of the redo buffer is copied into the redo log file in trx_flush_log_if_needed() if innodb_flush_log_at_trx_commit=2 .

innodb_fast_shutdown does a sharp checkpoint.  It flushes all dirty buffers, but does not flush the change buffer (it persists in tablespace 0).  Because of the sharp checkpoint, there can't be anything in the log buffer, thus it need not be copied.

If innodb_fast_shutdown=1 + innodb_flush_log_at_trx_commit=0 caused data loss, vast numbers of slaves would be out of sync with masters constantly, because they are frequently restarted, and they frequently have innodb_flush_log_at_trx_commit=0.

So I do not think this is a bug.

I meant innodb_fast_shutdown=1 does sharp checkpoint.  innodb_fast_shutdown=2 does not flush dirty buffers and recovery is needed, but nothing committed at that point is lost, because the log buffer is flushed.  The log simply needs to replay over the pages changed since the last checkpoint, and if the LSN on the page has already been updated, it won't update the page again.

trx_commit=0 is so designed that you can loose last transaction !!!

This is in the manual !!!

Sinisa. the manual says that you can lose some transactions *in case of mysql crash*, we are talking about a clean shutdown here...

@Justin, this is my understanding:

In a sharp checkpoint, all the pages are flushed to disk for that specific checkpoint LSN .
During shutdown, from what I see, all the flushing is done by buf_flush_page_cleaner_thread(), while the checkpoint itself is done by the background master thread in srv_master_do_shutdown_tasks() calling log_checkpoint() .
log_checkpoint() calls log_buf_pool_get_oldest_modification() to determine the lsn to checkpoint, and log_buf_pool_get_oldest_modification() returns the oldest modified block lsn in the buffer pool, or log_sys->lsn if none exists.
As checkpoint and flushing is done in two different threads, race conditions may happens. To be more specific, I believe it is possible that while buf_flush_page_cleaner_thread() is still flushing pages (so there are still dirty pages in buffer pool), the background master thread exists without making a checkpoint up to log_sys->lsn .

In fact, reading the comments in buf_flush_page_cleaner_thread(), it is expected that the master and the purge threads exit while buf_flush_page_cleaner_thread() it is still running.
I believe that, if the master thread exits before the buffer pool has been flushed completely, log_checkpoint() shouldn't be enough to ensure that the latest transactions were copied from redo buffer into redo log.

I still believe that srv_sync_log_buffer_in_background() should be called even if innodb_fast_shutdown=1 .

@Sinisa: I assume a clean shutdown never causes data lost, no matter the value of innodb_flush_log_at_trx_commit .

InnoDB follows WAL (write ahead logging) which implies that no dirty page can ever be written to disk before the relevant redo entries have made it to the disk. This is true whatever innodb_flush_log_at_trx_commit setting. There is a call in buf_flush_write_block_low() to log_write_up_to() which ensures that.

Given above, when page_cleaner goes through the entire buffer pool writing down all pages during shutdown it is guaranteed that redo log will get written to the disk.

The final checkpoint is not made in the master thread. It is made in the thread triggering the shutdown after the page_cleaner has exited. IIRC after the checkpoint is made and flushed to disk the checkpoint lsn is stamped on the system tablespace header page as well.

My dear Rene,

As you well know we need a fully repeatable case, in order to be able to verify the bug.

Also, this bug is not a duplicate of #85585.