MySQL Bugs: #119759: InnoDB allow reads on non-durable data, which can be lost after crash(visibility before durability window)

Bug #119759	InnoDB allow reads on non-durable data, which can be lost after crash(visibility before durability window)
Submitted:	22 Jan 18:55	Modified:	22 Jan 21:14
Reporter:	Divyam Jaiswal	Email Updates:
Status:	Open	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	8.0.42	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	data loss, innodb, innodb_flush_log_at_trx_commit

Description:
In MySQL InnoDB (tested on 8.0.42) with innodb_flush_log_at_trx_commit = 1 and no binary log, a row inserted in a transaction can be:

1. Visible to other sessions after COMMIT returns
2. Then server is killed before the redo log flush completes
3. After restart and crash recovery, the row is lost
4. A concurrent reader had seen the row before crash

Because the row was visible before crash but lost after, this may appear as a dirty read followed by rollback — though strictly it is due to visibility before durability rather than classic isolation violation.
This scenario reliably reproduces with a high-concurrency workload.

Expected Behavior

With innodb_flush_log_at_trx_commit = 1, a committed transaction should be durable and survive a server crash after COMMIT returns.
A statement such as:

INSERT INTO t_ids (id) VALUES (X);COMMIT;SELECT id FROM t_ids WHERE id = X;

should persist the row X after a crash following the COMMIT.

Observed Behavior

Under a specific timing window, when killing the server immediately after the COMMIT but before the redo log flush completes, the row:

* is visible to other sessions before the crash
* is missing after restart
* InnoDB crash recovery rolls it back

This replicates a scenario where:

* Transaction T1 commits in memory and becomes visible
* Reader R sees the row
* kill -9 mysqld happens before redo flush
* After restart, the row is absent

Thus the server appears to have lost a committed transaction even though innodb_flush_log_at_trx_commit = 1.

Why This Happens

InnoDB’s commit path does:

1. Mark transaction as committed in memory
2. Release locks
3. Make row visible
4. Then flush redo log to disk

If a crash occurs in step 4 before redo flush completes, crash recovery does not find the commit in redo and rolls back the transaction. The row becomes invisible after restart even though it was visible before crash.
This matches the behavior described in other contexts where innodb_flush_log_at_trx_commit != 1 can lose transactions, but here it happens even with =1 due to the internal ordering of visibility vs. durability.

Impact

Even with the strictest durability setting (innodb_flush_log_at_trx_commit=1) and REPEATABLE READ, this can lead to lost visible transactions after a crash. This is especially problematic in high-concurrency environments with abrupt server termination.

How to repeat:
Environment

MySQL Server version: 8.0.42
Storage engine: InnoDB
innodb_flush_log_at_trx_commit = 1
sync_binlog = 0
binlog disabledIsolation level: default (REPEATABLE READ)
Operating system: any GNU/Linux
Tested with simple primary-key insert workloads

Steps to Reproduce

1. Create a table:
    CREATE DATABASE test_crash;
    USE test_crash;
    CREATE TABLE t_ids (
        id BIGINT NOT NULL PRIMARY KEY,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    ) ENGINE = InnoDB;
2. Start multiple concurrent writers inserting increasing IDs in transactions with autocommit OFF.
3. Start a reader thread polling latest IDs.
4. Randomly crash the server:
    sleep 0.1 && kill -9 $(pidof mysqld)
5. After restart, observe that some IDs previously observed by the reader are missing.

T1 @ T0:000 — T1 START TRANSACTION
T1 @ T0:005 — INSERT id=1005 (redo buffer & buffer pool updated, no fsync)
T1 @ T0:010 — COMMIT (InnoDB marks committed in memory & releases locks)
R  @ T0:011 — Reader sees id=1005
 Crash @ T0:012 — kill -9 before redo log flush to diskRestart @ T0:050 — crash recovery runs
Recovery: no redo commit for T1, rollback undo
R  @ T0:052 — SELECT does not return id=1005

Suggested fix:
Either:

1. Guarantee that visibility to other sessions occurs only after redo flush reaches disk, or
2. Document this window explicitly in MySQL/InnoDB documentation, as innodb_flush_log_at_trx_commit=1 does not in practice eliminate the visibility-before-durability window under crash conditions.

A related discussion in the MySQL Bug System notes that committing transactions may not be flushed to disk before a crash, leading to data loss under some configurations:
https://bugs.mysql.com/75519 — “Crash server after a transaction is committed, but before redo log is flushed. After recovery, the transaction will have been rolled back.” 
Also, InnoDB crash recovery documentation confirms that only redo logs on disk determine recovery and unflushed transactions are rolled back:
https://dev.mysql.com/doc/en/innodb-recovery.html

Mysql source code comment mentions that this could happen but there is no official mysql documentation for the same.

https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.42/storage/innobase/trx/trx0t...

Thanks for the interesting bug report Divyam Jaiswal. I am not on the MySQL Verification Team, but this caught my attention and I thought I would join the discussion.

> This is especially problematic in high-concurrency environments with abrupt server termination.

I was able to reproduce under very low concurrency (only 2 sessions).  The key to reproducing is to slow-down disk writes, which I have a trick for (link below).  And note that such high latency writes can happen with remote storage (like in the Cloud), so this situation is not only theoretical.

https://jfg-mysql.blogspot.com/2024/06/trick-to-simulate-higher-latency-ios-on-linux.html

Below, with IOs being delayed by 500 ms (update_delay 500, see how it works in post linked above), in all tests without binlogs, the data is visible (SELECT) significantly (1000 ms) before the transaction commits (UPDATE).  This is a symptom of the behavior reported here.  With binlogs, SELECT and UPDATE are very close.  So far, we have not seen that observed data is lost, but we will further down below.

vs="mysql_5.7.44 mysql_8.0.45 mysql_8.4.8 mysql_9.6.0"
ts="log-bin skip-log-bin"
option=no-crash

function do_test() {
  echo; dir=msb_${v//./_}; df="+%Y-%m-%d %T.%N"
  dbda="-c innodb_flush_log_at_trx_commit=1 -c sync_binlog=1 -c server-id=1"

  dbdeployer deploy single $v $dbda -c $t | pv -tN "${v/mysql_/} $t" > /dev/null
  ./$dir/use test <<< "
    drop table if exists t;
    create table t(id int primary key, v int);
    insert into t values (1,1);"

  # yes | mysql to be as quick as possible.
  # in a sub-shell to avoid job noise for running in bg.
  ( touch running
    ( yes "select now(6), v from t;"    |
        ./$dir/use -N test 2> /dev/null | # silence stderr to avoid noise of pkill yes.
        grep -e '2$' | {
          read l; echo "SELECT $l"
          test "$option" != crash || sudo bash -c "echo c > /proc/sysrq-trigger"
          pkill yes; rm running; } )& )

  update_delay 500
  echo "START  $(date "$df")"
  ./$dir/use -N test <<< "do sleep(1); update t set v = 2; select now(6)" | sed 's/^/UPDATE /'
  while sleep 0.1; do test -e running || break; done

  update_delay 0; ./$dir/stop > /dev/null; rm -rf $dir
}

function run_tests() {
  update_delay 0; cd ~/sandboxes
  for v in $vs; do for t in $ts; do do_test; done; done | sed -e "s/2026-..-.. //"
}

run_tests

5.7.44 log-bin: 0:00:08
START  20:06:48.659021031
UPDATE 20:06:54.799731
SELECT 20:06:54.799759  2

5.7.44 skip-log-bin: 0:00:08
START  20:07:06.713410410
SELECT 20:07:07.720290  2
UPDATE 20:07:08.748181

8.0.45 log-bin: 0:00:22
START  20:07:34.646781332
UPDATE 20:07:38.220510
SELECT 20:07:38.220472  2

8.0.45 skip-log-bin: 0:00:21
START  20:08:03.378534340
SELECT 20:08:04.388685  2
UPDATE 20:08:05.932532

8.4.8 log-bin: 0:00:31
START  20:08:41.345671123
UPDATE 20:08:43.916105
SELECT 20:08:43.916052  2

8.4.8 skip-log-bin: 0:00:29
START  20:09:17.427508856
SELECT 20:09:18.438169  2
UPDATE 20:09:19.468077

9.6.0 log-bin: 0:00:30
START  20:09:53.133355395
UPDATE 20:09:55.692126
SELECT 20:09:55.692047  2

9.6.0 skip-log-bin: 0:00:29
START  20:10:28.335975812
SELECT 20:10:29.346578  2
UPDATE 20:10:30.380099

Below shows the data loss situation with 8.4 and without binlogs: we observed 2 before panicking Linux, and we have 1 after crash recovery.  Not shown but tested: it is the same for 5.7, 8.0 and 9.6.

vs=mysql_8.4.8; ts=skip-log-bin; option=crash
run_tests

8.4.8 skip-log-bin: 0:00:31
START  20:12:47.087909239
SELECT 20:12:48.098555  2
Read from remote host 3.128.30.80: Connection reset by peer
Connection to 3.128.30.80 closed.

# connect back...

# start MySQL, and query the data.
( cd ~/sandboxes/msb_*; ./start > /dev/null; ./use -N test <<< "select v from t"; )
1

Thanks for looking into this, J-F Gagné

Quick correction on the second line of the Description:

NOT:
1. Visible to other sessions after COMMIT ~~returns~~

SHOULD BE:
1. Visible to other sessions after COMMIT *in memory*

The Description correctly mentions "in memory" in subsequent paragraphs.