| Bug #119759 | InnoDB allow reads on non-durable data, which can be lost after crash(visibility before durability window) | ||
|---|---|---|---|
| Submitted: | 22 Jan 18:55 | Modified: | 22 Jan 21:14 |
| Reporter: | Divyam Jaiswal | Email Updates: | |
| Status: | Open | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious) |
| Version: | 8.0.42 | OS: | Any |
| Assigned to: | CPU Architecture: | Any | |
| Tags: | data loss, innodb, innodb_flush_log_at_trx_commit | ||
[22 Jan 21:14]
Divyam Jaiswal
Mysql source code comment mentions that this could happen but there is no official mysql documentation for the same. https://github.com/mysql/mysql-server/blob/mysql-cluster-8.0.42/storage/innobase/trx/trx0t...
[1 Feb 20:46]
Jean-François Gagné
Thanks for the interesting bug report Divyam Jaiswal. I am not on the MySQL Verification Team, but this caught my attention and I thought I would join the discussion. > This is especially problematic in high-concurrency environments with abrupt server termination. I was able to reproduce under very low concurrency (only 2 sessions). The key to reproducing is to slow-down disk writes, which I have a trick for (link below). And note that such high latency writes can happen with remote storage (like in the Cloud), so this situation is not only theoretical. https://jfg-mysql.blogspot.com/2024/06/trick-to-simulate-higher-latency-ios-on-linux.html Below, with IOs being delayed by 500 ms (update_delay 500, see how it works in post linked above), in all tests without binlogs, the data is visible (SELECT) significantly (1000 ms) before the transaction commits (UPDATE). This is a symptom of the behavior reported here. With binlogs, SELECT and UPDATE are very close. So far, we have not seen that observed data is lost, but we will further down below. vs="mysql_5.7.44 mysql_8.0.45 mysql_8.4.8 mysql_9.6.0" ts="log-bin skip-log-bin" option=no-crash function do_test() { echo; dir=msb_${v//./_}; df="+%Y-%m-%d %T.%N" dbda="-c innodb_flush_log_at_trx_commit=1 -c sync_binlog=1 -c server-id=1" dbdeployer deploy single $v $dbda -c $t | pv -tN "${v/mysql_/} $t" > /dev/null ./$dir/use test <<< " drop table if exists t; create table t(id int primary key, v int); insert into t values (1,1);" # yes | mysql to be as quick as possible. # in a sub-shell to avoid job noise for running in bg. ( touch running ( yes "select now(6), v from t;" | ./$dir/use -N test 2> /dev/null | # silence stderr to avoid noise of pkill yes. grep -e '2$' | { read l; echo "SELECT $l" test "$option" != crash || sudo bash -c "echo c > /proc/sysrq-trigger" pkill yes; rm running; } )& ) update_delay 500 echo "START $(date "$df")" ./$dir/use -N test <<< "do sleep(1); update t set v = 2; select now(6)" | sed 's/^/UPDATE /' while sleep 0.1; do test -e running || break; done update_delay 0; ./$dir/stop > /dev/null; rm -rf $dir } function run_tests() { update_delay 0; cd ~/sandboxes for v in $vs; do for t in $ts; do do_test; done; done | sed -e "s/2026-..-.. //" } run_tests 5.7.44 log-bin: 0:00:08 START 20:06:48.659021031 UPDATE 20:06:54.799731 SELECT 20:06:54.799759 2 5.7.44 skip-log-bin: 0:00:08 START 20:07:06.713410410 SELECT 20:07:07.720290 2 UPDATE 20:07:08.748181 8.0.45 log-bin: 0:00:22 START 20:07:34.646781332 UPDATE 20:07:38.220510 SELECT 20:07:38.220472 2 8.0.45 skip-log-bin: 0:00:21 START 20:08:03.378534340 SELECT 20:08:04.388685 2 UPDATE 20:08:05.932532 8.4.8 log-bin: 0:00:31 START 20:08:41.345671123 UPDATE 20:08:43.916105 SELECT 20:08:43.916052 2 8.4.8 skip-log-bin: 0:00:29 START 20:09:17.427508856 SELECT 20:09:18.438169 2 UPDATE 20:09:19.468077 9.6.0 log-bin: 0:00:30 START 20:09:53.133355395 UPDATE 20:09:55.692126 SELECT 20:09:55.692047 2 9.6.0 skip-log-bin: 0:00:29 START 20:10:28.335975812 SELECT 20:10:29.346578 2 UPDATE 20:10:30.380099 Below shows the data loss situation with 8.4 and without binlogs: we observed 2 before panicking Linux, and we have 1 after crash recovery. Not shown but tested: it is the same for 5.7, 8.0 and 9.6. vs=mysql_8.4.8; ts=skip-log-bin; option=crash run_tests 8.4.8 skip-log-bin: 0:00:31 START 20:12:47.087909239 SELECT 20:12:48.098555 2 Read from remote host 3.128.30.80: Connection reset by peer Connection to 3.128.30.80 closed. # connect back... # start MySQL, and query the data. ( cd ~/sandboxes/msb_*; ./start > /dev/null; ./use -N test <<< "select v from t"; ) 1 Thanks for looking into this, J-F Gagné

Description: In MySQL InnoDB (tested on 8.0.42) with innodb_flush_log_at_trx_commit = 1 and no binary log, a row inserted in a transaction can be: 1. Visible to other sessions after COMMIT returns 2. Then server is killed before the redo log flush completes 3. After restart and crash recovery, the row is lost 4. A concurrent reader had seen the row before crash Because the row was visible before crash but lost after, this may appear as a dirty read followed by rollback — though strictly it is due to visibility before durability rather than classic isolation violation. This scenario reliably reproduces with a high-concurrency workload. Expected Behavior With innodb_flush_log_at_trx_commit = 1, a committed transaction should be durable and survive a server crash after COMMIT returns. A statement such as: INSERT INTO t_ids (id) VALUES (X);COMMIT;SELECT id FROM t_ids WHERE id = X; should persist the row X after a crash following the COMMIT. Observed Behavior Under a specific timing window, when killing the server immediately after the COMMIT but before the redo log flush completes, the row: * is visible to other sessions before the crash * is missing after restart * InnoDB crash recovery rolls it back This replicates a scenario where: * Transaction T1 commits in memory and becomes visible * Reader R sees the row * kill -9 mysqld happens before redo flush * After restart, the row is absent Thus the server appears to have lost a committed transaction even though innodb_flush_log_at_trx_commit = 1. Why This Happens InnoDB’s commit path does: 1. Mark transaction as committed in memory 2. Release locks 3. Make row visible 4. Then flush redo log to disk If a crash occurs in step 4 before redo flush completes, crash recovery does not find the commit in redo and rolls back the transaction. The row becomes invisible after restart even though it was visible before crash. This matches the behavior described in other contexts where innodb_flush_log_at_trx_commit != 1 can lose transactions, but here it happens even with =1 due to the internal ordering of visibility vs. durability. Impact Even with the strictest durability setting (innodb_flush_log_at_trx_commit=1) and REPEATABLE READ, this can lead to lost visible transactions after a crash. This is especially problematic in high-concurrency environments with abrupt server termination. How to repeat: Environment MySQL Server version: 8.0.42 Storage engine: InnoDB innodb_flush_log_at_trx_commit = 1 sync_binlog = 0 binlog disabledIsolation level: default (REPEATABLE READ) Operating system: any GNU/Linux Tested with simple primary-key insert workloads Steps to Reproduce 1. Create a table: CREATE DATABASE test_crash; USE test_crash; CREATE TABLE t_ids ( id BIGINT NOT NULL PRIMARY KEY, created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP ) ENGINE = InnoDB; 2. Start multiple concurrent writers inserting increasing IDs in transactions with autocommit OFF. 3. Start a reader thread polling latest IDs. 4. Randomly crash the server: sleep 0.1 && kill -9 $(pidof mysqld) 5. After restart, observe that some IDs previously observed by the reader are missing. T1 @ T0:000 — T1 START TRANSACTION T1 @ T0:005 — INSERT id=1005 (redo buffer & buffer pool updated, no fsync) T1 @ T0:010 — COMMIT (InnoDB marks committed in memory & releases locks) R @ T0:011 — Reader sees id=1005 Crash @ T0:012 — kill -9 before redo log flush to diskRestart @ T0:050 — crash recovery runs Recovery: no redo commit for T1, rollback undo R @ T0:052 — SELECT does not return id=1005 Suggested fix: Either: 1. Guarantee that visibility to other sessions occurs only after redo flush reaches disk, or 2. Document this window explicitly in MySQL/InnoDB documentation, as innodb_flush_log_at_trx_commit=1 does not in practice eliminate the visibility-before-durability window under crash conditions. A related discussion in the MySQL Bug System notes that committing transactions may not be flushed to disk before a crash, leading to data loss under some configurations: https://bugs.mysql.com/75519 — “Crash server after a transaction is committed, but before redo log is flushed. After recovery, the transaction will have been rolled back.” Also, InnoDB crash recovery documentation confirms that only redo logs on disk determine recovery and unflushed transactions are rolled back: https://dev.mysql.com/doc/en/innodb-recovery.html