Bug #89272 Binlog and Engine become inconsistent when binlog cache file gets out of space
Submitted: 16 Jan 22:52 Modified: 17 Jan 8:42
Reporter: Yoshinori Matsunobu (OCA) Email Updates:
Status: Verified Impact on me:
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.6.39 OS:Any
Assigned to: CPU Architecture:Any

[16 Jan 22:52] Yoshinori Matsunobu
In the following conditions, binary logs and storage engines become inconsistent. I could reproduce in 5.6.39.

- GTID is enabled
- Running many inserts in a single transaction. The whole transaction should be bigger than binlog cache size (so that temporary files are created under tmpdir)
- tmpdir gets out of space
- Then, some inserts fail because of errors like "ERROR 3 (HY000): Error writing file '/tmp/ML1pMt11' (Errcode: 28 - No space left on device)". The failed insert itself is rolled back. But other inserts not hitting the error continue in the transaction.
- Applications forcing commit. Then inserts that don't hit out of space get committed to the engine. But nothing is persisted in binlog. So on a replication environment, master has many data but slaves don't have anything.

See how to repeat to reproduce.

Credits to my colleague Jeff Jiang who originally found this problem.

How to repeat:
1. Run MySQL instance with at least the following settings.

default_storage_engine = InnoDB
tmpdir = /tmp
# datadir and log_bin should be in different partitions from tmpdir
datadir = /data/mysql/3306/
log_bin = /binlogs/binary-logs-3306
gtid_mode = ON
enforce_gtid_consistency = ON
binlog_cache_size = 256K

2. Prepare a script that inserts many rows in a single transaction. Here is an example, to insert 10k rows in one transaction.


print "begin;\n";
for(my $i=1; $i<=10000; $i++) {
  print "insert into t1 values ($i, $i, $i);\n";
print "commit;\n";

3. Fill up tmpdir partition (e.g. dd command)

4. Run the following queries
create table t1 (id int primary key, value int, value2 int) engine=innodb;
source /data/insert.sql

# the source command will fail because of out of space. Then force commit like below.


5. Check binlog and engines.
select count(*) from t1;
-> many rows, like 9996 rows are there.

show master status;
-> nothing in the transaction is persisted -> mismatch

Suggested fix:
- Force to roll back entire transaction, not just a failed statement
[16 Jan 23:26] Zhefu Jiang
When trying to understand what happened underneath, I noticed the following code


There was one INSERT hit ENOSPC, on that INSERT statement, there were two flush attempts: the first flush attempt during failed to flush to disk and moved the pointers in IO Cache so that it seems as if the write happened. The second flush during the rollback of the INSERT statement then simply ignored the previous buffer.

Then finally if we do a COMMIT, it also tries to flush the IO cache for several times, the first one hits the ENOSPC and the follow flushes just finishes without error. So COMMIT also passes, but we effectively discard all binlog generated.

Can we make sure that IO Cache doesn't discard content in this way?
[17 Jan 8:42] Umesh Shastry
Hello Yoshinori,

Thank you for the report and repeatable steps.