Bug #68953 Binlog write errors silently ignored
Submitted: 14 Apr 2013 15:29 Modified: 16 Oct 2015 14:30
Reporter: Laurynas Biveinis (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S1 (Critical)
Version:5.6.10, 5.6.12, 5.5.43 OS:Any
Assigned to: CPU Architecture:Any
Tags: binlog
Triage: Needs Triage: D2 (Serious)

[14 Apr 2013 15:29] Laurynas Biveinis
Description:
Binlog write errors (at least some of, in this instance originating in MYSQL_BIN_LOG::do_write_cache) are silently ignored. This seems to be different from bug 37148 in that this is a new 5.6 code, which presumably should have written with error handling in mind, and a regression from 5.5.

What makes it even more puzzling, at least to me as an outsider, is that the method in question has elaborate fault injection setup, which presumably should test the same very thing. I'd love to review the testcase that exercises it, but where is it?

How to repeat:
1) patch MYSQL_BIN_LOG::do_write_cache to inject a write fault at the first non-carry my_b_write() call:

=== modified file 'sql/binlog.cc'
--- sql/binlog.cc	2012-12-11 16:29:42 +0000
+++ sql/binlog.cc	2013-04-14 15:11:11 +0000
@@ -5189,7 +5189,9 @@
             int4store(ev + EVENT_LEN_OFFSET, event_len + BINLOG_CHECKSUM_LEN);
             remains= fix_log_event_crc(cache->read_pos, hdr_offs, event_len,
                                        length, &crc);
-            if (my_b_write(&log_file, ev, 
+            DBUG_EXECUTE_IF("fail_binlog_write_1",
+                            errno= 28; DBUG_RETURN(ER_ERROR_ON_WRITE););
+            if (my_b_write(&log_file, ev,
                            remains == 0 ? event_len : length - hdr_offs))
               DBUG_RETURN(ER_ERROR_ON_WRITE);
             if (remains == 0)

2) Run this MTR testcase, originating from Percona Server 5.5:

source include/have_debug.inc;
source include/have_innodb.inc;
source include/have_log_bin.inc;
source include/have_binlog_format_mixed_or_statement.inc;

# This one doesn't hit in 5.6!
CALL mtr.add_suppression("Error writing file 'master-bin'");

RESET MASTER;

CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb;
INSERT INTO t1 VALUES(0);
SET SESSION debug='+d,fail_binlog_write_1';
# --error ER_ERROR_ON_WRITE - isn't returned on 5.6!
INSERT INTO t1 VALUES(1);
# --error ER_ERROR_ON_WRITE
INSERT INTO t1 VALUES(2);
SET SESSION debug='';
INSERT INTO t1 VALUES(3);
SELECT * FROM t1;

--replace_regex /\/\* xid=.* \*\//\/* XID *\// /Server ver: .*, Binlog ver: .*/Server ver: #, Binlog ver: #/ /table_id: [0-9]+/table_id: #/
--replace_column 1 BINLOG 2 POS 5 ENDPOS
SHOW BINLOG EVENTS;

DROP TABLE t1;

3) Observe no error and the following output:

worker[1] Using MTR_BUILD_THREAD 300, with reserved ports 13000..13009
CALL mtr.add_suppression("Error writing file 'master-bin'");
RESET MASTER;
CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb;
INSERT INTO t1 VALUES(0);
SET SESSION debug='+d,fail_binlog_write_1';
INSERT INTO t1 VALUES(1);
INSERT INTO t1 VALUES(2);
SET SESSION debug='';
INSERT INTO t1 VALUES(3);
SELECT * FROM t1;
a
0
3
SHOW BINLOG EVENTS;
Log_name	Pos	Event_type	Server_id	End_log_pos	Info
BINLOG	POS	Format_desc	1	ENDPOS	Server ver: #, Binlog ver: #
BINLOG	POS	Query	1	ENDPOS	use `test`; CREATE TABLE t1 (a INT PRIMARY KEY) ENGINE=innodb
BINLOG	POS	Query	1	ENDPOS	BEGIN
BINLOG	POS	Query	1	ENDPOS	use `test`; INSERT INTO t1 VALUES(0)
BINLOG	POS	Xid	1	ENDPOS	COMMIT /* XID */
BINLOG	POS	Query	1	ENDPOS	BEGIN
BINLOG	POS	Query	1	ENDPOS	use `test`; INSERT INTO t1 VALUES(3)
BINLOG	POS	Xid	1	ENDPOS	COMMIT /* XID */
DROP TABLE t1;
main.binlog_ioerr                        [ pass ]     71

Thus, we have no trace of rows with 1 and 2 in binlog nor in the table (it is good that they agree), but no trace of any error reported neither.
[14 Apr 2013 15:38] Laurynas Biveinis
5.6.10 to be exact.
[14 Apr 2013 15:40] Laurynas Biveinis
My comment that this is a regression from 5.5 might be incorrect. It may or may not fail in 5.5, I haven't checked.
[14 Apr 2013 16:05] Laurynas Biveinis
The MTR testcase origin is actually MariaDB 5.5. It was imported to Percona Server as a part of their group commit feature.
[15 Apr 2013 18:28] Sveta Smirnova
Thank you for the report.

Verified as described.
[5 May 2014 11:24] Shane Bester
related: http://bugs.mysql.com/bug.php?id=51014
[23 Apr 2015 4:41] Laurynas Biveinis
Adding "regression" tag
[23 Apr 2015 23:35] Santosh Praneeth Banda
Clients think the transaction is succeeded even though binlog flush fails and the transaction is rolled back. This is serious issue and I don't understand why this is not fixed yet. I could fix this bug using the following patch

@@ -6283,6 +6283,8 @@ bool MYSQL_BIN_LOG::write_cache(THD *thd, binlog_cache_data *cache_data,

@@ -6341,6 +6343,7 @@ err:
                     errno, my_strerror(errbuf, sizeof(errbuf), errno));
   }
   thd->commit_error= THD::CE_FLUSH_ERROR;
+  set_write_error(thd, cache_data->is_trx_cache());
[4 May 2015 9:40] Venkatesh Duggirala
Post from Developer:
====================

The same issue is happening on mysql-5.5.43 as well. 
Hence it is not a regression. (Removing Regression tag
[16 Oct 2015 14:30] David Moss
Thanks for your feedback, this has been fixed and the following was noted in the 5.6.28 and 5.7.10 changelogs:

Fatal errors encountered during flushing or synchronizing the binary log were being ignored. Such errors are now caught and handled depending on the setting of binlog_error_action.

Additionally the documentation for binlog_error_action has also been improved:
https://dev.mysql.com/doc/refman/5.7/en/replication-options-binary-log.html#sysvar_binlog_...