Bug #59622 | Data loss after crash at point crash_commit_after_prepare for binlog crash-safe | |
---|---|---|---
Submitted: | 19 Jan 2011 22:29 | Modified: | 10 Feb 2011 8:03
Reporter: | Serge Kozlov | Email Updates: |
Status: | Duplicate | Impact on me: |
Category: | MySQL Server: InnoDB storage engine | Severity: | S2 (Serious)
Version: | 5.5, trunk | OS: | Any
Assigned to: | Assigned Account | CPU Architecture: | Any
Tags: | crash, crash_commit_after_prepare, innodb, nuts, replication | |
[19 Jan 2011 22:29]
Serge Kozlov
[21 Jan 2011 4:56]
Daogang Qu
Test case 1 in rpl_crash_safe_master_innodb.test addresses this problem. It runs fine, and the data is consistent on master and slave after the master crashes at the crash_commit_after_prepare point. Please don't forget to suppress the following warnings and errors:

call mtr.add_suppression("Attempting backtrace");
call mtr.add_suppression("allocated tablespace *., old maximum was 0");
call mtr.add_suppression("Error in Log_event::read_log_event()");
call mtr.add_suppression("Buffered warning: Performance schema disabled");

CREATE TABLE t1(a LONGBLOB) ENGINE=INNODB;

-- echo # Test case1: Set DEBUG POINT before binlog to make
-- echo # the master crash for transaction
BEGIN;
let $rows= 3;
WHILE($rows)
{
  INSERT INTO t1 (a) VALUES (REPEAT('a',2));
  dec $rows;
}

# Write file to make mysql-test-run.pl expect crash and restart
-- exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect

SET SESSION debug="d,crash_commit_after_prepare";

# Run the crashing query
-- error 2013
COMMIT;
-- source include/wait_until_disconnected.inc
-- enable_reconnect
-- echo # Restart the master server
-- exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
-- source include/wait_until_connected_again.inc
-- disable_reconnect

-- echo # Test the transaction statements will not be binlogged
-- source include/show_binlog_events.inc

-- echo # On master, test the data will be rolled back after restart.
SELECT COUNT(*) FROM t1;

sync_slave_with_master;
-- echo # On slave, test replication will work fine, and the data
-- echo # is not replicated
SELECT COUNT(*) FROM t1;

Result:
......
# Restart the master server
# Test the transaction statements will not be binlogged
show binlog events in 'master-bin.000001' from <binlog_start>;
Log_name  Pos  Event_type  Server_id  End_log_pos  Info
master-bin.000001  #  Query  #  #  use `test`; CREATE TABLE t1(a LONGBLOB) ENGINE=INNODB
# On master, test the data will be rolled back after restart.
SELECT COUNT(*) FROM t1;
COUNT(*)
0
# On slave, test replication will work fine, and the data
# is not replicated
SELECT COUNT(*) FROM t1;
COUNT(*)
0
DROP TABLE t1;

What is the difference from yours? I will create a test case based on the difference to double-check the problem.
[21 Jan 2011 10:02]
Serge Kozlov
You use a simple table structure for the MTR test case. The Nuts test generates random statements inside the transaction. That could be the reason why the results differ between MTR and Nuts.
[24 Jan 2011 7:48]
Daogang Qu
I reproduced the problem based on the newest version of nuts, as follows:

ok - synchronize slave 2 with master 1
# Dump file 1: /home/daogang/bzrwork/pushwl5493/nuts/var/tmp/rep_func::crash_safe_master-row-crashpoint1/1_test.t1.sql
# Dump file 2: /home/daogang/bzrwork/pushwl5493/nuts/var/tmp/rep_func::crash_safe_master-row-crashpoint1/2_test.t1.sql
not ok - Query 'SELECT COUNT(*) FROM t1' returns different number of rows: 3, 4

#   Failed test 'Query 'SELECT COUNT(*) FROM t1' returns different number of rows: 3, 4'
#   at /home/daogang/bzrwork/pushwl5493/nuts/lib/My/Nuts/Library/Kernel/ServerResult.pm line 156.
# Query 'SELECT COUNT(*) FROM t1' returns different number of rows: 3, 4
Failed 1/14 subtests

Please check the log:
var/log/rep_func\:\:crash_safe_master-row-crashpoint1/rep_func\:\:crash_safe_master-row-crashpoint1-report.log

I found the following executed queries:

QUERY 1 "INSERT INTO t1 VALUES (0, 0, 0, 16849, 5025236370650103808, 0.22E-22, '6695-06-17 14:46:03', 0)"
GET_CONNECTION 1 root
QUERY 1 "DELETE FROM t1 WHERE c_05 = 5025236370650103808"
GET_CONNECTION 1 root

The row is deleted right after it is inserted. That is why 'SELECT COUNT(*) FROM t1' returns a different number of rows (3 vs. 4). So please update the test to remove the statement QUERY 1 "DELETE FROM t1 WHERE c_05 = 5025236370650103808"; then it works fine. So it is a test issue, not a crash-safe issue.
[24 Jan 2011 8:15]
Daogang Qu
Please make sure the transaction plugin is loaded. It's really strange, though. I'm generating an MTR test case for it.
[24 Jan 2011 9:51]
Daogang Qu
CREATE TABLE t1 (c_01 TINYINT SIGNED, c_02 DOUBLE SIGNED, c_03 MEDIUMINT SIGNED,
                 c_04 SMALLINT UNSIGNED NULL, c_05 BIGINT UNSIGNED NOT NULL,
                 c_06 DOUBLE UNSIGNED, c_07 DATETIME NULL, c_08 TINYINT SIGNED,
                 PRIMARY KEY (c_05));

What's the default engine for t1 in nuts?
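For reference, one way to check which engine the table actually ended up with (just a sketch; it assumes the server from the failing run is still available and that t1 lives in the test schema):

-- Show the full table definition, including the ENGINE clause.
SHOW CREATE TABLE test.t1;
-- Or query the data dictionary directly.
SELECT ENGINE FROM INFORMATION_SCHEMA.TABLES
 WHERE TABLE_SCHEMA = 'test' AND TABLE_NAME = 't1';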
[24 Jan 2011 11:16]
Daogang Qu
The problem still can't be reproduced even when using a complex data structure for the table, as follows:

-- source include/not_valgrind.inc
-- source include/not_embedded.inc
-- source include/master-slave.inc
-- source include/have_debug.inc
-- source include/have_binlog_format_row.inc

# Reset master
connection slave;
--source include/stop_slave.inc
connection master;
RESET MASTER;
connection slave;
--source include/start_slave.inc
connection master;

call mtr.add_suppression("Attempting backtrace");
call mtr.add_suppression("allocated tablespace *., old maximum was 0");
call mtr.add_suppression("Error in Log_event::read_log_event()");
call mtr.add_suppression("Buffered warning: Performance schema disabled");

-- let $binlog_start= query_get_value(SHOW MASTER STATUS, Position, 1)
-- let $binlog_file= query_get_value(SHOW MASTER STATUS, File, 1)

#CREATE TABLE t1(a LONGBLOB) ENGINE=INNODB;
CREATE TABLE t1 (c_01 TINYINT SIGNED, c_02 DOUBLE SIGNED, c_03 MEDIUMINT SIGNED,
                 c_04 SMALLINT UNSIGNED NULL, c_05 BIGINT UNSIGNED NOT NULL,
                 c_06 DOUBLE UNSIGNED, c_07 DATETIME NULL, c_08 TINYINT SIGNED,
                 PRIMARY KEY (c_05)) ENGINE=INNODB;
CREATE TABLE t2(a LONGBLOB) ENGINE=INNODB;

-- echo # Test case1: Set DEBUG POINT before binlog to make
-- echo # the master crash for transaction
DELETE FROM t1 WHERE c_05 = 9785199741266690048;
UPDATE t1 SET c_05 = 9785199741266690048, c_04= 14772, c_02= 0 WHERE c_05 = 9785199741266690048;
BEGIN;
INSERT INTO t1 VALUES (0, 0, 0, 16849, 5025236370650103808, 0.22E-22, '6695-06-17 14:46:03', 0);
UPDATE t1 SET c_05 = 13382162220930826240, c_01= 0, c_03= -2556622, c_02= 0, c_08= -28 WHERE c_05 = 13382162220930826240;
COMMIT;
BEGIN;
REPLACE INTO t1 VALUES (0, 0, -2780698, 36412, 5169992599251648512, 0, '9991-09-27 13:33:12', 0);
UPDATE t1 SET c_05 = 5169992599251648512, c_04= 16331, c_07= '3827-03-28 01:24:56', c_03= -3763596, c_06= 0, c_08= 0 WHERE c_05 = 5169992599251648512;
REPLACE INTO t1 VALUES (0, 0.43E-113, -3896527, 4379, 5555829872288268288, 0.80E-284, '6241-04-04 07:10:44', 0);
BEGIN;
COMMIT;
INSERT INTO t1 VALUES (-32, 0, -2042839, 51290, 8041760290208677888, 0.59E-27, '9391-09-17 15:07:00', 0);
DELETE FROM t1 WHERE c_05 = 18047999901170139136;
UPDATE t1 SET c_05 = 5555829872288268288, c_07= '9238-02-24 20:18:45' WHERE c_05 = 5555829872288268288;
BEGIN;
ALTER TABLE t1 DROP c_08;
REPLACE INTO t1 VALUES (46, 0, 0, 8894, 10602834005396160512, 0, '9623-12-02 13:34:23');
BEGIN;
COMMIT;
BEGIN;
INSERT INTO t2 (a) VALUES (REPEAT('a',4));
INSERT INTO t2 (a) VALUES (REPEAT('a',4));
DELETE FROM t1 WHERE c_05 = 5025236370650103808;
INSERT INTO t2 (a) VALUES (REPEAT('a',4));

# Write file to make mysql-test-run.pl expect crash and restart
-- exec echo "wait" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect

SET SESSION debug="d,crash_commit_after_prepare";

# Run the crashing query
-- error 2013
COMMIT;
-- source include/wait_until_disconnected.inc
-- enable_reconnect
-- echo # Restart the master server
-- exec echo "restart" > $MYSQLTEST_VARDIR/tmp/mysqld.1.expect
-- source include/wait_until_connected_again.inc
-- disable_reconnect

-- echo # Test the transaction statements will not be binlogged
-- source include/show_binlog_events.inc

-- echo # On master, test the data will be rolled back after restart.
SELECT COUNT(*) FROM t1;

sync_slave_with_master;
-- echo # On slave, test replication will work fine, and the data
-- echo # is not replicated
SELECT COUNT(*) FROM t1;

connection master;
DROP TABLE t1, t2;
sync_slave_with_master;
-- source include/stop_slave.inc

So the above test guarantees that the transaction is crash-safe when the crash_commit_after_prepare point is set. Some other problem must be causing the inconsistency, so the problem is outside the scope of WL#5493.
[25 Jan 2011 8:34]
Daogang Qu
All the tests pass if we set both crash_commit_before and crash_commit_after_prepare as crash points, as follows:

my @crash_points = (
    "crash_commit_before",
    "crash_commit_after_prepare"
    #"crash_commit_after_log",
    #"crash_commit_after",
    #"crash_before_writing_xid",
    #"half_binlogged_transaction"
);

daogang@PC:~/bzrwork/pushwl5493/nuts$ ./bin/nuts.pl --build=../../mysql-trunk-wl5493 --suite=rep_func --test=crash_safe_master --verbose --combination=row-crashpoint1
[16:19:16] rep_func::crash_safe_master-row-crashpoint1 ..
1..14
ok - executing command SET GLOBAL BINLOG_FORMAT='ROW'
ok - executing command SET SESSION BINLOG_FORMAT='ROW'
ok - executing command SET GLOBAL BINLOG_FORMAT='ROW'
ok - attach slave 2 to master 1
ok - synchronize slave 2 with master 1
ok - executing command CREATE DATABASE IF NOT EXISTS test
ok - executing command USE test
# BEGIN
......
# SET SESSION DEBUG='d,crash_commit_after_prepare'
# COMMIT
ok - executing command SET SESSION DEBUG='d,crash_commit_after_prepare'
ok - executing command SET GLOBAL BINLOG_FORMAT='ROW'
ok - executing command SET SESSION BINLOG_FORMAT='ROW'
ok - start slave 2
ok - wait that slave 2 starts.
ok - synchronize slave 2 with master 1
ok - Databases have no difference
ok 152863 ms
[16:21:50]
All tests successful.
Files=1, Tests=14, 154 wallclock secs ( 0.54 usr 0.14 sys + 8.05 cusr 0.53 csys = 9.26 CPU)
Result: PASS
daogang@PC:~/bzrwork/pushwl5493/nuts$
[25 Jan 2011 8:46]
Daogang Qu
No test is executed if we set only crash_commit_after_prepare as the crash point, as follows:

my @crash_points = (
    "crash_commit_after_prepare"
);

daogang@PC:~/bzrwork/pushwl5493/nuts$ ./bin/nuts.pl --build=../../mysql-trunk-wl5493 --suite=rep_func --test=crash_safe_master --verbose --combination=row-crashpoint1
[16:38:06]
Files=0, Tests=0, 0 wallclock secs ( 0.02 usr + 0.00 sys = 0.02 CPU)
Result: NOTESTS

So I don't understand how data could be lost after a crash at the crash_commit_after_prepare point for the binlog crash-safe work when we have no tests at all on that crash point. It's a test issue, not a crash-safe issue.
[25 Jan 2011 10:18]
Serge Kozlov
This Nuts test uses "combinations": the same test is run several times with different options. The original test is rep_func::crash_safe_master. The combinations are a list (array) of binlog formats and a list of crash points. Running a test with a combination has the following format: <test_name>-<combination1>-<combination2>-<combinationN>. The Nuts option '--combination' filters the combinations. So the command line

<NUTS_HOME>./bin/nuts.pl --build=<PATH_TO_TREE> --suite=rep_func --test=crash_safe_master --verbose --combination=row-crashpoint1

means "run test crash_safe_master from suite rep_func only for combination row-crashpoint1". "crashpoint1" is $crash_points[1], in other words the second value of that array. It is a mistake to ask for the second combination when the array has only one element.
[25 Jan 2011 19:16]
Serge Kozlov
Test case for MTR
Attachment: bug59622.tgz (application/x-gzip, text), 148.62 KiB.
[25 Jan 2011 19:18]
Serge Kozlov
The test case added above is based on the NUTS dump of statements for rep_func::crash_safe_master and on the MTR test (test1) rpl_crash_safe_master.test.
[26 Jan 2011 20:31]
Serge Kozlov
The test is updated. It now supports engines, and the correct command line is the following:

<NUTS_HOME>./bin/nuts.pl --build=<PATH_TO_TREE> --suite=rep_func --test=crash_safe_master --verbose --combination=row-innodb-crashpoint1
[27 Jan 2011 11:05]
Daogang Qu
We can reproduce the problem based on the above huge test. The problem is that the crashed transaction is not rolled back successfully by the InnoDB engine (and maybe other engines) during server recovery. The strange thing is that hton->rollback_by_xid(...) is called and executes successfully. Please take a look:

static my_bool xarecover_handlerton(THD *unused, plugin_ref plugin, void *arg)
{
  .....
  hton->rollback_by_xid(hton, info->list+i);
  .....
}

innobase_rollback_by_xid(
/*=====================*/
        handlerton*     hton,   /*!< in: InnoDB handlerton */
        XID*            xid)    /*!< in: X/Open XA transaction identification */
{
        trx_t*  trx;

        DBUG_ASSERT(hton == innodb_hton_ptr);

        trx = trx_get_trx_by_xid(xid);

        if (trx) {
                return(innobase_rollback_trx(trx));
        } else {
                return(XAER_NOTA);
        }
}

static int innobase_rollback_trx(
/*==================*/
        trx_t*  trx)    /*!< in: transaction */
{
        int     error = 0;

        DBUG_ENTER("innobase_rollback_trx");
        DBUG_PRINT("trans", ("aborting transaction"));

        /* Release a possible FIFO ticket and search latch. Since we will
        reserve the trx_sys_t::mutex, we have to release the search system
        latch first to obey the latching order. */

        innobase_release_stat_resources(trx);

        /* If we had reserved the auto-inc lock for some table (if
        we come here to roll back the latest SQL statement) we
        release it now before a possibly lengthy rollback */

        lock_unlock_table_autoinc(trx);

        error = trx_rollback_for_mysql(trx);

        DBUG_RETURN(convert_error_code_to_mysql(error, 0, NULL));
}

The above call chain works fine. So why can't InnoDB roll back the transaction successfully during server recovery?
[27 Jan 2011 12:22]
Marko Mäkelä
I have trouble understanding the note [27 Jan 12:05] by Daogang Qu. Crash recovery is supposed to work like this in InnoDB:

(1) apply the redo log (ib_logfile*) to reconstruct the pages that were dirty in the buffer pool during the crash,
(2) if a DDL transaction was running during the crash, roll it back,
(3) initiate a background thread to roll back all transactions that were active (not rolled back, committed, or XA PREPAREd) during the crash.

Client requests can be served during step (3). Clients can do XA RECOVER to list the prepared transactions inside InnoDB, and then do XA ROLLBACK or XA COMMIT. Bug #59440 fixes a race condition there.

I do not know the crash recovery outside InnoDB. Apparently, the server must store some XA information in the binlog, and it may automatically issue XA ROLLBACK (or possibly XA COMMIT) under certain conditions.

Daogang, can you please restate your question? Are you suggesting that XA ROLLBACK returns OK, even though InnoDB is not rolling back the transaction? If so, how did you determine that the transaction still exists? SELECT * FROM INFORMATION_SCHEMA.INNODB_TRX?
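For illustration, the manual path described above might look roughly like this (a sketch only; 'trx1' stands in for the xid of an externally managed XA transaction, not the server's internal binlog xid, which is resolved automatically during recovery):

-- List transactions still in PREPARED state inside InnoDB after recovery.
XA RECOVER;
-- Inspect active InnoDB transactions.
SELECT trx_id, trx_state, trx_started FROM INFORMATION_SCHEMA.INNODB_TRX;
-- Resolve a prepared, externally managed XA transaction by its xid:
XA ROLLBACK 'trx1';   -- or: XA COMMIT 'trx1'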
[28 Jan 2011 9:50]
Daogang Qu
Hi Marko, the patch for Bug #59440 is not based on the mysql-trunk tree, so it is not compatible with the mysql-trunk source code. The compiler complained that it can't find the 'conc_state' member of trx_t.
[28 Jan 2011 9:51]
Daogang Qu
Hi Marko, the scenario is as follows:

1. Set a DEBUG POINT at the 'crash_commit_after_prepare' crash point to make a transaction crash there.

2. Then restart the server after the crash. The prepared transaction should be rolled back during server recovery; normally the INSERT statement in the transaction will not actually insert data into the InnoDB tables, because the transaction is rolled back by invoking innobase_rollback_by_xid(...) (please correct me if I'm wrong). But in the crashed transaction of the above huge test, the data is inserted into the InnoDB table. That is the problem.

3. The strange thing is that the xid saved in the prepare phase of the transaction is retrieved correctly, and the transaction is looked up correctly by that xid when innobase_rollback_by_xid(...) is invoked. It looks as if the crashed data is rolled back, so why does the data end up inserted into the InnoDB table by that transaction?

4. To check step 3, I added some trace output in the innobase_rollback_by_xid(...) and ha_commit_trans(...) functions before the 'crash_commit_after_prepare' crash point, as follows:

innobase_rollback_by_xid(
/*=====================*/
        handlerton*     hton,   /*!< in: InnoDB handlerton */
        XID*            xid)    /*!< in: X/Open XA transaction identification */
{
        trx_t*  trx;

        DBUG_ASSERT(hton == innodb_hton_ptr);

        fprintf(stderr, "rollback input formatid== %ld, gtrid_length== %ld, bqual_length== %ld, data== %s. \n",
                xid->formatID, xid->gtrid_length, xid->bqual_length, xid->data);

        trx = trx_get_trx_by_xid(xid);

        fprintf(stderr, "rollback get trx formatid== %ld, gtrid_length== %ld, bqual_length== %ld, data== %s. \n",
                trx->xid.formatID, trx->xid.gtrid_length, trx->xid.bqual_length, trx->xid.data);

        if (trx) {
                return(innobase_rollback_trx(trx));
        } else {
                return(XAER_NOTA);
        }
}

int ha_commit_trans(THD *thd, bool all)
{
  ......
      if (! ha_info->is_trx_read_write())
        continue;
      /*
        Sic: we know that prepare() is not NULL since otherwise
        trans->no_2pc would have been set.
      */
      if ((err= ht->prepare(ht, thd, all)))
      {
        my_error(ER_ERROR_DURING_COMMIT, MYF(0), err);
        error= 1;
      }
      status_var_increment(thd->status_var.ha_prepare_count);
    }
    if (is)
      sql_print_information("prepare input formatid== %ld, gtrid_length== %ld, bqual_length== %ld, data== %s. \n",
        thd->transaction.xid_state.xid.formatID, thd->transaction.xid_state.xid.gtrid_length,
        thd->transaction.xid_state.xid.bqual_length, thd->transaction.xid_state.xid.data);
    DBUG_EXECUTE_IF("crash_commit_after_prepare", DBUG_SUICIDE(););
  ......
}

5. Run the test that reproduces the problem; the log then shows the following:

110128 12:17:32 [Note] prepare input formatid== 1, gtrid_length== 20, bqual_length== 0, data== MySQLXid^A.
......
110128 12:17:37 [Note] Starting crash recovery...
110128 12:17:37  InnoDB: Starting recovery for XA transactions...
110128 12:17:37  InnoDB: Transaction 6E7 in prepared state after recovery
110128 12:17:37  InnoDB: Transaction contains changes to 1 rows
110128 12:17:37  InnoDB: 1 transactions in prepared state after recovery
110128 12:17:37 [Note] Found 1 prepared transaction(s) in InnoDB
110128 12:17:37 [Note] rollback xid 'MySQLXid\1\0\0\0x\3\0\0\0\0\0\0'
rollback input formatid== 1, gtrid_length== 20, bqual_length== 0, data== MySQLXid^A.
rollback get trx formatid== 1, gtrid_length== 20, bqual_length== 0, data== MySQLXid^A.
110128 12:17:37 [Note] Crash recovery finished.
110128 12:17:37 [Note] Event Scheduler: Loaded 0 events

Based on the above analysis we can infer that this is an InnoDB problem. Marko, please take over the bug if the inference is reasonable. Thanks!
[28 Jan 2011 9:54]
Daogang Qu
The problem can be reproduced the attached test file.
Attachment: binlog_bug59622.test (application/octet-stream, text), 151.51 KiB.
[28 Jan 2011 9:55]
Daogang Qu
s/ reproduced / reproduced by
[3 Feb 2011 11:05]
Marko Mäkelä
With some help from Luis, I tweaked the binlog_bug59622.test so that I can reproduce the bug. I am now reducing the test case. The difference is on table t3. Curiously, if I replace

CREATE TABLE t0 SELECT * FROM t10;
RENAME TABLE t0 TO t3;

with

CREATE TABLE t3 SELECT * FROM t10;

then everything seems to be fine.
[3 Feb 2011 12:10]
Marko Mäkelä
Reduced test case (fails some of the time)
Attachment: innodb_bug59622.test (application/octet-stream, text), 3.01 KiB.
[9 Feb 2011 9:25]
Marko Mäkelä
During the binlog XA recovery, there is a delete-marked duplicate entry for SYS_TABLES.ID=0x11 that points to test/t0. Its successor entry is not delete-marked and points to test/t3. We could work around this inconsistency by advancing to the next matching record in dict_load_table_on_id() as long as the first matching records are delete-marked. Because the clustered index on SYS_TABLES.NAME is definitely unique, there is no need for a similar workaround in dict_load_table(). I will have to find out why the duplicate entry is possible and if it will eventually be purged. The index on SYS_TABLES.ID (DICT_TABLE_IDS_ID) is supposed to be unique.
[9 Feb 2011 12:43]
Marko Mäkelä
The immediate bug is that dict_load_table_on_id() fails to skip matching but delete-marked records. If the function is invoked before the purge gets a chance to run, it may fail to load the table. While testing this, I noticed a lot of garbage (delete-marked records) in the index ID_IND on SYS_TABLES.ID that was never purged. This appears to be caused by row_ins_must_modify(), which returns 0 when it should return ROW_INS_NEXT. The only caller of row_ins_must_modify() is prepared for the ROW_INS_NEXT return value, performing an insert as update-in-place when cursor->up_match >= enough_match. Looking at the source code, both bugs have been there ever since MySQL 3.23.53. Next, I will try to repeat this on MySQL 5.5 and 5.1.
[9 Feb 2011 13:47]
Marko Mäkelä
The dict_load_table_on_id() bug was fixed in Bug #53756 in MySQL 5.1 (built-in and InnoDB plugin), but for some reason it was apparently not merged to 5.5 or trunk. This explains why the test case is not repeatable in 5.1 but is in 5.5. The delete-marked garbage is there in the 5.1 built-in and the 5.1 plugin and probably in all versions of InnoDB. I have filed Bug #60049 for that.
[9 Feb 2011 13:59]
Marko Mäkelä
Jimmy, please find out why the merge of Bug #53756 was lost in 5.5 and trunk, and merge it again.
[10 Feb 2011 8:03]
Marko Mäkelä
This is a duplicate of Bug #53756. Let us track the lost merge there.