MySQL Bugs: #55283: assert trx0roll.c undo_no + 1 == trx->undo_no in trx_rollback_or_clean_all

Bug #55283	assert trx0roll.c undo_no + 1 == trx->undo_no in trx_rollback_or_clean_all_recov
Submitted:	15 Jul 2010 13:36	Modified:	17 Aug 2010 13:06
Reporter:	Mikhail Izioumtchenko	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB Plugin storage engine	Severity:	S3 (Non-critical)
Version:	5.1,5.5	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
the following rarely occurring UNIV_DEBUG
assert in txn rollback after crash recovery:

100714 19:33:12  InnoDB: Assertion failure in thread 1095674176 in file /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/trx/trx0roll.c line 980
InnoDB: Failing assertion: undo_no + 1 == trx->undo_no

 #5  0x0000003acbc31d10 in abort () from /lib64/libc.so.6
 #6  0x0000000000893dc4 in trx_roll_pop_top_rec_of_trx (trx=0x14355ef8,
     limit=0, roll_ptr=0x414e9f50, heap=0x2aabc800d570)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/trx/trx0roll.c:980
 #7  0x000000000095a189 in row_undo (node=0x2aabc809fa58, thr=0x2aabc808c4f8)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/row/row0undo.c:264
 #8  0x000000000095a42b in row_undo_step (thr=0x2aabc808c4f8)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/row/row0undo.c:367
 #9  0x0000000000951d87 in que_thr_step (thr=0x2aabc808c4f8)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/que/que0que.c:1245
 #10 0x0000000000951f9e in que_run_threads_low (thr=0x2aabc808c4f8)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/que/que0que.c:1310
 #11 0x00000000009520b5 in que_run_threads (thr=0x2aabc8090fe8)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/que/que0que.c:1347
 #12 0x0000000000893066 in trx_rollback_active (trx=0x14355ef8)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/trx/trx0roll.c:472
 #13 0x00000000008933be in trx_rollback_or_clean_recovered (all=1)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/trx/trx0roll.c:585
 #14 0x0000000000893491 in trx_rollback_or_clean_all_recovered (
     arg=0x7fff94bf941c)
     at /spare2/mizioumt/ctc/mysql_src_c55/storage/innobase/trx/trx0roll.c:621

980             ut_ad(undo_no + 1 == trx->undo_no);
(gdb) p undo_no
$1 = 26
(gdb) p trx->undo_no
$2 = 28

the assert then repeats itself in crash recovery making the database unrecoverable.

How to repeat:
standard stress/recovery test (srcover.sh detach kbs 4).
Reproducibility is very low with 5.1 Plugin, somewhat higher
with 5.5.

I could not find any obvious error, other than missing assertions. It seems to me that trx->undo_no should be a monotonically growing sequence for a given transaction. It should be incremented by 1 every time a record is added to the transaction's update_undo or insert_undo logs.

The function trx_undo_arr_store_info() should really never fail. I suspect that the reason of the crash is that there are two undo log records for the transactions with the same undo_no.

Michael, can you please try this patch? (You might want to keep it in all runs with all patches you are testing, for increased coverage.)

=== modified file 'storage/innobase/trx/trx0roll.c'
--- storage/innobase/trx/trx0roll.c	revid:marko.makela@oracle.com-20100810102248-xrdjie2de4r9h3ed
+++ storage/innobase/trx/trx0roll.c	2010-08-11 13:09:48 +0000
@@ -941,6 +941,7 @@ try_again:
 	} else if (upd_undo->top_undo_no > ins_undo->top_undo_no) {
 		undo = upd_undo;
 	} else {
+		ut_a(upd_undo->top_undo_no < ins_undo->top_undo_no);
 		undo = ins_undo;
 	}
 
@@ -1009,6 +1009,7 @@ try_again:
 
 		mtr_commit(&mtr);
 
+		ut_error;
 		goto try_again;
 	}

Thanks, Michael! Can you please replace the failing assertion
ut_ad(undo_no + 1 == trx->undo_no);
with
ut_a(undo_no + 1 == trx->undo_no);
to improve coverage (in builds without UNIV_DEBUG)?

This is a long shot, but could this share a common cause with Bug #55284?

I understood that MySQL can automatically issue XA ROLLBACK (and possibly XA COMMIT) on crash recovery, without user intervention. If this can have been the case in the tests so far, then the bug could be the result of the MySQL binlog and the InnoDB transactions getting out of sync. I am thinking of a scenario where mysqld is killed and restarted after very short uptime.

from the data currently in the bug, binlog is a prerequisite for the failure.
I don't see the correlation with other potentially related parameters such as
binlog format, sync_binlog, transaction isolation.