Bug #6522 Replication fails due to a rolled back transaction in the binlog
Submitted: 9 Nov 2004 15:00 Modified: 10 Nov 2004 15:15
Reporter: Tony Little
Status: Closed
Category:Server: Replication Severity:S2 (Serious)
Version:4.0.18 OS:Linux (linux)
Assigned to: Guilhem Bichot Target Version:

[9 Nov 2004 15:00] Tony Little
Description:
On the master the BEGIN statement fails and is written to the binlog with an error.  All
subsequent queries that are a part of the transaction are then written to the binlog with
no errors and the ROLLBACK is written to the binlog with an error.  Both the BEGIN and
ROLLBACK statements are written with an error but all in between are written to the
binlog with no error.  The slaves then all fail with an error message like:

query partially completed on the master (error on master: 1158) and was aborted. There is
a chance that your master is inconsistent at this point. If you are sure that your master
is ok, run this query manually on the slave and then restart the slave with SET GLOBAL
SQL_SLAVE_SKIP_COUNTER=1; START SLAVE; . Query: 'BEGIN'

The problem is if a transaction fails shouldn't the entire seqence be left out of the
binlog.  This has happened twice to our server and it causes all replication servers to
fail as a result.  The only way to recover is to start the slaves back up after changing
their master log position using CHANGE MASTER TO to the query following the ROLLBACK that
failed.

None if the queries between the bad BEGIN and ROLLBACK are added to the master database
and it continues to function as normal.

How to repeat:
I am not sure of the exact sequence that causes this to happen.  Obviously the BEGIN is
failing on the master.  I don't know the reason.

Suggested fix:
Don't write any part of a failed transaction to the binlog.
[9 Nov 2004 16:12] Guilhem Bichot
Hi,
If BEGIN/ROLLBACK got into the binlog, it's likely that one table updated by the
transaction was _not_ a transactional table. You can check this by typing SHOW TABLE
STATUS LIKE 'each_table_of_the_transaction'; what do you get?
Error 1158 means that there was interruption of the client/server communication.
So, what probably happened is that the connection started a transaction, updated some
non-transactional table (like MyISAM, HEAP, MERGE...), then communication got interrupted
so the MySQL database server had to rollback. But as a non-transactional table was
updated, not writing anything to the binlog would not reflect those updates; those are
better reflected by logging BEGIN; transaction; ROLLBACK;
Now you are right, there is no reason 1158 found its way into binlog; as server got the
error after executing the updates, logged error should be 0 (no error). I'm fixing this
now.
[9 Nov 2004 16:38] Tony Little
I double checked the entire transaction and all tables used in the queries are INNODB.  No
matter what, setting the error to 0 for the BEGIN and ROLLBACK in the binlog would take
care of the issue.
[10 Nov 2004 15:15] Guilhem Bichot
Ah, tables are InnoDB... I'd be interested in seeing the output of mysqlbinlog (one
BEGIN/ROLLBACK block), and in knowing if any involved table was created with the
TEMPORARY word (CREATE TEMPORARY TABLE).
I just changed things so that an error code of 0 is logged for BEGIN/ROLLBACK/COMMIT, in
the next 4.0.
Thank you for your bug report.
ChangeSet@1.2070.1.1, 2004-11-10 15:07:55+01:00, guilhem@mysql.com