MySQL Bugs: #54854: Can't find good position for replication break between DDL statements

Bug #54854	Can't find good position for replication break between DDL statements
Submitted:	28 Jun 2010 7:40	Modified:	18 Apr 2012 13:38
Reporter:	Hartmut Holzgraefe	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Cluster: Replication	Severity:	S3 (Non-critical)
Version:	mysql-5.1-telco-7.0	OS:	Linux
Assigned to:	Martin Zaun	CPU Architecture:	Any
Tags:	mysql-cluster-7.0.10

Description:
Cluster replication 'epochs' are tied to DML operations only, no epochs are generated for DDL statements. So if a replication master dies between two DML
statements it is not possible to retrieve the exact log position to continue replication from on the 2nd replication channels master. It is only possible to find the end position for the last DML operation before and the first DML operation after a block of DDL operations.

How to repeat:
Perform the following sequence of statements and kill the primary replication channel master in between the two DML statements:

BEGIN;
INSERT INTO cluster_table ...
COMMIT;

CREATE TABLE ...;
-- kill primary channel master here
CREATE TABLE ...;

BEGIN;
INSERT INTO cluster_table ...
COMMIT;

The ndb_binlog_index table will now show epoch entries for both INSERT transactions but none for the CREATE TABLE statements. Replication should fail over to the secondary channel master at the log position of the 2nd CREATE statement, this can't be retrieved using ndb_apply_status/ndb_binlog_index though.

Suggested fix:
Generate extra epochs for each DDL statement so that the exact position to continue from can always be determined

Workarounds:

- start replication at end position of last known applied epoch, skip over failed DDL statements 

- only perform DDL operations on the mysql server that is the master for the active replication channel

The replication channel fail over procedure as documented in 

  http://dev.mysql.com/doc/mysql-cluster-excerpt/5.1/en/mysql-cluster-replication-failover.h...

will always lead to DDL operations that happened right after the last applied transaction being skipped.

Bug #22296 was marked as duplicate of this one.

Fixed in 7.0.33/7.1.22/7.2.6.

Documented fix as follows in the NDB-7.0.33, 7.1.22, and 7.2.6 changelogs:

        DDL statements could sometimes be missed during replication
        channel cutover, due to the fact that there may not be any
        epochs following the last applied epoch when the slave is up to
        date and no new epoch has been finalized on the master. Because
        epochs are not consecutively numbered, there may be a gap
        between the last applied epoch and the next epoch; thus it is
        not possible to determine the number assigned to the next epoch.
        This meant that, if the new master did not have all epochs, it
        was possible for those epochs containing only DDL statements to
        be skipped over.

        The fix for this problem includes modifications to mysqld binary
        logging code so that the next position in the binary log
        following the COMMIT event at the end of an epoch transaction,
        as well as the addition of next_file and next_position columns
        to the mysql.ndb_binlog_index table. In addition, a new
        replication channel cutover mechanism is defined that employs
        these new columns. To make use of the new cutover mechanism, it
        is necessary to modify the query previously used to obtain the
        start point; in addition, to simplify prevention of possible
        errors caused by duplication of DDL statements, a new shorthand
        value ddl_exist_errors is implemented for use with the mysqld
        option --slave-skip-errors.

        For more information, see "Implementing Failover with MySQL
        Cluster Replication".

        Note that the existing replication channel cutover mechanism
        continues to function as before, including the same limitations
        described previously.

Also updated MySQL Cluster Replication and mysqld server option
documentation.

Closed.