Bug #88595 Row-based master-master replication broken by add column or drop column
Submitted: 21 Nov 2017 22:12 Modified: 31 Jan 15:58
Reporter: Eric Rasmussen Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Row Based Replication ( RBR ) Severity:S2 (Serious)
Version:5.6.37 OS:Any
Assigned to: CPU Architecture:Any

[21 Nov 2017 22:12] Eric Rasmussen
Description:
When performing a schema update that adds or drops a column in the MIDDLE of a table, it can break row-based Master-Master replication.  I didn't find this documented anywhere, so hence the ticket.

This is related to, but different than, https://bugs.mysql.com/bug.php?id=74927 and https://bugs.mysql.com/bug.php?id=74927.  

In this case, there is no misalignment between the Master-Master servers in terms of MySQL versions, OS, storage engines, replicated tables, or character sets.  We are running MySQL 5.6.37 on CentOS with utf8 charset and InnoDB tables on both servers in a master-master configuration using row-based replication.

This is an example of the error reported by show slave status when this condition occurs:

Last_Errno: 1677
Last_Error: Column 74 of table 'db.table' cannot be converted from type 'int' to type 'tinyint(1) unsigned'

The problem is exacerbated when alter table operations take a long time, such as due to large table size, and/or when there are network delays in replication between the two Master servers.  The trigger for replication failure (or corruption) is a record in the affected table being updated on the 2nd Master before it has completed applying the alter table statement received from the 1st Master.

The root cause appears the design of the row-based replication format, which propagates data by column order rather than by column name.  In the case of traditional Master->Slave replication, existing RBR is safe because the single-threaded behavior of the binlog prevents misaligned data from propagating.  However, it is fragile in a Master-Master setup.

There is a workaround, but it limits table design, so I'm really hoping you can fix this.

How to repeat:
Scenario 1 - Alter Table with ALGORITHM=INPLACE (Default)
----------
Master 1 - Add column to middle of table1 / Drop column from middle of table1
Master 1 - Alter Table completes, command is added to Master Binary Log
Master 2 - Receives Alter Table command, starts executing
Master 2 - While Alter Table is running, update record in table1 -> record update added to Master Binary Log
Master 1 - Master 1 receives record update with legacy column order -> replication stops with error 1677 if column types don't align
           (or worse) Master 1 receives record update with legacy column order -> data silently written to wrong columns

Scenario 2 - ALTER TABLE with ALGORITHM=COPY
----------
Master 1 - Add column to middle of table1 / Drop column from middle of table1
(Network lag happens between Master 2 -> Master 1 before Alter Table command completes)
Master 1 - Alter Table completes, command is added to Master Binary Log
Master 2 - Update record in table1
(Network lag resolved)
Master 2 - Receives Alter Table command, starts executing
Master 1 - Master 1 receives record update with legacy column order -> replication stops with error 1677 if column types don't align
           (or worse) Master 1 receives record update with legacy column order -> data silently written to wrong columns

Suggested fix:
Add a new option for binlog_row_image that includes column names in the master binlog, and update the slave SQL thread to use column names when available.

Workaround - Add/Drop columns ONLY at the END of the table
--------
Master 1 - Add column to END of table1 / Drop column from END of table1
Master 1 - Alter Table completes, command is added to Master Binary Log
Master 2 - Receives Alter Table command, starts executing
Master 2 - While Alter Table is running, update record in table1 -> record update added to Master Binary Log
Master 1 - Master 1 receives record update with one missing / one extra column at END of table -> update applied with no error

Reason Workaround works
------
Row-based replication logs data by column order, NOT by column name, but the slave SQL thread seems to silently ignore extra/missing columns at the end of the RBR parameters.
[22 Dec 2017 11:42] Bogdan Kecman
Hi,

Not sure if this is a replication bug or documentation bug (that it's not documented that we have this limitation), but I'll let replication team sort it out, in any how the bug is verified.

all best
Bogdan
[31 Jan 15:58] Eric Rasmussen
I recently discovered the following statement while looking at the known limitations of Group Replication (a feature not available until MySQL 5.7):

https://dev.mysql.com/doc/refman/5.7/en/group-replication-limitations.html

"Concurrent DDL versus DML Operations.  Concurrent data definition statements and data manipulation statements executing against the same object but on different servers is not supported when using multi-primary mode. During execution of Data Definition Language (DDL) statements on an object, executing concurrent Data Manipulation Language (DML) on the same object but on a different server instance has the risk of conflicting DDL executing on different instances not being detected."

That appears to accurately describe the issue encountered here.  I suggest this be added to these pages (and similar for older versions):

https://dev.mysql.com/doc/refman/5.7/en/replication-sbr-rbr.html

https://dev.mysql.com/doc/refman/5.6/en/replication-sbr-rbr.html