MySQL Bugs: #29020: Event results not correctly replicated to slave in RBR

Bug #29020	Event results not correctly replicated to slave in RBR
Submitted:	11 Jun 2007 16:13	Modified:	31 Mar 2008 19:09
Reporter:	Chuck Bell	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S2 (Serious)
Version:	5.1.20	OS:	Any
Assigned to:	Mats Kindahl	CPU Architecture:	Any

Description:
Events executed during testing occassionally do not replicate to the slave. The test rpl_invoked_features includes two events (e1, e11) that are executed every 1 second. The test includes the block:

SET GLOBAL EVENT_SCHEDULER = on;
# Wait 2 sec while events will executed
--sleep 2
SET GLOBAL EVENT_SCHEDULER = off;

Which should allow the events to fire twice. However, occassionally one of the events fires on the master but the result is not replicated to the slave. This causes the later portion of the test to fail on the diff. 

--exec $MYSQL_DUMP --compact --order-by-primary --skip-extended-insert --no-create-info test > $MYSQLTEST_VARDIR/tmp/rpl_invoked_features_master.sql
--exec $MYSQL_DUMP_SLAVE --compact --order-by-primary --skip-extended-insert --no-create-info test > $MYSQLTEST_VARDIR/tmp/rpl_invoked_features_slave.sql
--diff_files $MYSQLTEST_VARDIR/tmp/rpl_invoked_features_master.sql $MYSQLTEST_VARDIR/tmp/rpl_invoked_features_slave.sql

In this case, the files in /var/tmp will show the master to have one more row in the table than the slave (from either e1 or e11).

How to repeat:
Run the test rpl_invoked_features in a loop until the test fails.

Suggested fix:
While not a solution, the search should start by answering the question: What is causing the event results to not be written to the binary log?

Problem has been isolated to interleaved writing of binary log on master. This is causing the master to generate events that are out of order which in turn can cause the slave to skip rows or crash. Cause is still under investigation. Need more help from Mats. Plan to talk with Mats when he returns from vacation.

Problem has been isolated to a possibly a premature flush of the binary log on the master. The master can produce the following invalid sequence:

# at 960
# at 1006
# at 1051
#786 10:43:10 server id 1  end_log_pos 1006 
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags 
#      3c0 7e 33 b7 46   13   01 00 00 00   2e 00 00 00   ee 03 00 00   00 00
#      3d3 11 00 00 00 00 00 00 00  04 74 65 73 74 00 03 74 |.........test..t|
#      3e3 31 31 00 03 03 03 0f 02  40 00 06  |11.........|
# 	Table_map: `test`.`t11` mapped to number 17
#786 10:43:10 server id 1  end_log_pos 1051 
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags 
#      3ee 7e 33 b7 46   13   01 00 00 00   2d 00 00 00   1b 04 00 00   00 00
#      401 10 00 00 00 00 00 00 00  04 74 65 73 74 00 02 74 |.........test..t|
#      411 31 00 03 03 03 0f 02 40  00 06  |1.........|
# 	Table_map: `test`.`t1` mapped to number 16
#786 10:43:10 server id 1  end_log_pos 1090 
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags 
#      41b 7e 33 b7 46   17   01 00 00 00   27 00 00 00   42 04 00 00   10 00
#      42e 11 00 00 00 00 00 01 00  03 ff f8 0a 00 00 00 0a |................|
#      43e 00 00 00 00  |....|
# 	Write_rows: table id 17 flags: STMT_END_F

BINLOG '
fjO3RhMBAAAALgAAAO4DAAAAABEAAAAAAAAABHRlc3QAA3QxMQADAwMPAkAABg==
fjO3RhMBAAAALQAAABsEAAAAABAAAAAAAAAABHRlc3QAAnQxAAMDAw8CQAAG
fjO3RhcBAAAAJwAAAEIEAAAQABEAAAAAAAEAA//4CgAAAAoAAAAA
'/*!*/;
# at 1090
#786 10:43:10 server id 1  end_log_pos 1129 
# Position  Timestamp   Type   Master ID        Size      Master Pos    Flags 
#      442 7e 33 b7 46   17   01 00 00 00   27 00 00 00   69 04 00 00   10 00
#      455 10 00 00 00 00 00 01 00  03 ff f8 0a 00 00 00 0a |................|
#      465 00 00 00 00  |....|
# 	Write_rows: table id 16 flags: STMT_END_F

BINLOG '
fjO3RhcBAAAAJwAAAGkEAAAQABAAAAAAAAEAA//4CgAAAAoAAAAA
'/*!*/;

When this sequence is read by the slave, the second write_row event is not executed because the matching table map has been deleted.

RPD: REFINED PROBLEM DESCRIPTION
================================
A non-transactional statement is, in RBR, translated into a table map
event followed by one or more row events.  The writing of these events
to binlog is not atomic.  This makes it possible for another thread
to interleave its events with these events.  The interleaved binlog is
not interpretable by the slave (the slave discards all table maps when
it gets a STMT_END_F).

POSSIBLE SOLUTIONS
==================
1. KEEP MUTEX 

   Add parameter to MYSQL_BIN_LOG::write() function to that it does
   not unlock the mutex.  Keep the mutex until STMT_END_F.  (Also
   need a parameter to say that the lock should not be taken.)

   NEGATIVE:
   This could lock the binlog for a long time if the statement
   takes long to execute.

   *Not acceptable to lock binlog while something is executing*

2. KEEP MUTEX, SPLIT STATEMENTS INTO MULTIPLE BINLOG GROUPS

   We could take alternative 1 and add STMT_END_F on *every* row event
   to ensure that the statement does not lock the binlog for a very
   long time.  This implies that we need a table map written per row
   event.

   NEGATIVE:
   This also implies that a statement is split into groups that can be
   separately executed in the slave.  This effectively makes 
   non-transaction tables non-atomic and *non-isolatable* in the binlog.
   Note that MyISAM do support *isolation* via locks.

   *Probably not acceptable to break isolation for non-trx tables*

3. Whenever we do a table map, we pretend that this is a transactional
   event (set a flag) and thus this event and coming row events will
   be cached in transactional cache.

   NEGATIVE:
   Assume that you execute:

   t1: BEGIN
   t1: INSERT into trx-table ...
   t1: INSERT into non-trx-table ...  (*)
   t2: INSERT into non-trx-table ...  (**)
   t2: COMMIT
   t1: COMMIT

   Then t2 will be logged before t1 even though (*) was executed before (**).

   (Btw, this problem is already mentioned in BUG#28976 and WL#2687.)

SELECTED SOLUTION
=================
For 5.1.22, we will most likely use the 3rd solution described above.

ARCHITECTURAL PRINCIPLES FOR FUTURE (ROUGH IDEAS)
=================================================
In the future (5.2 or later), we need to refactor code to make this
easier to handle (and to implement WL#2687).  Probably split code 
into two layers:

1. CACHING LAYER
----------------
Caching of caching unit (transactions, could possibly be statements
also) in threads.  Needs to somehow handle nested transactions in two
levels -- transactions and non-trx statements.

Uses Binlog layer interface.

2. BINLOG LAYER
---------------
Binlog only responsible for storing groups of events ("grouping" is to
be done at higher architectural level).  Interface "only" has functions:
- start group, 
- end group
- add event

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/33159

ChangeSet@1.2564, 2007-08-27 18:44:13+04:00, skozlov@mysql.com +3 -0
  WL#3694 Replication of Invocation and Invoked Features Testing.
  Bug#29020 Event results not correctly replicated to slave in RBR: partially.

I was working on two ideas. One a rehash of my original attempt and the other a meld of ideas from you and Andrei.

Idea 1
------
Use a thread local variable as a switch for all of the logging methods that take and release locks on LOCK_log. Let's call the variable thd->use_binlog_locks. The default value for this variable is "TRUE" so the methods will by default work as originally coded. However, if we detect a table map being created for a non-transactional operation we set the thread local variable to "FALSE" and take the lock on LOCK_log. We then release it after the row is written setting the variable back to "FALSE." This would in effect extend the mutex to cover the table map and write row event. But it would also make for a (relatively) long lock.

Idea 2
------
Use a thread local variable to hold the table map until the write row is generated. That is, detect when we need to delay writing the table map, store it, then write it later when the write row is generated. This would inside the existing pthread_mutex_lock(&LOCK_log) / pthread_mutex_unlock(&LOCK_log) for writing the write row event. A thread local variable was used to tell the code that there was a saved table map to write.

I was in the middle of doing this experiment so I have it partially working. The one last problem to solve was trying to figure out the best way to save and the table map event. I tried originally to save a pointer to the table map event, but this led to conflicts later on in the mysql_bin_log.write(ev) method. It seemed that something was being overwritten and/or reset. The write(ev) call to write the saved table map worked, but I was getting memory access violations on the write(ev) call to write the write row event. Another method was simply delaying the call to write_table_map() but that meant storing or gaining access to the table. I also considered whether we need to save multiple table maps in which case we would need a list of some sort.

Random thoughts
---------------
I wonder if the interleave condition can happen on a single processor CPU or if it is far less likely to occur? Both of my machines are dual core processors.

Sometimes one can run the 100 iterations of rpl_bug_29020 (as I've written it) and the code will not interleave for some time (on at least one occasion not at all). Other times it interleaves on the first couple of passes.

Linux and Windows behave differently under the rpl_bug_29020 test. It seems Windows is more sensitive to interleaving that Linux. ...or that just may be coincidence or plain bad luck.

Serge has modified the rpl_invoked_features test to work with the old code (see http://lists.mysql.com/commits/33158). We need to ensure the solution works with his patch and that if there are any additional changes (like restoring the test to execute the events asynchronously) to the test be made in this patch.

With comments in review mail about "exec rm" => "remove_file"

The fix for this bug requires that BUG#12713 is fixed (in order to avoid duplicate work), hence this bug is put in verified state awaiting fix for BUG#12713.

Pushed into 5.1.23-rc

Pushed into 6.0.4-alpha

This bug couldn't be fixed until bug#12713 was fixed.
That bug is queued in the Runtime tree, with a push
imminent.

Setting ETA based on Mats' comment to Trusy:
"Once the patch for BUG#12713 is in the replication tree, 
it takes one to two days to fix the bug.

The problem, as stated, is in lack of caching of row-based events on a not-transaction table. In this case, the table map can be flushed separately from the following Rows_log events.
To follow up RPD, and its p.3 "Whenever we do a table map, we pretend that this is a transactional event", i think there is no need "to pretend" as indeed the table map and the rest of Rows_log events are a form of transaction (ta) regardless of the table type.
It should be safe (to the limits set by bug#28976) and effective (eliminates the current issue) to exploit the existing transaction cache in either case of

1. a single query on a non-ta table generates the table map and the rest Rows_log events;
2. a mixed multi-statement ta executes a query on a non-ta table.

I.e regardless of a table type, the table map should be cached first as well as the following events, and the cache is flushed at commit time.

Notice that in case of 2. there is an outstanding
Bug#28976 Mixing trans and non-trans tables in one transaction results in incorrect binlog.
The latter is of a general character including statement based
replication. Imo that problem stands individually and implementing the 3. of RPD does not make things any worse in that regard.
In my understanding the NEGATIVE example - which is supposed to be about two
concurrent multi-statement ta:s modified a not-ta table in reverse to commit order - can not oppose to the idea to piggyback on the existing transaction cache for table map and events on a non-ta table.
The "contra" example is actually an example of the bug#28976's problem - which is in my opinion unsolvable (welcome to the bug page to follow with discussion).

Getting back to exploiting the transaction cache.

As there is a good option to resolve bug#28976 with immediate binlogging
of row-based events wl#2687, the thing to do wrt to the current bug could be to introduce a specific row-based-event-on-non-ta-table cache, instead of the reuse, which solves the two bugs issues.

The property of such cache:

1. it lists row-based events of a non-ta query starting from the Table_map
2. it is flushed to the binlog and reset at the end of the query (e.g thd->binlog_event).

provides with the atomicity (this bug#29020 problem) and a solution of a causality issue of Bug#28976.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44375

ChangeSet@1.2559, 2008-03-25 10:39:27+01:00, mats@mats-laptop.(none) +29 -0
  BUG#29020 (Event results not correctly replicated to slave in RBR):
  
  Bug is fixed by writing non-transactional events to the transaction cache and
  flushing the cache to the binary log at statement commit. To mimic the behavior
  of normal statement-based replication, we flush the transaction cache in row-
  based mode when there is no committed statements in the transaction cache,
  which means we are committing the first one. This means that it will be written
  to the binary log as a "mini-transaction" with just the rows for the statement.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44577

ChangeSet@1.2559, 2008-03-28 12:30:28+01:00, mats@mats-laptop.(none) +33 -0
  BUG#29020 (Event results not correctly replicated to slave in RBR):
  
  The bug allow multiple executing transactions working with non-transactional
  to interfere with each others by interleaving the events of different trans-
  actions.
  
  Bug is fixed by writing non-transactional events to the transaction cache and
  flushing the cache to the binary log at statement commit. To mimic the behavior
  of normal statement-based replication, we flush the transaction cache in row-
  based mode when there is no committed statements in the transaction cache,
  which means we are committing the first one. This means that it will be written
  to the binary log as a "mini-transaction" with just the rows for the statement.
  
  Note that the changes here does not take effect when building the server with
  HAVE_TRANSACTIONS set to false, but it is not clear if this was possible before
  this patch either.
  
  In addition, the code now always generate a BEGIN/COMMIT pair for transactions
  (or BEGIN/ROLLBACK pair in the case of non-transactional changes in a statement
  that was rolled back).

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44578

ChangeSet@1.2559, 2008-03-28 13:16:41+01:00, mats@mats-laptop.(none) +33 -0
  BUG#29020 (Event results not correctly replicated to slave in RBR):
  
  The bug allow multiple executing transactions working with non-transactional
  to interfere with each others by interleaving the events of different trans-
  actions.
  
  Bug is fixed by writing non-transactional events to the transaction cache and
  flushing the cache to the binary log at statement commit. To mimic the behavior
  of normal statement-based replication, we flush the transaction cache in row-
  based mode when there is no committed statements in the transaction cache,
  which means we are committing the first one. This means that it will be written
  to the binary log as a "mini-transaction" with just the rows for the statement.
  
  Note that the changes here does not take effect when building the server with
  HAVE_TRANSACTIONS set to false, but it is not clear if this was possible before
  this patch either.
  
  For row-based logging, we also have that when AUTOCOMMIT=1, the code now always
  generates a BEGIN/COMMIT pair for single statements, or BEGIN/ROLLBACK pair in the
  case of non-transactional changes in a statement that was rolled back. Note that
  for the case where changes to a non-transactional table causes a rollback due
  to error, the statement will now be logged with a BEGIN/ROLLBACK pair, even
  though some changes has been committed to the non-transactional table.

Approved.

Bugfix documented in the 5.1.23-ndb-6.3.11 changelog as follows:

        The effects of scheduled events were not always correctly reproduced on
        the slave when using row-based replication.

Left status as Patch Queued pending merges to mainline trees.

Pushed into 5.1.24-rc

Also noted in in 5.1.24 changelog.

Pushed into 6.0.5-alpha

Also noted in 6.0.5 changelog.