Bug #33849 COMMIT event missing in cluster circular replication.
Submitted: 13 Jan 2008 17:40
Modified: 27 Jun 2008 10:31
Reporter: Konstantin Osipov (OCA)
Status: Closed
Category: MySQL Cluster: Replication
Severity: S3 (Non-critical)
Version:
OS: Any
Assigned to: Andrei Elkin
CPU Architecture: Any

[13 Jan 2008 17:40] Konstantin Osipov
Sometimes cluster replication fails to insert the statement-end COMMIT event when replicating circularly to the original master.
This does not cause any user-visible change in behavior, at least not immediately,
but it is an internal bug and violates an internal invariant.
Setting to 'Verified' right away, since I'm going to disable rpl_ndb_circular
and rpl_ndb_circular_simplex after reporting this bug.

The bug looks very similar to Bug#25688.

How to repeat:
Patch the server this way:
===== sql_parse.cc 1.714 vs edited =====
--- 1.714/sql/sql_parse.cc      2007-12-17 16:28:05 +03:00
+++ edited/sql_parse.cc 2007-12-24 05:12:58 +03:00
@@ -3737,6 +3737,7 @@ end_with_restore_list:
+    DBUG_ASSERT(thd->lock == NULL);
     if (end_trans(thd, lex->tx_release ? COMMIT_RELEASE :
                               lex->tx_chain ? COMMIT_AND_CHAIN : COMMIT))
       goto error;

Run either rpl_ndb_circular or rpl_ndb_circular_simplex -- both fail this assert sporadically.

Suggested fix:
What I was able to establish so far is:

 * the failure happens on the master, in the slave SQL thread,
   when it is executing the binary log that has records injected
   from the slave through the ndb injector thread. The gdb window is master_0.
 * Write_rows_log_event doesn't close ndb_apply_status and t1
   tables because STMT_END_F is missing from the last Write_rows_log_event
 * how it happens: since the flag is missing,
   Rows_log_event::do_update_pos does not call rli->cleanup_context()
 * this leads to tables being left open and locked for the next binlog event,
   which happens to be Query_log_event("COMMIT");
 * since there is a close_thread_tables(thd) call at the end of
   Query_log_event::do_apply_event, the bug used to go unnoticed,
   because any Query_log_event that doesn't use tables would
   implicitly close any tables left over from the previous event
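
The chain of events above can be illustrated with a small toy model. This is only a sketch, not server code: ToyThread, ToyEvent and apply_events are hypothetical stand-ins invented for illustration, and the only thing modelled is whether tables are still held when the next event starts. It assumes that a rows event opens and locks its tables, that cleanup runs only when the STMT_END_F flag is set, and that a query event closes whatever tables are left when it finishes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the applier thread state.
// tables_locked == true plays the role of thd->lock != NULL.
struct ToyThread {
    bool tables_locked = false;
};

// Hypothetical stand-in for a binlog event.
struct ToyEvent {
    enum Kind { WriteRows, Query };
    Kind kind;
    bool stmt_end;     // models STMT_END_F; only meaningful for WriteRows
    std::string text;  // query text; only meaningful for Query
};

// Models the cleanup that runs after a rows event carrying STMT_END_F.
inline void cleanup_context(ToyThread& thd) { thd.tables_locked = false; }

// Applies the events in order. Returns the indices of COMMIT events that
// started while tables were still held, i.e. the points where the
// assertion added by the patch would fire.
std::vector<std::size_t> apply_events(ToyThread& thd,
                                      const std::vector<ToyEvent>& events)
{
    std::vector<std::size_t> violations;
    for (std::size_t i = 0; i < events.size(); ++i) {
        const ToyEvent& ev = events[i];
        if (ev.kind == ToyEvent::WriteRows) {
            thd.tables_locked = true;   // rows event opens and locks its tables
            if (ev.stmt_end)
                cleanup_context(thd);   // skipped when the flag is missing
        } else {
            if (ev.text == "COMMIT" && thd.tables_locked)
                violations.push_back(i);
            // Models close_thread_tables(thd) at the end of the query event:
            // the leftover tables are closed, which hides the problem.
            thd.tables_locked = false;
        }
    }
    return violations;
}
```

Feeding the model a WriteRows event without the flag followed by a "COMMIT" query reports one violation at the COMMIT, yet leaves no tables held afterwards, which is why nothing was noticed until the assertion was added.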

[27 Jun 2008 10:31] Konstantin Osipov
The bug was fixed by the fix for Bug#36197.

[27 Jun 2008 12:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:


2656 Konstantin Osipov	2008-06-27
      As per agreement with Tomas and Danny, move some rpl_ndb tests
      to the rpl_ndb_big test suite. This test suite is not part of the suite that
      is run by default, but will be run by pushbuild in the rpl, ndb and
      telco trees, and also on some fast hosts in team trees.
      rpl_ndb_big takes half an hour to run.
      Enable rpl_ndb_circular and rpl_ndb_circular_simplex while
      we are at it -- Bug#33849 was fixed.