Bug #23171 Illegal slave restart (group) position
Submitted: 11 Oct 2006 10:09 Modified: 12 Apr 2007 16:04
Reporter: Lars Thalmann Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S3 (Non-critical)
Version:5.1 OS:Any
Assigned to: Mats Kindahl
Tags: bfsm_2007_03_01, sr5_1

[11 Oct 2006 10:09] Lars Thalmann
Description:
PROBLEMS
--------

1. If some events are skipped due to slave_skip_events, the group
   position is changed.  This can create an illegal group position.

2. If many skipped events (due to same server id of event and slave),
   then with the fix of BUG#21474, we need to reskip all of these events.
   This can make the relay log long and also consumes some time when the
   slave server is restarted.

SOME BACKGROUND
---------------

There are three types of execution groups in the slaves:

- BEGIN-COMMIT transactions
- Table maps followed by rows events
- Intvar events followed by the statement it affects

At all times, the group position (the position stored in
relay_log.info) must be correct so that a restarted slave can start
from that position.

How to repeat:
Code review

Suggested fix:
1. Change group position to updated for skipped events in the 
   same way as for normal events.
2. Make sure that group position is not updated to often.
[12 Oct 2006 15:59] Lars Thalmann
See also BUG#12691
[12 Oct 2006 17:25] Guilhem Bichot
Lars,
this comment is wrong, but the code is correct. If the BEGIN is in an old relay log, the group position will point to this old relay log so it will not be purged until the COMMIT is executed.
[25 Oct 2006 5:04] Wei Li
I found the following situation today and hope the solution can also address the issue.

In 5.1, if the slave does this in auto commit mode:
insert into t1 values(110);
insert into t2 values(120);

Mysql will generate the following master-log:
# at 1549
#061024 15:34:07 server id 1  end_log_pos 90    Query   thread_id=177
 exec_time=0     error_code=0
SET TIMESTAMP=1161729247;
insert into t1 values (110);
# at 1639
#061024 15:34:07 server id 1  end_log_pos 1666  Xid = 896
COMMIT;
# at 1666
#061024 16:22:46 server id 1  end_log_pos 90    Query   thread_id=177
 exec_time=0     error_code=0
SET TIMESTAMP=1161732166;
insert into t1 values (120);
# at 1756
#061024 16:22:46 server id 1  end_log_pos 1783  Xid = 897
COMMIT;

The slave execution sequence will be:
insert ...
commit ...
insert ...
commit ...

What happens is that the begin statement is missing so that insert
event does not have the correct master-log position.  If insert event
is executed, Log_event::exec_event() will be called.  Because BEGIN
bit is not set, flush_relay_log_info(rli) will always be called.  So,
mysql 5.1 is writing the incorrect master_log_position into the
relay-log.info.
[1 Nov 2006 14:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/14688

ChangeSet@1.2310, 2006-11-01 15:35:37+01:00, mats@romeo.(none) +3 -0
  BUG#23171 (Illegal slave restart group position):
  First patch preparing for restructuring the event execution and event
  skipping. This patch renames the existing (virtual) function exec_event()
  to be a primitive for implementing the real patch. Also, the virtual function
  advance_coord_impl() is added to advance the binary/relay log coordinates,
  and two non-virtual functions exec_event() [sic] and skip_event() is added
  that will contain the business logic for executing and skipping events
  respectively.
[10 Nov 2006 14:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/15135

ChangeSet@1.2337, 2006-11-10 15:10:41+01:00, mats@romeo.(none) +8 -0
  BUG#23171 (Illegal slave restart group position):
  Second patch to fix skipping code. Moving relay and binary log 
  position changing code from do_apply_event [old exec_event()] into
  do_update_pos() and doing other changes necessary to support that.
  
  Fixing a bug that can cause deadlock if rotating binary log when committing
  a changes to a transactional table that is not inside a transaction and
  cause a rotate log.
[17 Jan 2007 14:07] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/18260

ChangeSet@1.2338, 2007-01-17 15:06:37+01:00, mats@romeo.(none) +6 -0
  BUG#23171 (Illegal slave restart position):
  Third patch of the bug fix where the code for skipping events and for
  executing events is factored out into three functions:
  - shall_skip() to decide if the event shall be skipped and the
    reason for it;
  - do_apply_event(), where the event is applied to the database; and
  - do_update_pos(), which updates the actual relay log position and
    group positions.
[18 Jan 2007 8:11] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/18312

ChangeSet@1.2339, 2007-01-18 09:11:12+01:00, mats@romeo.(none) +1 -0
  BUG#23171 (Illegal slave restart group position):
  Tail patch: some missing changes that BitKeeper didn't find until
  the merge.
[6 Feb 2007 6:41] Wei Li
I found the patch is only available in 5.1.  I wonder whether it will be patched to 5.0.  Still, I believe the fix should be in the master side.

Thanks,
[6 Feb 2007 6:45] Wei Li
This the binlog event from rpl_loaddata.test in Mysql-5.0.30:

#070205 20:10:41 server id 1  end_log_pos 1684  Query   thread_id=2     exec_time=0     error_code=0
SET TIMESTAMP=1170735041;
create table t1(a int, b int, unique(b));
# at 1684
#070205 20:10:41 server id 2  end_log_pos 90    Query   thread_id=1     exec_time=0     error_code=0
SET TIMESTAMP=1170735041;
insert into t1 values(1,10);
# at 1774
#070205 20:10:41 server id 2  end_log_pos 1801  Xid = 23
COMMIT;

Please notice that the "end_log_pos 90" is wrong and the fix should be in the master side IMHO.
[19 Feb 2007 17:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/20074

ChangeSet@1.2406, 2007-02-19 11:12:22+01:00, mats@romeo.(none) +8 -0
  BUG#23171 (Illegal slave restart group position):
  Fixing code that ignores Intvar and Uservar events.
  Fixing code that ignores events when the server id of the event is
  identical to the server id of the server.
  Fixing problems with mysqlbinlog use of new framework.
  Introducing relay-log info specific version of replicate_same_server_id.
[20 Feb 2007 15:46] Rafal Somla
Good to push.
[20 Feb 2007 17:57] Jonathan Miller
Hi,

Wrote a couple of test cases for this today. Looking at the difference between InnoDB and MyISAM, seems that issuing the  SQL_SLAVE_SKIP_COUNTER does not move pass the transaction in InnoDB, but does for MyISAM. This issues with that is that if there is a trigger then the data in the affected table is still updated. 

Files are attached for developer use.

/Jeb
[20 Feb 2007 17:58] Jonathan Miller
Test file

Attachment: rpl_test.test (application/octet-stream, text), 4.19 KiB.

[20 Feb 2007 17:58] Jonathan Miller
innodb reject file

Attachment: rpl_test_innodb.reject (application/octet-stream, text), 7.57 KiB.

[20 Feb 2007 17:58] Jonathan Miller
MyISAM Reject

Attachment: rpl_test_myisam.reject (application/octet-stream, text), 7.36 KiB.

[22 Feb 2007 12:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/20351

ChangeSet@1.2409, 2007-02-22 13:15:19+01:00, mats@romeo.(none) +5 -0
  BUG#23171 (Illegal group log position):
  Adding test cases to verify the behavior.
[1 Mar 2007 16:10] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/20919

ChangeSet@1.2410, 2007-03-01 17:09:33+01:00, mats@romeo.(none) +7 -0
  BUG#23171 (Illegal group log position):
  Moving body of exec_event() into exec_relay_log_event().
  Making shall_skip()/apply_event()/update_pos() inteface public.
[5 Mar 2007 11:01] Rafal Somla
Good to push.
[5 Mar 2007 16:34] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/21165

ChangeSet@1.2411, 2007-03-05 17:32:48+01:00, mats@romeo.(none) +1 -0
  BUG#23171 (Illegal group log position):
  Improving function comment.
[22 Mar 2007 7:33] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22561

ChangeSet@1.2339, 2007-03-22 08:32:41+01:00, mats@romeo.(none) +6 -0
  BUG#23171: Illegal group log position
  
  Tail fixes after re-applying patches to older version of clone.
[22 Mar 2007 16:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/22659

ChangeSet@1.2478, 2007-03-22 16:59:55+01:00, mats@romeo.(none) +2 -0
  BUG#23171: Illegal group log position
  
  Post-merge fixes.
[31 Mar 2007 23:55] Bugs System
Pushed into 5.1.18-beta
[12 Apr 2007 16:04] Paul Dubois
Noted in 5.1.18 changelog.

If a rotate event occured in the middle of a non-transaction group,
the group position would be updated by the rotate event indicating an
illegal group start position that was effectively inside a group.
This can happen if, for example, a rotate occurs between an Intvar
event and the associated Query event, or between the table map events
and the rows events when using row-based replication.