MySQL Bugs: #27808: Infinite looping in circular replication

Bug #27808	Infinite looping in circular replication
Submitted:	13 Apr 2007 13:02	Modified:	22 Oct 2008 6:44
Reporter:	Lars Thalmann
Status:	Duplicate
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.1	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
There are two cases in which events can loop forever.

1. If a server fails in circular replication
   and the user fails over the replication.

2. When using a cluster, if an event is created at
   a cluster server and is later received by the same
   cluster through another server.
   
Example for case 1
------------------
Consider the following scenario:

- Replication in circle of three servers: A->B->C->A.
- Server B fails
- User lets C replicate from A instead: A->C->A.

If, at the time B fails, an event generated by B
has arrived at A but has not yet returned to B,
then this event will loop forever in the
circle A->C->A.
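
The scenario can be illustrated with a small simulation. This is only a
sketch, not server code: it assumes the usual rule that a server discards
a replicated event only when the event carries its own server id. The
function name circulate and the max_hops cut-off are made up for this
illustration.

```python
def circulate(origin, ring, start, max_hops=12):
    """Follow one event around a replication ring.

    ring  : server names in replication order (each replicates to the
            next one, the last back to the first)
    start : the server whose binary log currently holds the event
    Returns (servers that applied the event, whether it was terminated).
    """
    applied = []
    pos = ring.index(start)
    for _ in range(max_hops):
        pos = (pos + 1) % len(ring)
        receiver = ring[pos]
        if receiver == origin:
            # Only the originator recognises the event as its own.
            return applied, True
        applied.append(receiver)
    # Nobody dropped the event; it would keep circulating.
    return applied, False


# Full ring A->B->C->A: the event from B has reached A and stops at B.
full_ring = circulate("B", ["A", "B", "C"], start="A")

# After the failover A->C->A: no remaining server has B's id.
reduced_ring = circulate("B", ["A", "C"], start="A", max_hops=6)
```

In the full ring the event stops when it returns to B. In the reduced ring
no member matches the originating id, so the walk ends only because of the
max_hops cut-off.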

Example for case 2
------------------
Consider the following scenario:

- One cluster with MySQL servers A,B.
- One cluster with MySQL servers C,D.
- Replication A->C, D->B.

Any row changed at A will be replicated A->C, D->B, and at B
it will be applied a second time in the same cluster.
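
A rough way to count the applications is sketched below. It assumes that
a change is visible to every server of a cluster once one member has
applied it, that D's binary log carries the change with the originating
id of A, and that a slave rejects only events with its own server id.
The names CLUSTER and times_applied are hypothetical, and only one round
trip is modelled.

```python
CLUSTER = {"A": "cluster1", "B": "cluster1", "C": "cluster2", "D": "cluster2"}


def times_applied(origin, slaves_on_path):
    """Count how often each cluster applies a change created at `origin`.

    slaves_on_path lists the slave servers that receive the event, in
    order. Each slave rejects only events that carry its own server id.
    """
    counts = {"cluster1": 0, "cluster2": 0}
    counts[CLUSTER[origin]] += 1  # the original change at the origin
    for slave in slaves_on_path:
        if slave == origin:
            break  # own event: dropped, propagation stops here
        counts[CLUSTER[slave]] += 1
    return counts


# Channels A->C and D->B: the slaves on the path are C, then B.
result = times_applied("A", ["C", "B"])
```

Here result reports two applications for cluster1, because B does not
recognise the event from its co-member A.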

(This bug is weakly related to BUG#17095.)

How to repeat:
See scenarios above.

Suggested fix:
Introduce the possibility to filter events from multiple masters
on a slave:

  CHANGE MASTER TO SERVER_ID_FILTER=<list of server ids>;

Example:

  CHANGE MASTER TO SERVER_ID_FILTER=1,2,3;

The intention is that the slave will filter out all events whose
originating server id is 1, 2, or 3.

This is how one would issue the statement:

Case 1:
-------
When server B fails in A->B->C->A, one would:

1. Wait for C to process its entire relay log.  Then as much information
   from B as possible has been received by C.

2. On server C, execute CHANGE MASTER TO SERVER_ID_FILTER=B,C
   (where B,C are the numbers representing the servers)

3. On server C, execute CHANGE MASTER TO MASTER_HOST=A
   Now we have a circle again, but smaller.
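
The effect of step 2 on server C can be sketched as follows. This models
only the proposed filter semantics, not the real I/O thread; the class
Slave, its fields, and the server ids 1, 2, 3 for A, B, C are assumptions
made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Slave:
    server_id: int
    id_filter: frozenset = frozenset()  # proposed SERVER_ID_FILTER list
    relay_log: list = field(default_factory=list)

    def receive(self, event_server_id, payload):
        """Store the event unless its originating id is filtered."""
        if event_server_id == self.server_id:
            return False  # a server always drops its own events
        if event_server_id in self.id_filter:
            return False  # dropped by the proposed filter
        self.relay_log.append((event_server_id, payload))
        return True


# Server C (id 3) after CHANGE MASTER TO SERVER_ID_FILTER=2,3
c = Slave(server_id=3, id_filter=frozenset({2, 3}))
stale = c.receive(2, "event generated by failed server B")
fresh = c.receive(1, "event generated by server A")
```

The leftover event from B is rejected when A sends it to C again, while
events generated by A still reach the relay log of C.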

Case 2:
-------
- Replication A->C is set up by running on C:
  1. CHANGE MASTER TO MASTER_HOST=A, SERVER_ID_FILTER=C,D
  2. START SLAVE

- Replication D->B is set up by running on B:
  1. CHANGE MASTER TO MASTER_HOST=D, SERVER_ID_FILTER=A,B
  2. START SLAVE
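
The filter lists in this case follow one rule: each slave filters the
server ids of its own cluster. A minimal sketch of that rule, with the
hypothetical helpers filter_for and accepts and server names standing in
for numeric ids:

```python
CLUSTERS = {"cluster1": ["A", "B"], "cluster2": ["C", "D"]}


def filter_for(slave):
    """Return the ids a slave should filter: all members of its cluster."""
    for members in CLUSTERS.values():
        if slave in members:
            return sorted(members)
    raise KeyError(slave)


def accepts(slave, origin):
    """True if `slave` would accept an event that originated at `origin`."""
    return origin not in filter_for(slave)


filter_on_c = filter_for("C")   # list used on C for channel A->C
filter_on_b = filter_for("B")   # list used on B for channel D->B
a_to_c = accepts("C", "A")      # change from A is new to cluster 2
back_at_b = accepts("B", "A")   # change from A returning to cluster 1
```

Under these assumptions C accepts the change from A, and B rejects it
when it comes back through D, so the row is not applied twice.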

See also BUG#25998.

The patch is on the Bug #25998 page.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/49889

2717 Andrei Elkin 2008-07-16
Bug #25998 problems about circle replication
Bug #27808 Infinite looping in circular replication

When one of the servers is withdrawn from a circular multi-master replication group,
events generated by the removed server could become unstoppable (bug#25998).
That is because the originator had been the terminator of its own event flow.

The other possibility for an unstoppable event is cluster replication (bug#27808).
In that case an event generated by a member of a cluster was
replicated to another member, accepted, and executed.
By that time the effects of the event had already been propagated
across the cluster via the cluster communications.
To avoid double-applying, a replication event generated
by a co-member of the cluster should not be accepted.

Both variations of the unstoppable replication event can be fixed by
introducing a new option for CHANGE MASTER:

IGNORE_SERVER_IDS= (sid_1, sid_2, ... )

Setting the option to the empty list resets it.

Fixed by implementing the feature.

Properties of the feature:

a. an error is reported if the id of an ignored server is that of the slave
itself and the slave was started with --replicate-same-server-id;
b. CHANGE MASTER ... IGNORE_SERVER_IDS= (list) overrides the existing
IGNORE_SERVER_IDS list; an empty list argument clears
the current ignored list;
c. CHANGE MASTER without IGNORE_SERVER_IDS preserves the existing list;
d. the ignored server ids are preserved after RESET SLAVE;
e. SHOW SLAVE STATUS is extended with a new line listing ignored servers;
f. master.info gets a new line with the list of ignored servers;
g. Unlike the handling of --replicate-same-server-id, the SQL thread is not
concerned with the ignored server ids, because the relay log is assumed
to contain only events that cannot become unstoppable.
To guarantee that, e.g. in the case of circular replication with a failing
server, the DBA must change master using the new option.
h. Rotate and Format_description (FD) events that originate from the current
master are still relay-logged even if that master is in the ignored list;
this does not create any termination issue.
i. The list of ignored servers is kept sorted so that the filtering
algorithm runs as fast as possible.
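
Properties a, b, c, d and i can be summarised in a small model. This is a
sketch of the behaviour described above, not the server implementation:
the class IgnoredServerIds, its method names, and the use of a binary
search over the sorted list are assumptions.

```python
import bisect


class IgnoredServerIds:
    """Toy model of the ignored-server list kept by a slave."""

    def __init__(self, own_server_id, replicate_same_server_id=False):
        self.own_server_id = own_server_id
        self.replicate_same_server_id = replicate_same_server_id
        self.ids = []

    def change_master(self, ignore_server_ids=None):
        if ignore_server_ids is None:
            return  # (c) option not given: existing list is preserved
        new_ids = sorted(set(ignore_server_ids))  # (i) kept sorted
        if self.replicate_same_server_id and self.own_server_id in new_ids:
            # (a) conflicting request: ignore own id, yet replicate it
            raise ValueError("own server id cannot be ignored")
        self.ids = new_ids  # (b) override; an empty list clears it

    def reset_slave(self):
        pass  # (d) the ignored ids survive RESET SLAVE

    def is_ignored(self, server_id):
        # Binary search works because the list is sorted.
        i = bisect.bisect_left(self.ids, server_id)
        return i < len(self.ids) and self.ids[i] == server_id


slave = IgnoredServerIds(own_server_id=3)
slave.change_master(ignore_server_ids=[3, 2, 2])
slave.change_master()   # no IGNORE_SERVER_IDS given
slave.reset_slave()
kept = list(slave.ids)
slave.change_master(ignore_server_ids=[])
cleared = list(slave.ids)
```

After these calls kept holds the sorted, de-duplicated ids and cleared is
empty, matching properties b, c and d.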

Two new lines are added to the SHOW SLAVE STATUS output: the list of ignored servers and
the current master server id (yet another feature for the user!).

Use cases for this feature can be found on the bug report page.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/49968

2673 Andrei Elkin 2008-07-17
Bug #25998 problems about circle replication
Bug #27808 Infinite looping in circular replication

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

http://lists.mysql.com/commits/50006

2673 Andrei Elkin 2008-07-17
Bug #25998 problems about circle replication
Bug #27808 Infinite looping in circular replication

Re-opening this bug.  A bug should only be set to "duplicate"
if there is a reference to the bug it duplicates.

Duplicate of BUG#25998.

Pushed into 6.0.10-alpha (revid:luis.soares@sun.com-20090129165607-wiskabxm948yx463) (version source revid:luis.soares@sun.com-20090129163120-e2ntks4wgpqde6zt) (merge vers: 6.0.10-alpha) (pib:6)