Bug #84574 In Group Replication, DDL execute on partitioned node leads to split brain
Submitted: 20 Jan 2017 0:39
Modified: 9 Feb 2017 12:53
Reporter: Przemyslaw Malkowski
Status: Not a Bug
Category: MySQL Server: Group Replication
Severity: S2 (Serious)
Version: 5.7.17
OS: Any
Assigned to:
CPU Architecture: Any
Tags: group replication

[20 Jan 2017 0:39] Przemyslaw Malkowski
Description:
DDLs on a partitioned node throw an error but are still executed.

cd81c1dadb18 {root} ((none)) > select @@version,@@version_comment;
+------------+------------------------------+
| @@version  | @@version_comment            |
+------------+------------------------------+
| 5.7.17-log | MySQL Community Server (GPL) |
+------------+------------------------------+
1 row in set (0.00 sec)

How to repeat:
Separate one node from cluster by cutting network, disable read_only, then try any DDL.

cd81c1dadb18 {root} ((none)) > SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST  | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
| group_replication_applier | 24d6ef6f-dc3f-11e6-abfa-0242ac130004 | cd81c1dadb18 |        3306 | ERROR        |
+---------------------------+--------------------------------------+--------------+-------------+--------------+
1 row in set (0.00 sec)

cd81c1dadb18 {root} ((none)) > set global read_only=0;
Query OK, 0 rows affected (0.00 sec)

cd81c1dadb18 {root} ((none)) > show tables in test1;        
+-----------------+
| Tables_in_test1 |
+-----------------+
| t1              |
+-----------------+
1 row in set (0.00 sec)

cd81c1dadb18 {root} ((none)) > create table test1.split_brain (id int primary key);       
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

cd81c1dadb18 {root} ((none)) > show tables in test1;                               
+-----------------+
| Tables_in_test1 |
+-----------------+
| split_brain     |
| t1              |
+-----------------+
2 rows in set (0.00 sec)

cd81c1dadb18 {root} ((none)) > insert into test1.split_brain values (1);
ERROR 3100 (HY000): Error on observer while running replication hook 'before_commit'.

cd81c1dadb18 {root} ((none)) > select * from test1.split_brain;
Empty set (0.00 sec)

Suggested fix:
On partitioned nodes, DDLs should be disabled the same way as the DMLs are.
[20 Jan 2017 10:18] Umesh Shastry
Hello Przemyslaw Malkowski,

Thank you for the report.
Observed this with the 5.7.17 build.

Thanks,
Umesh
[26 Jan 2017 6:22] Erlend Dahl
Posted by developer:

[20 Jan 2017 2:27] Nuno Carvalho
Hi Umesh,

This is not a bug; the user is clearly disabling the read_only safeguard.
"""
Separate one node from cluster by cutting network, disable read_only, then
try any DDL.

set global read_only=0;
"""

The user must not change the read_only safeguard.

Best regards,
Nuno Carvalho
[26 Jan 2017 7:44] Przemyslaw Malkowski
Hello Nuno,

I did disable read_only on purpose, but the node still appears to refuse writes with ERROR 3100. However, while DMLs are actually refused, DDLs are not.
Isn't this inconsistent behavior a bug? Why should there be any difference between how DML and DDL are treated?
[8 Feb 2017 12:53] Nuno Carvalho
Posted by developer:
 
Hi Przemyslaw,

Thank you for your reply.

As I said before, you disabled the read_only safeguard, which is what ensures that no writes are done when a member goes into ERROR state, by rejecting any statement before it makes any change.

As for the difference between DML and DDL when you disable the read_only safeguard: as you know, DDL is not transactional (nor atomic) in MySQL; that is, once it changes something there is no way to roll it back. This has always been the legacy behaviour, and it is the behaviour we are seeing when you disable the safeguard.
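
To make the distinction concrete, here is a minimal Python sketch of the behaviour described in this thread. It is a toy model only, not MySQL's actual code path: ToyMember, HookError, ReadOnlyError and attempt are hypothetical names invented for illustration, and the hook is assumed to always fail, as it did on the partitioned node in the report.

```python
class HookError(Exception):
    """Stands in for ERROR 3100 from the 'before_commit' hook."""


class ReadOnlyError(Exception):
    """Stands in for the rejection issued while the safeguard is on."""


class ToyMember:
    """Toy model of a member in ERROR state (hypothetical, not MySQL code)."""

    def __init__(self, read_only=True):
        self.read_only = read_only   # the safeguard set when the member fails
        self.tables = {"t1": []}     # table name -> list of rows

    def _check_safeguard(self):
        # With the safeguard on, a statement is rejected before any change.
        if self.read_only:
            raise ReadOnlyError("rejected: member is read-only")

    def _before_commit(self):
        # Assumption: on the partitioned member this hook always fails.
        raise HookError(
            "Error on observer while running replication hook 'before_commit'."
        )

    def create_table(self, name):
        self._check_safeguard()
        self.tables[name] = []       # non-atomic DDL: applied immediately
        self._before_commit()        # fails, but nothing undoes the line above

    def insert(self, name, row):
        self._check_safeguard()
        pending = self.tables[name] + [row]  # transactional DML: work on a copy
        self._before_commit()                # fails, so the copy is discarded
        self.tables[name] = pending


def attempt(action, *args):
    """Run a statement and report which error, if any, it raised."""
    try:
        action(*args)
        return "ok"
    except (HookError, ReadOnlyError) as exc:
        return type(exc).__name__


if __name__ == "__main__":
    member = ToyMember(read_only=False)  # safeguard disabled, as in the report
    print(attempt(member.create_table, "split_brain"), sorted(member.tables))
    print(attempt(member.insert, "split_brain", 1), member.tables["split_brain"])
```

With read_only=True the model rejects both statements before touching anything. With read_only=False both statements fail in the hook, but only the insert leaves no trace, which mirrors the session in the original report.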

Best regards,
Nuno Carvalho
[9 Feb 2017 12:53] Umesh Shastry
Thank you, Nuno, for the explanation.
Reverting status back to "Not a Bug" (!bg).

Thanks,
Umesh