Description:
Using unreleased 18947309.mysql-advanced-5.7.14-linux-glibc2.5-x86_64.tar.gz Group Replication for testing.
After configuring, I ran some tests to make sure all is working fine (stopping & restarting all nodes one by one with inserts inbetween each restart) and after killing 2nd &
3rd node, 1st node hang with an INSERT:
server1> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID |
MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 9aa83c31-20bc-11e6-aaa8-b86b23917877 |
127.0.0.1 | 3302 | ONLINE |
| group_replication_applier | 9aa840e3-20bc-11e6-aae3-b86b23917877 |
127.0.0.1 | 3303 | ONLINE |
| group_replication_applier | 9aa843a9-20bc-11e6-aa4d-b86b23917877 |
127.0.0.1 | 3301 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
3 rows in set (0,00 sec)
server1>
server1>
server1>
server1> INSERT INTO test.t1 VALUES (11),(12);
Query OK, 2 rows affected (3 min 48,27 sec)
Records: 2 Duplicates: 0 Warnings: 0
The insert only committed after bringing node 2 up again, after 3 mins &
48 secs.
**
* Up to here all ok.
**
But when I checked the status of GR views from another session, I get:
server1> SELECT * FROM
performance_schema.replication_connection_status\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
GROUP_NAME: 8a94f357-aab4-11df-86ab-c80aa9429562
SOURCE_UUID: 8a94f357-aab4-11df-86ab-c80aa9429562
THREAD_ID: NULL
SERVICE_STATE: ON
COUNT_RECEIVED_HEARTBEATS: 0
LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
RECEIVED_TRANSACTION_SET:
8a94f357-aab4-11df-86ab-c80aa9429562:1-20:23-24
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
*************************** 2. row ***************************
CHANNEL_NAME: group_replication_recovery
GROUP_NAME:
SOURCE_UUID:
THREAD_ID: NULL
SERVICE_STATE: OFF
COUNT_RECEIVED_HEARTBEATS: 0
LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
RECEIVED_TRANSACTION_SET:
LAST_ERROR_NUMBER: 0
LAST_ERROR_MESSAGE:
LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
2 rows in set (0,00 sec)
server1> SELECT * FROM
performance_schema.replication_group_member_stats\G
*************************** 1. row ***************************
CHANNEL_NAME: group_replication_applier
VIEW_ID: 14639909840688467:11
MEMBER_ID: 9aa843a9-20bc-11e6-aa4d-b86b23917877
COUNT_TRANSACTIONS_IN_QUEUE: 0
COUNT_TRANSACTIONS_CHECKED: 2
COUNT_CONFLICTS_DETECTED: 0
COUNT_TRANSACTIONS_VALIDATING: 3
TRANSACTIONS_COMMITTED_ALL_MEMBERS:
8a94f357-aab4-11df-86ab-c80aa9429562:1-20
LAST_CONFLICT_FREE_TRANSACTION:
8a94f357-aab4-11df-86ab-c80aa9429562:24
1 row in set (0,00 sec)
server1> SELECT * FROM performance_schema.replication_group_members;
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| CHANNEL_NAME | MEMBER_ID |
MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
| group_replication_applier | 9aa83c31-20bc-11e6-aaa8-b86b23917877 |
127.0.0.1 | 3302 | ONLINE |
| group_replication_applier | 9aa840e3-20bc-11e6-aae3-b86b23917877 |
127.0.0.1 | 3303 | ONLINE |
+---------------------------+--------------------------------------+-------------+-------------+--------------+
2 rows in set (0,00 sec)
however, it was the mysqld's on 3302 & 3303 that I killed:
root@khollman-es:/usr/local/mysql-5.7.14/rapid/plugin/group_replication/src#
ps -ef | grep mysqld
mysql 8420 5877 0 10:21 pts/2 00:00:02 bin/mysqld
--defaults-file=my.cnf --user=mysql
root 10597 5879 0 10:52 pts/12 00:00:00 grep --color=auto mysqld
In the error log, less than 4 iterations of the following occur:
2016-05-23T08:40:18.891768Z 5 [Note] Plugin group_replication
reported: '[XCOM_BINDING_DEBUG] ::send_message()'
2016-05-23T08:40:18.891839Z 5 [Note] Plugin group_replication
reported: '[XCOM_BINDING_DEBUG] ::send_binding_message()::
Pipelining message with payload length 86'
2016-05-23T08:40:18.891857Z 5 [Note] Plugin group_replication
reported: '[XCOM_BINDING_DEBUG] ::send_binding_message():: Sending
message with payload length 86'
2016-05-23T08:40:18.891950Z 5 [Note] Plugin group_replication
reported: '[XCOM_BINDING_DEBUG] ::send_message()::
send_binding_message enum_gcs_error result= 0. Bytes sent:74'
How to repeat:
- once all 3 servers are up and running, kill 2 of them, and on the remaining server, run a single insert.
- in another session, query the performance_schema.replication_group_members view.
Suggested fix:
I was expecting the view to at least reflect the fact that 3302 & 3303 were down, or even, that 3301 (the remaining server) was frozen / waiting to become the majority again.