Bug #81565: 2 of 3 GR nodes down, but replication_group_members reports otherwise
Submitted: 24 May 2016 8:45    Modified: 24 May 2016 8:48
Reporter: Keith Hollman
Status: Closed
Category: MySQL Server: Group Replication    Severity: S3 (Non-critical)
Version:    OS: Ubuntu (15.10)
Assigned to:    CPU Architecture: Any

[24 May 2016 8:45] Keith Hollman
Description:
I am testing Group Replication using the unreleased 18947309.mysql-advanced-5.7.14-linux-glibc2.5-x86_64.tar.gz build.
After configuring the group, I ran some tests to confirm everything was working (stopping and restarting all nodes one by one, with inserts between each restart). After killing the 2nd and 3rd nodes, the 1st node hung on an INSERT:

    server1> SELECT * FROM performance_schema.replication_group_members;
    +---------------------------+--------------------------------------+-------------+-------------+--------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+
    | group_replication_applier | 9aa83c31-20bc-11e6-aaa8-b86b23917877 | 127.0.0.1   |        3302 | ONLINE       |
    | group_replication_applier | 9aa840e3-20bc-11e6-aae3-b86b23917877 | 127.0.0.1   |        3303 | ONLINE       |
    | group_replication_applier | 9aa843a9-20bc-11e6-aa4d-b86b23917877 | 127.0.0.1   |        3301 | ONLINE       |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+
    3 rows in set (0,00 sec)

    server1>
    server1>
    server1>
    server1> INSERT INTO test.t1 VALUES (11),(12);

    Query OK, 2 rows affected (3 min 48,27 sec)
    Records: 2  Duplicates: 0  Warnings: 0

The INSERT only committed after bringing node 2 back up, 3 min 48 sec later. That much is expected: with two of the three members gone, the remaining member had lost the group majority and could not certify the transaction.
**
* Up to here all ok.
**
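
From a second session the stall can be confirmed as a waiting statement rather than an error; a minimal sketch, assuming (as in my sandbox) the surviving node listens on 127.0.0.1:3301 and a root login works:

    # Sketch only: check that the INSERT on the surviving node is still waiting.
    # Host, port and user are assumptions taken from the sandbox layout above.
    mysql -h 127.0.0.1 -P 3301 -u root -e "
        SELECT id, time, state, info
          FROM information_schema.processlist
         WHERE info LIKE 'INSERT INTO test.t1%';"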

But when I checked the status of the GR views from another session, I got:

    server1> SELECT * FROM performance_schema.replication_connection_status\G
    *************************** 1. row ***************************
                 CHANNEL_NAME: group_replication_applier
                   GROUP_NAME: 8a94f357-aab4-11df-86ab-c80aa9429562
                  SOURCE_UUID: 8a94f357-aab4-11df-86ab-c80aa9429562
                    THREAD_ID: NULL
                SERVICE_STATE: ON
    COUNT_RECEIVED_HEARTBEATS: 0
     LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
     RECEIVED_TRANSACTION_SET: 8a94f357-aab4-11df-86ab-c80aa9429562:1-20:23-24
            LAST_ERROR_NUMBER: 0
           LAST_ERROR_MESSAGE:
         LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
    *************************** 2. row ***************************
                 CHANNEL_NAME: group_replication_recovery
                   GROUP_NAME:
                  SOURCE_UUID:
                    THREAD_ID: NULL
                SERVICE_STATE: OFF
    COUNT_RECEIVED_HEARTBEATS: 0
     LAST_HEARTBEAT_TIMESTAMP: 0000-00-00 00:00:00
     RECEIVED_TRANSACTION_SET:
            LAST_ERROR_NUMBER: 0
           LAST_ERROR_MESSAGE:
         LAST_ERROR_TIMESTAMP: 0000-00-00 00:00:00
    2 rows in set (0,00 sec)

    server1> SELECT * FROM performance_schema.replication_group_member_stats\G
    *************************** 1. row ***************************
                          CHANNEL_NAME: group_replication_applier
                               VIEW_ID: 14639909840688467:11
                             MEMBER_ID: 9aa843a9-20bc-11e6-aa4d-b86b23917877
           COUNT_TRANSACTIONS_IN_QUEUE: 0
            COUNT_TRANSACTIONS_CHECKED: 2
              COUNT_CONFLICTS_DETECTED: 0
         COUNT_TRANSACTIONS_VALIDATING: 3
    TRANSACTIONS_COMMITTED_ALL_MEMBERS: 8a94f357-aab4-11df-86ab-c80aa9429562:1-20
        LAST_CONFLICT_FREE_TRANSACTION: 8a94f357-aab4-11df-86ab-c80aa9429562:24
    1 row in set (0,00 sec)

    server1> SELECT * FROM performance_schema.replication_group_members;
    +---------------------------+--------------------------------------+-------------+-------------+--------------+
    | CHANNEL_NAME              | MEMBER_ID                            | MEMBER_HOST | MEMBER_PORT | MEMBER_STATE |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+
    | group_replication_applier | 9aa83c31-20bc-11e6-aaa8-b86b23917877 | 127.0.0.1   |        3302 | ONLINE       |
    | group_replication_applier | 9aa840e3-20bc-11e6-aae3-b86b23917877 | 127.0.0.1   |        3303 | ONLINE       |
    +---------------------------+--------------------------------------+-------------+-------------+--------------+
    2 rows in set (0,00 sec)

However, it was the mysqld processes on 3302 and 3303 that I killed:

    root@khollman-es:/usr/local/mysql-5.7.14/rapid/plugin/group_replication/src# ps -ef | grep mysqld
    mysql     8420  5877  0 10:21 pts/2    00:00:02 bin/mysqld --defaults-file=my.cnf --user=mysql
    root     10597  5879  0 10:52 pts/12   00:00:00 grep --color=auto mysqld

In the error log, fewer than 4 iterations of the following occur:
    2016-05-23T08:40:18.891768Z 5 [Note] Plugin group_replication reported: '[XCOM_BINDING_DEBUG] ::send_message()'
    2016-05-23T08:40:18.891839Z 5 [Note] Plugin group_replication reported: '[XCOM_BINDING_DEBUG] ::send_binding_message():: Pipelining message with payload length 86'
    2016-05-23T08:40:18.891857Z 5 [Note] Plugin group_replication reported: '[XCOM_BINDING_DEBUG] ::send_binding_message():: Sending message with payload length 86'
    2016-05-23T08:40:18.891950Z 5 [Note] Plugin group_replication reported: '[XCOM_BINDING_DEBUG] ::send_message():: send_binding_message enum_gcs_error result= 0. Bytes sent:74'
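
To pick out actual membership / view change notes from that XCOM debug chatter, something along these lines can be used (the error-log path is an assumption; adjust to the sandbox's datadir):

    # Sketch only: filter the error log for group membership / view changes,
    # skipping the XCOM_BINDING_DEBUG noise. The log path is an assumption.
    grep 'group_replication' data/server1/error.log \
      | grep -v 'XCOM_BINDING_DEBUG' \
      | grep -Ei 'view|member|unreachable|joined|left'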

How to repeat:
- Once all 3 servers are up and running, kill 2 of them and run a single INSERT on the remaining server (see the sketch below).
- In another session on the remaining server, query the performance_schema.replication_group_members view.
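
A rough reproduction sketch follows; ports, credentials and the way the two victims are matched by pkill are assumptions tied to my sandbox layout (a plain kill -9 on the two pids works just as well):

    # Sketch only: all ports, credentials and process-matching patterns are
    # assumptions; substitute whatever fits your own sandbox.

    # 1. Kill two of the three members (here the ones on ports 3302 and 3303).
    #    This assumes the port appears on each mysqld command line; otherwise
    #    kill -9 the two pids directly.
    pkill -9 -f 'mysqld.*--port=3302'
    pkill -9 -f 'mysqld.*--port=3303'

    # 2. Run a single INSERT on the remaining member; it should block while
    #    the group has no majority.
    mysql -h 127.0.0.1 -P 3301 -u root -e "INSERT INTO test.t1 VALUES (11),(12);" &

    # 3. From another session, check what the membership view reports.
    mysql -h 127.0.0.1 -P 3301 -u root -e \
        "SELECT MEMBER_HOST, MEMBER_PORT, MEMBER_STATE
           FROM performance_schema.replication_group_members;"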

Suggested fix:
I was expecting the view to at least reflect the fact that 3302 and 3303 were down, or even that 3301 (the remaining server) was frozen, waiting to regain a majority; for example, something along the lines of the sketch below.
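
For illustration only, the kind of result I would have expected on the surviving node; the UNREACHABLE member state is what Group Replication uses for members the local node has lost contact with, and the output shown in the comments is hypothetical, not what the server actually printed:

    # Hypothetical expected output on the surviving node (3301).
    mysql -h 127.0.0.1 -P 3301 -u root -e \
        "SELECT MEMBER_PORT, MEMBER_STATE
           FROM performance_schema.replication_group_members;"
    # 3301  ONLINE       (blocked, waiting to regain a majority)
    # 3302  UNREACHABLE
    # 3303  UNREACHABLE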