Bug #92981 Potential connection leak in group replication
Submitted: 29 Oct 2018 4:33 Modified: 14 Oct 2019 11:15
Reporter: Tony Wen
Status: Not a Bug
Category: MySQL Server: Group Replication    Severity: S2 (Serious)
Version: 5.7.22    OS: Oracle Linux
Assigned to: MySQL Verification Team    CPU Architecture: Any

[29 Oct 2018 4:33] Tony Wen
Description:
We have an InnoDB cluster deployed in a Kubernetes environment. The mysql nodes are deployed as Kubernetes pods. After running for a while, the mysql primary node began failing with a "Too many open files" error and could no longer accept connections. I suspect this is related to the group replication rejoin mechanism, since I found no other activity in the mysql log before the error.

From the error logs, we can see that a mysql slave node failed to join the group and kept retrying. The sequence was as follows:
1. The slave node made a join request.
2. The primary node received the join request but reported that an "Old incarnation" was found (I'm not sure whether the slave was added to the group at this step).
3. The slave node failed to join the group due to a timeout. Since the mysql nodes are deployed in a Kubernetes environment, the slave node was restarted and went back to step 1, again and again.

Part of the primary log:
2018-10-27T11:44:06.785294Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406406467164180.'
2018-10-27T11:46:05.473817Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406407654014140.'
2018-10-27T11:53:07.741542Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406411876682540.'
2018-10-27T11:55:05.509186Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406413054387170.'
2018-10-27T12:02:10.791283Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406417307203420.'
2018-10-27T12:04:05.599144Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406418455289980.'
2018-10-27T12:11:11.736152Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406422716527940.'
2018-10-27T12:13:06.494944Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node t-f2qzkygryb4juqsk7a6j-mysql2-0.mysql.default.svc.cluster.local:6606 15406423864235160.'
2018-10-27T12:19:44.150421Z 0 [ERROR] Error in accept: Too many open files
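The cadence of the failed joins can be pulled out of the primary's log mechanically. A minimal sketch (a few of the log lines above, with the node name elided), showing the gaps between consecutive "Old incarnation" notes:

```python
import re
from datetime import datetime

# A few lines copied from the primary's error log above (node name elided).
LOG = """\
2018-10-27T11:44:06.785294Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node ...'
2018-10-27T11:46:05.473817Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node ...'
2018-10-27T11:53:07.741542Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node ...'
2018-10-27T11:55:05.509186Z 0 [Note] Plugin group_replication reported: 'Old incarnation found while trying to add node ...'
"""

def rejoin_intervals(log_text):
    """Return the gaps, in seconds, between consecutive 'Old incarnation' notes."""
    timestamps = [
        datetime.strptime(m.group(1), "%Y-%m-%dT%H:%M:%S.%fZ")
        for m in re.finditer(
            r"^(\S+) 0 \[Note\].*Old incarnation found", log_text, re.M
        )
    ]
    return [(b - a).total_seconds() for a, b in zip(timestamps, timestamps[1:])]

print(rejoin_intervals(LOG))
```

On the excerpt above the gaps alternate between roughly two and seven minutes, consistent with the pod being restarted and retrying the join on a backoff-like schedule.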

Part of the slave node's rejoin failure log:
2018-10-27T12:13:06.449439Z 2 [Note] Plugin group_replication reported: 'Group Replication applier module successfully initialized!'
2018-10-27T12:13:06.449463Z 2 [Note] Plugin group_replication reported: 'auto_increment_increment is set to 7'
2018-10-27T12:13:06.449470Z 2 [Note] Plugin group_replication reported: 'auto_increment_offset is set to 102'
2018-10-27T12:13:06.489833Z 0 [Note] Plugin group_replication reported: 'XCom protocol version: 3'
2018-10-27T12:13:06.489862Z 0 [Note] Plugin group_replication reported: 'XCom initialized and ready to accept incoming connections on port 6606'
2018-10-27T12:13:06.495703Z 0 [Warning] Plugin group_replication reported: 'read failed'
2018-10-27T12:13:06.520693Z 0 [ERROR] Plugin group_replication reported: '[GCS] The member was unable to join the group. Local port: 6606'
2018-10-27T12:14:06.449598Z 2 [ERROR] Plugin group_replication reported: 'Timeout on wait for view after joining group'
2018-10-27T12:14:06.449644Z 2 [Note] Plugin group_replication reported: 'Requesting to leave the group despite of not being a member'
2018-10-27T12:14:06.449662Z 2 [ERROR] Plugin group_replication reported: '[GCS] The member is leaving a group without being on one.'

How to repeat:
The group is deployed in Kubernetes. I'm not sure how to make the primary node keep the old incarnation of the rejoining node.

Suggested fix:
Our application code has worked well with mysql 5.7.20 for quite a while. This issue appeared after we upgraded to 5.7.22, and I have double-checked our application code and found no connection leak. So I think there may be an issue in the mysql rejoin logic. Please help check this. Thanks a lot!
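One rough way to decide between an application-side leak and the rejoin loop is to check whether the file-descriptor growth rate matches the join-retry rate. A back-of-envelope sketch, with entirely hypothetical numbers (the leak-per-attempt figure is an assumption, not something confirmed by this report):

```python
def hours_to_fd_exhaustion(open_files_limit, baseline_fds,
                           fds_leaked_per_attempt, attempts_per_hour):
    """Estimate hours until mysqld hits 'Too many open files',
    assuming each failed join attempt leaves some sockets open."""
    headroom = open_files_limit - baseline_fds
    return headroom / (fds_leaked_per_attempt * attempts_per_hour)

# Hypothetical numbers: open_files_limit of 5000, ~500 fds in normal use,
# 2 sockets leaked per failed join, ~13 attempts/hour (one every ~4.5 min).
print(hours_to_fd_exhaustion(5000, 500, 2, 13))  # roughly 173 hours
```

If the observed time-to-failure is in the same ballpark as an estimate like this, the leak is correlated with the rejoin attempts rather than with application connections.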
[29 Oct 2018 4:45] Tony Wen
mysql error logs

Attachment: mysql-bug-92981-log.zip (application/zip, text), 504.65 KiB.

[29 Oct 2018 17:19] MySQL Verification Team
Hi Tony,

I have seen a number of reports of network-related issues with MySQL on Kubernetes, and with MySQL running on Azure. None of them are reproducible with regular servers or regular VMs and our binaries.

Were you able to reproduce the problem on a locally run Kubernetes (using minikube, for example), or only on some third-party VM system (Google Cloud, for example)? If so, can you prepare a system that reproduces your problem that I can deploy locally, so I can test what's going on? The simple setup I made does not reproduce this.

Thanks
Bogdan
[30 Oct 2018 1:03] Tony Wen
Hi Bogdan,

I didn't experience this issue in my local K8S environment. The issue was seen in a K8S cluster created with OCI OKE, but I will try to reproduce it. Do you know in which situations the primary node gives this notification: "Old incarnation found while trying to add node ..."? I want to simulate a similar situation and see if the issue can be reproduced. Thanks!

Tony
[14 Oct 2019 11:15] MySQL Verification Team
Hi,

You might want to look at related bug #97207.

This is actually a change from 5.7.22, and it is documented:

[quote]
It is possible for a member to go offline for a short time, then attempt to rejoin the group again before the failure detection mechanism has detected its failure, and before the group has been reconfigured to remove the member. In this situation, the rejoining member forgets its previous state, but if other members send it messages that are intended for its pre-crash state, this can cause issues including possible data inconsistency. If a member in this situation participates in XCom's consensus protocol, it could potentially cause XCom to deliver different values for the same consensus round, by making a different decision before and after failure.

To counter this possibility, from MySQL 5.7.22, servers are given a unique identifier when they join a group. This enables Group Replication to be aware of the situation where a new incarnation of the same server (with the same address but a new identifier) is trying to join the group while its old incarnation is still listed as a member. The new incarnation is blocked from joining the group until the old incarnation can be removed by a reconfiguration. If Group Replication is stopped and restarted on the server, the member becomes a new incarnation and cannot rejoin until the suspicion times out.
[/quote]

More here https://dev.mysql.com/doc/refman/5.7/en/group-replication-group-membership.html

8.0 has an option https://dev.mysql.com/doc/refman/8.0/en/group-replication-options.html#sysvar_group_replic... to control this; there is no such thing in 5.7 for now.
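The quoted behaviour explains the loop in this report: a crash-looping pod always comes back as a new incarnation before the group has expelled the old one. A minimal sketch of that timing rule (the five-minute suspicion window below is a hypothetical value for illustration, not a documented 5.7 default):

```python
def rejoin_blocked(restart_after_s, suspicion_timeout_s):
    """Per the 5.7.22 behaviour quoted above: a restarted server is a new
    incarnation, and its join is refused while the old incarnation is
    still listed as a member, i.e. until the suspicion times out."""
    return restart_after_s < suspicion_timeout_s

# A pod that crash-loops and retries after ~60 s, against a hypothetical
# suspicion window of 5 minutes, is rejected as an "old incarnation";
# waiting out the window would let the join succeed.
print(rejoin_blocked(60, 300))   # True  -> join refused, member retries
print(rejoin_blocked(400, 300))  # False -> old incarnation already removed
```

Under this model, any restart loop faster than the suspicion window never succeeds, which matches the repeated "Old incarnation found" notes in the primary's log.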