Bug #110821 MySQL Operator: Expose Router Service as LoadBalancer doesn't work
Submitted: 26 Apr 15:39 Modified: 18 Jun 1:32
Reporter: Christopher Feldhues
Status: No Feedback
Category: MySQL Operator    Severity: S3 (Non-critical)
Version: 8.0.32    OS: Any
Assigned to:    CPU Architecture: Any

[26 Apr 15:39] Christopher Feldhues
Description:
Hey!
I want to deploy a normal InnoDBCluster to K8s. Most of our applications don't run on Kubernetes, so I want to make the Service reachable from outside the cluster. My idea was to create a Service of type "LoadBalancer", because the team in our company that provides the k8s cluster supports external load balancers.
When I do this, I get these errors in my MySQL Router:

2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.4.1:60381 (now 40)
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] 100.96.3.1:44751 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.3.1:44751 (now 42)
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] 100.96.2.1:9061 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.2.1:9061 (now 41)
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] 100.96.5.1:26197 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.5.1:26197 (now 40)
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] 100.96.1.1:36121 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] 100.96.4.1:24957 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.4.1:24957 (now 41)
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.1.1:36121 (now 41)
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] 100.96.0.1:34387 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.0.1:34387 (now 42)
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] 100.96.3.1:3065 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.3.1:3065 (now 43)
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] 100.96.2.1:43190 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.2.1:43190 (now 42)
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] 100.96.5.1:45158 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d115bc700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.5.1:45158 (now 41)
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] 100.96.1.1:4315 closed connection before finishing handshake
2023-04-26 15:33:28 routing INFO [7f4d11dbd700] [routing:bootstrap_rw] incrementing error counter for host of 100.96.1.1:4315 (now 42)

Simultaneously the primary server prints:

2023-04-26T15:33:28.901869Z 14043 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:28.902103Z 14045 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:30.107302Z 14049 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:30.122941Z 14050 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:31.956024Z 14056 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:32.443513Z 14057 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:34.971282Z 14063 [Note] [MY-010914] [Server] Bad handshake
2023-04-26T15:33:35.203371Z 14065 [Note] [MY-010914] [Server] Bad handshake

When I change the service back to ClusterIP, the errors disappear.

How to repeat:
apiVersion: mysql.oracle.com/v2
kind: InnoDBCluster
metadata:
  name: mysql-test-db-cluster
  namespace: mysql-test-db
spec:
  secretName: mysql-test-db-secret
  tlsUseSelfSigned: true
  instances: 2
  version: 8.0.32
  router:
    instances: 1
    version: 8.0.32
  datadirVolumeClaimTemplate:
    accessModes:
      - ReadWriteOnce
    resources:
      requests:
        storage: 10Gi
  backupProfiles:
  - name: myfancyprofile  # Embedded backup profile
    dumpInstance:         # MySQL Shell Dump
      storage:
        persistentVolumeClaim:
          claimName: myexample-pvc # store to this pre-existing PVC
  backupSchedules:
  - name: mygreatschedule
    schedule: "0 0 * * *" # Daily, at midnight
    backupProfileName: myfancyprofile # reference the desired backupProfile's name
    enabled: true # backup schedules can be temporarily disabled
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: myexample-pvc
spec:
  storageClassName: default
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  name: mysql-test-db-cluster-outgoing
  namespace: mysql-test-db
  annotations:
    external-dns.alpha.kubernetes.io/hostname: mysql-test-db.dbs-systemtest-t.k8s.lvm.de
spec:
  type: LoadBalancer
  ports:
  - name: mysql
    port: 3306
    protocol: TCP
    targetPort: 6446
  selector:
    component: mysqlrouter
    mysql.oracle.com/cluster: mysql-test-db-cluster
    tier: mysql
[17 May 23:12] MySQL Verification Team
Hi,

Thank you for your interest in MySQL. I am not 100% sure this is a bug, but I am seeing the same issues myself, so I'll see whether we can either fix the problem or get the documentation to better explain how to solve this.
[17 May 23:41] Johannes Schlüter
Posted by developer:
 
The error indicates that something is connecting to the port without speaking the MySQL protocol. As a consequence, MySQL blocks the connection.

Could it be that your load balancer has some form of keep-alive check enabled?

For reference, this works for me:

apiVersion: v1
kind: Service
metadata:
  name: mynodeport
  namespace: demo
spec:
  ports:
  - name: mysql
    nodePort: 30134
    port: 3306
    protocol: TCP
    targetPort: 6446
  - name: mysql-ro
    nodePort: 32493
    port: 6447
    protocol: TCP
    targetPort: 6447
  selector:
    component: mysqlrouter
    mysql.oracle.com/cluster: mywithmetric
    tier: mysql
  type: LoadBalancer
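
If the load balancer is indeed opening plain TCP keep-alive or health-check connections against the MySQL port, one possible mitigation is to set externalTrafficPolicy: Local on the Service. This is only a sketch, not a verified fix: it assumes the load-balancer integration honors the HTTP health-check node port that Kubernetes allocates for such Services, which the common cloud integrations do but an in-house load balancer may not. As a side effect it also preserves the client source IP, so health-check traffic becomes identifiable in the Router log.

apiVersion: v1
kind: Service
metadata:
  name: mynodeport
  namespace: demo
spec:
  # With "Local", conforming LB controllers probe the allocated
  # healthCheckNodePort (plain HTTP served by kube-proxy) instead of
  # opening TCP connections to the Router's MySQL ports, and the
  # original client source IP is preserved.
  externalTrafficPolicy: Local
  ports:
  - name: mysql
    port: 3306
    protocol: TCP
    targetPort: 6446
  - name: mysql-ro
    port: 6447
    protocol: TCP
    targetPort: 6447
  selector:
    component: mysqlrouter
    mysql.oracle.com/cluster: mywithmetric
    tier: mysql
  type: LoadBalancer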
[19 Jun 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[1 Jul 12:18] Rui Mao
This seems to be an old known issue, Bug #90809, and I ran into it too.
It seems the solution was only documented and no code was changed, but I cannot agree with that.

There are at least two situations that can cause critical problems:
1. The application may not connect through the LB for a long time while the health checks keep incrementing the error counter; once the maximum error count is reached, the application gets into trouble.
2. Some types of LB use two private IPs as the source of connections, and both actively run health checks. In general one IP is the primary and the other is the backup, and all traffic passes through the primary IP. The backup IP is never used unless the LB fails over to it, so it will eventually be blocked by MySQL Router no matter what max_connect_errors is set to. As you can imagine, if the LB runs into a problem and fails over to the backup IP, the application can no longer reach the database and you have a disaster.

My suggestion would be a whitelist of LB IPs: health checks and similar actions from these IPs would be ignored and not counted as connection errors.
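
Until something like that exists, a partial Kubernetes-level workaround may be to give the load balancer a health-check target that speaks HTTP rather than the MySQL protocol. The sketch below is only an idea, not a tested configuration: it assumes the Router pods have their HTTP/REST interface enabled (port 8443 by default) and that the load balancer can be told to run its health probe against that port instead of the forwarded MySQL ports.

apiVersion: v1
kind: Service
metadata:
  name: mysql-test-db-cluster-outgoing
  namespace: mysql-test-db
spec:
  type: LoadBalancer
  ports:
  - name: mysql
    port: 3306
    protocol: TCP
    targetPort: 6446
  # Extra port exposing the Router's HTTP interface; if the LB health
  # probe is pointed here, its probe IPs never touch the MySQL-protocol
  # listeners and cannot accumulate connection errors in the Router.
  - name: router-http
    port: 8443
    protocol: TCP
    targetPort: 8443
  selector:
    component: mysqlrouter
    mysql.oracle.com/cluster: mysql-test-db-cluster
    tier: mysql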