MySQL Bugs: #86259: Router reconnect time too long

Bug #86259	Router reconnect time too long
Submitted:	10 May 2017 6:45	Modified:	14 Jul 2017 23:26
Reporter:	Mario Beck	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Router	Severity:	S2 (Serious)
Version:	2.1.3	OS:	Oracle Linux
Assigned to:		CPU Architecture:	Any
Tags:	InnoDB Cluster, reconnect

Description:
I found some strange behavior in Router when reconnecting after primary failure in InnoDB Cluster.
The reconnect takes unnecessarily long, depending on the number of connections through router:

one connection: 10sec
Two connections: 20sec
Three connections: 30sec
Four connections: 40sec
Seven connections: 75sec

The problem is router. Group Replication elects the new primary within 5s in each case.
Only via router it takes longer.
I don't want to think what happens in a php world when you have hundred of connections...

How to repeat:
I run a sandbox cluster on ports 4001, 4002, 4003. Node 4001 or 4002 is
usually primary. Router runs on default port 6446.

To see the behavior of GR (without router) I run

   while /bin/true; do
       mysql --silent -N -h 127.0.0.1 -P 4003 -uroot -proot -e
           "SELECT now(), MEMBER_PORT
                          FROM performance_schema.replication_group_members 
                          WHERE MEMBER_ID =
                                (SELECT VARIABLE_VALUE from performance_schema.global_status
                                     WHERE VARIABLE_NAME='group_replication_primary_member');"
           2> /dev/null ;
       sleep 1;
   done

From the output you clearly see that GR takes five seconds to elect a new primary after the old primary was killed.

The client connections are simulated via router:

   while /bin/true; do
       mysql --silent -N -h 127.0.0.1 *-P 6446 *-uroot -proot -e
          "*SELECT NOW(),@@port*" 2> /dev/null ;
       sleep 1;
   done

So I try a reconnect every second. From the output you see that the
reconnect hangs for 75 seconds (7 scripts running in parallel)
(Please ignore the date. My VM is "outdated". Clock is not synchronized
with host OS)

2017-03-12 13:28:48    4002
2017-03-12 13:28:49    4002
2017-03-12 13:28:50    4002     // this is when I did a kill -9 on node 4002, which was the primary
2017-03-12 13:30:05    4001
2017-03-12 13:30:06    4001
2017-03-12 13:30:08    4001

Remarkably there is one connection that succeeds in getting a connection
briefly after failover but one second later it hangs as well:

2017-03-12 13:28:46    4002
2017-03-12 13:28:48    4002
2017-03-12 13:28:49    4002
2017-03-12 13:28:50    4002    // here was the kill -9 on 4002
2017-03-12 13:28:58    4001    // failover happened after 8 seconds. Good.
2017-03-12 13:30:05    4001    // But the next connect was stalled for 67 seconds
2017-03-12 13:30:06    4001

Suggested fix:
I have no clue.
But I see no reason why router needs to wait after primary failover. A reconnect should be possible immediately after GR elected the new primary and set super_read_only=OFF on this new primary. Even if router is adding some extra timeout, this delay should be independent of the number of connection attempts.

verified as reported. Thanks for the report
Bogdan

Posted by developer:
 
Fixed as of the upcoming MySQL Router 2.1.4 release, and here's the changelog entry:

After a primary failure within an InnoDB cluster, MySQL Router required an
unusual amount of time to reconnect.

Thank you for the bug report.