Description:
I found some strange behavior in Router when reconnecting after primary failure in InnoDB Cluster.
The reconnect takes unnecessarily long, depending on the number of connections through router:
one connection: 10sec
Two connections: 20sec
Three connections: 30sec
Four connections: 40sec
Seven connections: 75sec
The problem is router. Group Replication elects the new primary within 5s in each case.
Only via router it takes longer.
I don't want to think what happens in a php world when you have hundred of connections...
How to repeat:
I run a sandbox cluster on ports 4001, 4002, 4003. Node 4001 or 4002 is
usually primary. Router runs on default port 6446.
To see the behavior of GR (without router) I run
while /bin/true; do
mysql --silent -N -h 127.0.0.1 -P 4003 -uroot -proot -e
"SELECT now(), MEMBER_PORT
FROM performance_schema.replication_group_members
WHERE MEMBER_ID =
(SELECT VARIABLE_VALUE from performance_schema.global_status
WHERE VARIABLE_NAME='group_replication_primary_member');"
2> /dev/null ;
sleep 1;
done
From the output you clearly see that GR takes five seconds to elect a new primary after the old primary was killed.
The client connections are simulated via router:
while /bin/true; do
mysql --silent -N -h 127.0.0.1 *-P 6446 *-uroot -proot -e
"*SELECT NOW(),@@port*" 2> /dev/null ;
sleep 1;
done
So I try a reconnect every second. From the output you see that the
reconnect hangs for 75 seconds (7 scripts running in parallel)
(Please ignore the date. My VM is "outdated". Clock is not synchronized
with host OS)
2017-03-12 13:28:48 4002
2017-03-12 13:28:49 4002
2017-03-12 13:28:50 4002 // this is when I did a kill -9 on node 4002, which was the primary
2017-03-12 13:30:05 4001
2017-03-12 13:30:06 4001
2017-03-12 13:30:08 4001
Remarkably there is one connection that succeeds in getting a connection
briefly after failover but one second later it hangs as well:
2017-03-12 13:28:46 4002
2017-03-12 13:28:48 4002
2017-03-12 13:28:49 4002
2017-03-12 13:28:50 4002 // here was the kill -9 on 4002
2017-03-12 13:28:58 4001 // failover happened after 8 seconds. Good.
2017-03-12 13:30:05 4001 // But the next connect was stalled for 67 seconds
2017-03-12 13:30:06 4001
Suggested fix:
I have no clue.
But I see no reason why router needs to wait after primary failover. A reconnect should be possible immediately after GR elected the new primary and set super_read_only=OFF on this new primary. Even if router is adding some extra timeout, this delay should be independent of the number of connection attempts.