Bug #79279 MySQLrouter fails to failover fabric connection under load
Submitted: 13 Nov 2015 20:09
Reporter: Mario Beck Email Updates:
Status: Open Impact on me:
None 
Category:MySQL Router Severity:S1 (Critical)
Version:2.0.2 OS:Linux (OL7)
Assigned to: CPU Architecture:Any
Tags: fabric, failover, mysqlrouter

[13 Nov 2015 20:09] Mario Beck
Description:
mysqlrouter routes connections correctly via a running fabric.
Once the master in the HA group fails, fabric will initiate a failover to a new master. The client connection breaks. The application is trying to reconnect via mysqlrouter but mysqlrouter fails to route the connection to the new master. Instead mysqlrouter spams its errorlogfile with:
2015-11-11 19:56:31 ERROR   [7f11142b0700] Too many open files
This message gets repeated until the logfile fills the filesystem (10GB/h, at least 70000 repeats per second).
Killing mysqlrouter and restarting works fine.

How to repeat:
Here is the mysqlrouter.cnf:
[DEFAULT]
logging_folder = /var/log/mysqlrouter/
plugin_folder = /usr/lib64/mysqlrouter
runtime_folder = /var/run/mysqlrouter
config_folder = /etc/mysqlrouter

[logger]
level = info

[keepalive]
interval = 60

[fabric_cache:mydemo]
address = 127.0.0.1
user = admin

[routing:myfabric]
destinations = fabric+cache://mydemo/group/mygroup
bind_port=3400
mode = read-write

Output of mysqlfabric -group lookup_servers mygroup after the failover:

                         server_uuid        address  status       mode weight
------------------------------------ -------------- ------- ---------- ------
eb5f95c2-885a-11e5-a1a8-080027003c09 localhost:3402 PRIMARY READ_WRITE    1.0
f79d3aff-885a-11e5-b185-080027003c09 localhost:3403  FAULTY READ_WRITE    1.0

lsof of routerprocess when it hangs:
COMMAND     PID USER   FD   TYPE    DEVICE SIZE/OFF      NODE NAME
mysqlrout 31626 root  cwd    DIR     252,0       90  68484819 /home/testy/fabric-5.7/scripts
mysqlrout 31626 root  rtd    DIR     252,0     4096       128 /
mysqlrout 31626 root  txt    REG     252,0    11504  71543950 /usr/sbin/mysqlrouter
mysqlrout 31626 root  mem    REG     252,0    58288  67201308 /usr/lib64/libnss_files-2.17.so
mysqlrout 31626 root  mem    REG     252,0   144208  71543949 /usr/lib64/mysqlrouter/routing.so
mysqlrout 31626 root  mem    REG     252,0    11528  71543947 /usr/lib64/mysqlrouter/keepalive.so
mysqlrout 31626 root  mem    REG     252,0    28360  71543948 /usr/lib64/mysqlrouter/logger.so
mysqlrout 31626 root  mem    REG     252,0  4505032  71543946 /usr/lib64/mysqlrouter/fabric_cache.so
mysqlrout 31626 root  mem    REG     252,0  2099360  67201290 /usr/lib64/libc-2.17.so
mysqlrout 31626 root  mem    REG     252,0    88720  67165000 /usr/lib64/libgcc_s-4.8.3-20140911.so.1
mysqlrout 31626 root  mem    REG     252,0  1141552  67201298 /usr/lib64/libm-2.17.so
mysqlrout 31626 root  mem    REG     252,0   991744  67214100 /usr/lib64/libstdc++.so.6.0.19
mysqlrout 31626 root  mem    REG     252,0   141616  67213764 /usr/lib64/libpthread-2.17.so
mysqlrout 31626 root  mem    REG     252,0   128072  71543942 /usr/lib64/libmysqlharness.so.0
mysqlrout 31626 root  mem    REG     252,0   165128  71543944 /usr/lib64/libmysqlrouter.so.1
mysqlrout 31626 root  mem    REG     252,0    19512  67201296 /usr/lib64/libdl-2.17.so
mysqlrout 31626 root  mem    REG     252,0   164336  67201283 /usr/lib64/ld-2.17.so
mysqlrout 31626 root    0u   CHR     136,1      0t0         4 /dev/pts/1
mysqlrout 31626 root    1u   CHR     136,1      0t0         4 /dev/pts/1
mysqlrout 31626 root    2u   CHR     136,1      0t0         4 /dev/pts/1
mysqlrout 31626 root    3w   REG     252,0 53164354   2325328 /var/log/mysqlrouter/mysqlrouter.log
mysqlrout 31626 root    4u  IPv4 254925267      0t0       TCP localhost:csms2 (LISTEN)
mysqlrout 31626 root    5u  IPv4 254983679      0t0       TCP localhost:60718->localhost:32275 (ESTABLISHED)
mysqlrout 31626 root    6u  sock       0,7      0t0 254986341 protocol: TCPv6
mysqlrout 31626 root    7u  sock       0,7      0t0 254986030 protocol: TCPv6
...
mysqlrout 31626 root 1023u  sock       0,7      0t0 254986862 protocol: TCPv6

There are more than 1000 IPv6 sockets opened. But I don't use IPv6.
My application is Java, when the connections breaks it will automatically reconnect via JDBC.

In 10% of the cases failover works fine with router. In most cases mysqlrouter will die.

Suggested fix:
Seems like during reconnect phase with master not yet available in fabric there are ressources not freed again. But pure speculation.

I expect failover to work.
I also suggest to reduce error log entries in the usual way:

2015-11-11 19:56:31 ERROR   [7f11142b0700] Too many open files
Last message repeated 10000 times