Description:
On a fresh install of MySQL 8.0.15 (on both mysqld and mysqlrouter) in a 3 node InnoDB Cluster, with no clients running through the router, we're seeing mysqlrouter consuming on average 5-6% usr CPU from a core. I asked about this in the slack community channel and was told to create this bug report because the router process shouldn't be consuming that much.
Running strace on the process just hangs on a futex forever:
[root@prod-int-mysql1 tmp]# strace -p 11668
strace: Process 11668 attached
futex(0x7f96f9e46704, FUTEX_WAIT_PRIVATE, 1, NULL^Cstrace: Process 11668 detached
<detached ...>
If we instead check the process with perf we can see that it's 100% spinning in the MDC Refresh thread:
Samples: 3K of event 'cpu-clock', Event count (approx.): 3435000000
Children Self Command Shared Object Symbol
+ 10.71% 0.00% MDC Refresh [unknown] [.] 0000000000000000
+ 5.85% 0.00% MDC Refresh [unknown] [.] 0xf553476e57130187
+ 5.79% 0.00% MDC Refresh [unknown] [.] 0xa6065aa6f07a0015
+ 5.30% 0.00% MDC Refresh [unknown] [.] 0xe626831f55cd1b57
+ 5.24% 0.00% MDC Refresh [unknown] [.] 0xfcca01e81404f2cd
+ 4.54% 4.54% MDC Refresh libc-2.17.so [.] malloc
+ 4.37% 4.37% MDC Refresh libc-2.17.so [.] _int_malloc
+ 4.02% 4.02% MDC Refresh libc-2.17.so [.] _int_free
+ 3.49% 3.49% MDC Refresh libcrypto.so.1.0.2k [.] EVP_DecodeUpdate
Processes running on the machine:
top - 08:42:25 up 25 days, 16:03, 2 users, load average: 0.09, 0.11, 0.13
Tasks: 161 total, 2 running, 159 sleeping, 0 stopped, 0 zombie
%Cpu0 : 1.7 us, 0.3 sy, 0.0 ni, 98.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu1 : 0.0 us, 0.0 sy, 0.0 ni,100.0 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu2 : 3.3 us, 0.0 sy, 0.0 ni, 96.7 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
%Cpu3 : 3.3 us, 0.3 sy, 0.0 ni, 96.3 id, 0.0 wa, 0.0 hi, 0.0 si, 0.0 st
GiB Mem : 23.3 total, 11.9 free, 2.9 used, 8.5 buff/cache
GiB Swap: 4.0 total, 4.0 free, 0.0 used. 20.2 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
17229 mysql 20 0 20.3g 2.4g 32.3m S 5.0 10.4 787:06.85 /usr/sbin/mysqld
497 root 20 0 182.4m 122.4m 122.0m S 0.0 0.5 0:41.33 /usr/lib/systemd/systemd-journald
975 root 20 0 420.9m 107.3m 101.8m S 0.0 0.5 3:01.30 /usr/sbin/rsyslogd -n
20962 telegraf 20 0 1193.3m 61.8m 32.4m S 0.0 0.3 38:33.62 /usr/bin/telegraf -config /etc/telegraf/telegraf+
967 root 20 0 294.9m 54.9m 15.2m S 0.0 0.2 1:37.38 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/+
730 root 20 0 283.8m 33.5m 32.2m S 0.0 0.1 0:54.77 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --log+
968 root 20 0 560.4m 23.2m 10.5m S 0.0 0.1 4:50.66 /usr/bin/python -Es /usr/sbin/tuned -l -P
705 root 20 0 306.2m 21.3m 13.1m S 0.0 0.1 0:53.97 /usr/libexec/sssd/sssd_be --domain driftldap --u+
692 polkitd 20 0 537.9m 17.2m 8.8m S 0.0 0.1 0:14.22 /usr/lib/polkit-1/polkitd --no-debug
962 root 20 0 555.7m 16.1m 12.4m S 0.0 0.1 0:09.66 /opt/puppetlabs/puppet/bin/pxp-agent --foreground
1583 root 20 0 218.7m 15.8m 11.3m S 0.0 0.1 20:22.25 /usr/sbin/snmpd -LS0-6d -f
2854 root 20 0 812.3m 13.7m 5.9m S 0.0 0.1 2:13.35 /usr/sbin/nsrexecd
694 root 20 0 465.0m 13.3m 11.1m S 0.0 0.1 2:00.60 /usr/sbin/NetworkManager --no-daemon
11668 mysqlro+ 20 0 502.3m 12.9m 8.8m S 5.0 0.1 440:42.90 /usr/bin/mysqlrouter -c /etc/mysqlrouter/mysqlro+
The only queries running in MySQL is from the router and a per 10 second basic check from telegraf (not running through the router):
mysql> SELECT substring(QUERY_SAMPLE_TEXT,1,70),COUNT_STAR, ((AVG_TIMER_WAIT/1000000000) / COUNT_STAR) as avg_time, (SUM_ROWS_AFFECTED/COUNT_STAR) as rows_affect, (SUM_ROWS_SENT/COUNT_STAR) as rows_sent, (SUM_ROWS_EXAMINED/COUNT_STAR) as rows_exam from performance_schema.events_statements_summary_by_digest order by COUNT_STAR desc limit 10\G
*************************** 1. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): show status like 'group_replication_primary_member'
COUNT_STAR: 1914030
avg_time: 0.00000070
rows_affect: 0.0000
rows_sent: 1.0000
rows_exam: 889.9995
*************************** 2. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT member_id, member_host, member_port, member_state, @@group_repl
COUNT_STAR: 1914027
avg_time: 0.00000013
rows_affect: 0.0000
rows_sent: 3.0000
rows_exam: 3.0000
*************************** 3. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT R.replicaset_name, I.mysql_server_uuid, I.role, I.weight, I.ver
COUNT_STAR: 1914011
avg_time: 0.00000031
rows_affect: 0.0000
rows_sent: 3.0000
rows_exam: 7.0001
*************************** 4. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT * FROM mysql_innodb_cluster_metadata.schema_version
COUNT_STAR: 1913997
avg_time: 0.00000017
rows_affect: 0.0000
rows_sent: 1.0000
rows_exam: 1.0000
*************************** 5. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SHOW GLOBAL STATUS
COUNT_STAR: 68246
avg_time: 0.00003032
rows_affect: 0.0000
rows_sent: 442.0000
rows_exam: 884.0000
*************************** 6. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SHOW VARIABLES
COUNT_STAR: 567
avg_time: 0.00446026
rows_affect: 0.0000
rows_sent: 607.0000
rows_exam: 1214.0000
*************************** 7. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT COUNT(*) FROM information_schema.tables WHERE engine = 'MyISAM'
COUNT_STAR: 379
avg_time: 0.00535142
rows_affect: 0.0000
rows_sent: 0.4987
rows_exam: 228.9077
*************************** 8. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SET SESSION wait_timeout=2147483
COUNT_STAR: 378
avg_time: 0.00025642
rows_affect: 0.0000
rows_sent: 0.0000
rows_exam: 0.0000
*************************** 9. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT VERSION()
COUNT_STAR: 378
avg_time: 0.00019759
rows_affect: 0.0000
rows_sent: 1.0000
rows_exam: 0.0000
*************************** 10. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT UUID()
COUNT_STAR: 190
avg_time: 0.00086334
rows_affect: 0.0000
rows_sent: 1.0000
rows_exam: 0.0000
10 rows in set (0.01 sec)
The router queries are really running frequently, but as each router (3 are running at the moment) are asking every node in the Cluster I guess it adds up.
How to repeat:
Install a 3 node InnoDB Cluster 8.0.15 with a centralized router setup (1 router on each DB node), on Oracle Linux 7.5 servers running kernel 4.1.12-124.18.6.el7uek.x86_64.
Without any active clients the router takes up 5-6% CPU usr on average (spiking up to ~8%).
Suggested fix:
Check if the MDC Refresh can be optimized, not sure here but it seems to me that the inner while loop is getting triggered a lot more frequent than the code says that it should?