MySQL Bugs: #95080: MySQL Router consuming mor CPU resources than it should

Bug #95080	MySQL Router consuming mor CPU resources than it should
Submitted:	23 Apr 2019 7:24	Modified:	30 Apr 2019 8:31
Reporter:	Dennis Sehalic	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Router	Severity:	S5 (Performance)
Version:	8.0.15, 8.0.16	OS:	Oracle Linux (Oracle Linux Server 7.5, Linux 4.1.12-124.18.6.el7uek.x86_64 )
Assigned to:		CPU Architecture:	x86 (Intel(R) Xeon(R) CPU E5-2699 v3 @ 2.30GHz)
Tags:	MySQL Router, performance

Description:
On a fresh install of MySQL 8.0.15 (on both mysqld and mysqlrouter) in a 3 node InnoDB Cluster, with no clients running through the router, we're seeing mysqlrouter consuming on average 5-6% usr CPU from a core. I asked about this in the slack community channel and was told to create this bug report because the router process shouldn't be consuming that much.

Running strace on the process just hangs on a futex forever:

[root@prod-int-mysql1 tmp]# strace -p 11668
strace: Process 11668 attached
futex(0x7f96f9e46704, FUTEX_WAIT_PRIVATE, 1, NULL^Cstrace: Process 11668 detached
 <detached ...>

If we instead check the process with perf we can see that it's 100% spinning in the MDC Refresh thread:

Samples: 3K of event 'cpu-clock', Event count (approx.): 3435000000
  Children      Self  Command      Shared Object         Symbol
+   10.71%     0.00%  MDC Refresh  [unknown]             [.] 0000000000000000
+    5.85%     0.00%  MDC Refresh  [unknown]             [.] 0xf553476e57130187
+    5.79%     0.00%  MDC Refresh  [unknown]             [.] 0xa6065aa6f07a0015
+    5.30%     0.00%  MDC Refresh  [unknown]             [.] 0xe626831f55cd1b57
+    5.24%     0.00%  MDC Refresh  [unknown]             [.] 0xfcca01e81404f2cd
+    4.54%     4.54%  MDC Refresh  libc-2.17.so          [.] malloc
+    4.37%     4.37%  MDC Refresh  libc-2.17.so          [.] _int_malloc
+    4.02%     4.02%  MDC Refresh  libc-2.17.so          [.] _int_free
+    3.49%     3.49%  MDC Refresh  libcrypto.so.1.0.2k   [.] EVP_DecodeUpdate

Processes running on the machine:

top - 08:42:25 up 25 days, 16:03,  2 users,  load average: 0.09, 0.11, 0.13
Tasks: 161 total,   2 running, 159 sleeping,   0 stopped,   0 zombie
%Cpu0  :  1.7 us,  0.3 sy,  0.0 ni, 98.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  0.0 us,  0.0 sy,  0.0 ni,100.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  3.3 us,  0.0 sy,  0.0 ni, 96.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  3.3 us,  0.3 sy,  0.0 ni, 96.3 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
GiB Mem :     23.3 total,     11.9 free,      2.9 used,      8.5 buff/cache
GiB Swap:      4.0 total,      4.0 free,      0.0 used.     20.2 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
17229 mysql     20   0   20.3g   2.4g  32.3m S   5.0 10.4 787:06.85 /usr/sbin/mysqld
  497 root      20   0  182.4m 122.4m 122.0m S   0.0  0.5   0:41.33 /usr/lib/systemd/systemd-journald
  975 root      20   0  420.9m 107.3m 101.8m S   0.0  0.5   3:01.30 /usr/sbin/rsyslogd -n
20962 telegraf  20   0 1193.3m  61.8m  32.4m S   0.0  0.3  38:33.62 /usr/bin/telegraf -config /etc/telegraf/telegraf+
  967 root      20   0  294.9m  54.9m  15.2m S   0.0  0.2   1:37.38 /opt/puppetlabs/puppet/bin/ruby /opt/puppetlabs/+
  730 root      20   0  283.8m  33.5m  32.2m S   0.0  0.1   0:54.77 /usr/libexec/sssd/sssd_nss --uid 0 --gid 0 --log+
  968 root      20   0  560.4m  23.2m  10.5m S   0.0  0.1   4:50.66 /usr/bin/python -Es /usr/sbin/tuned -l -P
  705 root      20   0  306.2m  21.3m  13.1m S   0.0  0.1   0:53.97 /usr/libexec/sssd/sssd_be --domain driftldap --u+
  692 polkitd   20   0  537.9m  17.2m   8.8m S   0.0  0.1   0:14.22 /usr/lib/polkit-1/polkitd --no-debug
  962 root      20   0  555.7m  16.1m  12.4m S   0.0  0.1   0:09.66 /opt/puppetlabs/puppet/bin/pxp-agent --foreground
 1583 root      20   0  218.7m  15.8m  11.3m S   0.0  0.1  20:22.25 /usr/sbin/snmpd -LS0-6d -f
 2854 root      20   0  812.3m  13.7m   5.9m S   0.0  0.1   2:13.35 /usr/sbin/nsrexecd
  694 root      20   0  465.0m  13.3m  11.1m S   0.0  0.1   2:00.60 /usr/sbin/NetworkManager --no-daemon
11668 mysqlro+  20   0  502.3m  12.9m   8.8m S   5.0  0.1 440:42.90 /usr/bin/mysqlrouter -c /etc/mysqlrouter/mysqlro+

The only queries running in MySQL is from the router and a per 10 second basic check from telegraf (not running through the router):

mysql> SELECT substring(QUERY_SAMPLE_TEXT,1,70),COUNT_STAR, ((AVG_TIMER_WAIT/1000000000) / COUNT_STAR) as avg_time, (SUM_ROWS_AFFECTED/COUNT_STAR) as rows_affect, (SUM_ROWS_SENT/COUNT_STAR) as rows_sent, (SUM_ROWS_EXAMINED/COUNT_STAR) as rows_exam from performance_schema.events_statements_summary_by_digest order by COUNT_STAR desc limit 10\G
*************************** 1. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): show status like 'group_replication_primary_member'
                       COUNT_STAR: 1914030
                         avg_time: 0.00000070
                      rows_affect: 0.0000
                        rows_sent: 1.0000
                        rows_exam: 889.9995
*************************** 2. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT member_id, member_host, member_port, member_state, @@group_repl
                       COUNT_STAR: 1914027
                         avg_time: 0.00000013
                      rows_affect: 0.0000
                        rows_sent: 3.0000
                        rows_exam: 3.0000
*************************** 3. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT R.replicaset_name, I.mysql_server_uuid, I.role, I.weight, I.ver
                       COUNT_STAR: 1914011
                         avg_time: 0.00000031
                      rows_affect: 0.0000
                        rows_sent: 3.0000
                        rows_exam: 7.0001
*************************** 4. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT * FROM mysql_innodb_cluster_metadata.schema_version
                       COUNT_STAR: 1913997
                         avg_time: 0.00000017
                      rows_affect: 0.0000
                        rows_sent: 1.0000
                        rows_exam: 1.0000
*************************** 5. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SHOW GLOBAL STATUS
                       COUNT_STAR: 68246
                         avg_time: 0.00003032
                      rows_affect: 0.0000
                        rows_sent: 442.0000
                        rows_exam: 884.0000
*************************** 6. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SHOW VARIABLES
                       COUNT_STAR: 567
                         avg_time: 0.00446026
                      rows_affect: 0.0000
                        rows_sent: 607.0000
                        rows_exam: 1214.0000
*************************** 7. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT COUNT(*) FROM information_schema.tables WHERE engine = 'MyISAM'
                       COUNT_STAR: 379
                         avg_time: 0.00535142
                      rows_affect: 0.0000
                        rows_sent: 0.4987
                        rows_exam: 228.9077
*************************** 8. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SET SESSION wait_timeout=2147483
                       COUNT_STAR: 378
                         avg_time: 0.00025642
                      rows_affect: 0.0000
                        rows_sent: 0.0000
                        rows_exam: 0.0000
*************************** 9. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT VERSION()
                       COUNT_STAR: 378
                         avg_time: 0.00019759
                      rows_affect: 0.0000
                        rows_sent: 1.0000
                        rows_exam: 0.0000
*************************** 10. row ***************************
substring(QUERY_SAMPLE_TEXT,1,70): SELECT UUID()
                       COUNT_STAR: 190
                         avg_time: 0.00086334
                      rows_affect: 0.0000
                        rows_sent: 1.0000
                        rows_exam: 0.0000
10 rows in set (0.01 sec)

The router queries are really running frequently, but as each router (3 are running at the moment) are asking every node in the Cluster I guess it adds up.

How to repeat:
Install a 3 node InnoDB Cluster 8.0.15 with a centralized router setup (1 router on each DB node), on Oracle Linux 7.5 servers running kernel 4.1.12-124.18.6.el7uek.x86_64. 

Without any active clients the router takes up 5-6% CPU usr on average (spiking up to ~8%).

Suggested fix:
Check if the MDC Refresh can be optimized, not sure here but it seems to me that the inner while loop is getting triggered a lot more frequent than the code says that it should?

Sorry, forgot the router config:

# File automatically generated during MySQL Router bootstrap
[DEFAULT]
name=system
user=mysqlrouter
keyring_path=/var/lib/mysqlrouter/keyring
master_key_path=/etc/mysqlrouter/mysqlrouter.key
connect_timeout=15
read_timeout=30
dynamic_state=/var/lib/mysqlrouter/state.json

[logger]
level = INFO

[metadata_cache:...]
router_id=...
user=...
metadata_cluster=...
ttl=0.5

Hello Dennis Sehalic,

Thank you for the report and feedback.

regards,
Umesh

- On a huge box with 32 cores, 256G RAM - observed that mysqlrouter consuming 6.7% once in a while and otherwise switching between 2.0%, 2.3% and 2.7% on an idle cluster environment
- No load or activity on the cluster nodes or on mysqlrouter

test results

Attachment: 95080_8.0.15.results (application/octet-stream, text), 18.26 KiB.

Could you please elaborate a bit? Could you reproduce it? Do you know the cause?

Br
Dennis

Sorry, the 2 comments from 8:34 and 8:35 took several hours to show up hehe, so my previous comment is a bit misplaced

Do you know the cause why it's taking up so many resources?

BR
Dennis