Bug #84774 Performance drop every 60 seconds
Submitted: 1 Feb 2017 14:53    Modified: 2 Feb 2017 13:18
Reporter: Rene' Cannao'    Status: Verified
Category: MySQL Server: Group Replication    Severity: S5 (Performance)
Version: 5.7.17    OS: Any
Assigned to:    CPU Architecture: Any

[1 Feb 2017 14:53] Rene' Cannao'
Description:
While running a write-intensive workload on a Group Replication cluster, every 60 seconds throughput drops drastically, then immediately returns to normal.

How to repeat:
Set up a Group Replication cluster: the size doesn't seem to be relevant; it happens with both 3 and 5 nodes.
Run a write-intensive workload against a single writer using sysbench. Example:

./sysbench --num-threads=16 --max-time=300 --max-requests=0 --test=./lua/oltp_update_index.lua --mysql-user=sbtest --mysql-password=sbtest --mysql-host=10.1.2.42 --mysql-port=5717 --oltp-table-size=10000000 --oltp-tables-count=8 --report-interval=1 --oltp-auto-inc=off run

See attached output.

Suggested fix:
I haven't checked what triggers the bug, therefore I don't have a lot of context to suggest a fix.
Assuming that Group Replication performs some maintenance/housekeeping at a regular interval, this maintenance should be less aggressive.
[1 Feb 2017 14:55] Rene' Cannao'
sysbench output

Attachment: bug84774.txt (text/plain), 39.84 KiB.

[2 Feb 2017 13:18] MySQL Verification Team
Hello Rene,

Thank you for the report and feedback!

Thanks,
Umesh
[3 Feb 2017 11:38] Nuno Carvalho
Posted by developer:
 
Hi Rene,

Thank you for evaluating Group Replication, your feedback (and all community feedback) is important!

As you suggested, Group Replication performs maintenance at a regular interval, more precisely every 60 seconds.
Our performance results show this; see section 2.4, Stability over time, at
http://mysqlhighavailability.com/performance-evaluation-mysql-5-7-group-replication/

More precisely, every 60 seconds each member exchanges its set of applied transactions, and the intersection of these sets is used to garbage collect the certification info that each member maintains. On write-intensive workloads like yours, this operation can take longer than expected.
We have plans to improve this.
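
For reference, a minimal way to look at the applied transaction set that each member maintains (and which is exchanged for this purpose) is to query it from a regular client session on each member:

  -- each member's applied GTID set; the intersection of these sets across
  -- all members drives the certification info garbage collection
  SELECT @@GLOBAL.gtid_executed;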

Best regards,
Nuno Carvalho
[13 Jun 2018 17:10] Vadim Tkachenko
I have more severe stalls under a sysbench-tpcc workload.
I use Google Cloud n1-highmem-16 or 32 instances to repeat this problem.

Under sysbench-tpcc with 10 tables and scale 100, every 60 seconds I see a stall of 5-6 seconds (throughput drops to 0).

The problems are more severe when I restart the whole cluster (3 nodes). In this case the stalls progressively increase from 5 seconds to 40 seconds during a 10-hour run.

More details about my setup:
Terraform files to deploy Google Cloud nodes:
https://github.com/vadimtk/gce-ansible/blob/master/terraform/group-replication/main.tf

Ansible files to deploy Group Replication:
https://github.com/vadimtk/gce-ansible/tree/master/group-replication

An example of the stalls I observe can be found here:
https://raw.githubusercontent.com/Percona-Lab-results/201805-sysbench-tpcc-group-repl/mast...
[31 Jan 2019 17:42] Vinicius Malvestio Grippa
Same on MySQL 5.7.20:

The sysbench execution:

# Prepare

sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
  --mysql-socket=/tmp/mysql_sandbox49008.sock --mysql-db=test --range_size=100 \
  --table_size=10000 --tables=200 --threads=1 --events=0 --time=60 \
  --rand-type=uniform /usr/share/sysbench/oltp_read_only.lua prepare

# Execution

sysbench --db-driver=mysql --mysql-user=root --mysql-password=msandbox \
  --mysql-socket=/tmp/mysql_sandbox49008.sock --mysql-db=test --range_size=100 \
  --table_size=10000 --tables=200 --threads=5 --events=0 --time=6000 \
  --rand-type=uniform /usr/share/sysbench/oltp_read_write.lua --report-interval=1 run

We can see GR stalling for brief moments:

[ 65s ] thds: 5 tps: 386.92 qps: 7727.45 (r/w/o: 5411.92/585.40/1730.13) lat (ms,95%): 55.82 err/s: 0.00 reconn/s: 0.00
[ 66s ] thds: 5 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 67s ] thds: 5 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 68s ] thds: 5 tps: 25.00 qps: 489.01 (r/w/o: 340.01/33.00/116.00) lat (ms,95%): 2449.36 err/s: 0.00 reconn/s: 0.00
[ 69s ] thds: 5 tps: 226.93 qps: 4528.61 (r/w/o: 3165.03/312.90/1050.68) lat (ms,95%): 130.13 err/s: 0.00 reconn/s: 0.00
[ 121s ] thds: 5 tps: 212.03 qps: 4228.55 (r/w/o: 2961.39/300.04/967.13) lat (ms,95%): 12.52 err/s: 0.00 reconn/s: 0.00
[ 122s ] thds: 5 tps: 680.12 qps: 13592.38 (r/w/o: 9522.67/1018.18/3051.53) lat (ms,95%): 9.73 err/s: 0.00 reconn/s: 0.00
[ 123s ] thds: 5 tps: 0.00 qps: 28.00 (r/w/o: 8.00/10.00/10.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 124s ] thds: 5 tps: 19.00 qps: 310.00 (r/w/o: 216.00/22.00/72.00) lat (ms,95%): 1589.90 err/s: 0.00 reconn/s: 0.00
[ 167s ] thds: 5 tps: 406.17 qps: 8087.32 (r/w/o: 5662.33/611.25/1813.75) lat (ms,95%): 5.09 err/s: 0.00 reconn/s: 0.00
[ 168s ] thds: 5 tps: 0.00 qps: 0.00 (r/w/o: 0.00/0.00/0.00) lat (ms,95%): 0.00 err/s: 0.00 reconn/s: 0.00
[ 169s ] thds: 5 tps: 491.96 qps: 9857.14 (r/w/o: 6895.40/756.93/2204.81) lat (ms,95%): 9.73 err/s: 0.00 reconn/s: 0.00
[11 Oct 2021 13:35] Anibal Pinto
Posted by developer:
 
This isn't a bug but expected behavior of Group Replication.

## How the Group Replication certifier garbage collector works

When a transaction is executed on a Group Replication member, the certifier
checks whether its write set already exists in the certification info and, if
not, stores it.

Periodically, every 60 seconds, the GR members exchange their applied GTIDs
with all group members. The stable set, which is the intersection of these
applied GTID sets across all members, is then calculated.
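
For illustration, the stable set can be observed in the TRANSACTIONS_COMMITTED_ALL_MEMBERS column (a minimal query sketch; values depend on your group):

  -- GTID set committed on every member of the group: the stable set that
  -- the certification garbage collector works from
  SELECT MEMBER_ID, TRANSACTIONS_COMMITTED_ALL_MEMBERS
    FROM performance_schema.replication_group_member_stats;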

The certification garbage collector takes the stable set and checks whether the
write sets in the certification info are contained in it; those that are get
removed from the certification info.

While the garbage collector runs, certification of all transactions is put on
hold; only the execution stages before and after certification continue.

## Why there is a heavy drop in transactions per second

If the certification info starts to grow, so will the time needed to run the
garbage collector.

The certification info can grow due to a high transaction load on the group or
due to a small stable set relative to the transactions processed by the members
of the group.

The size of the certification info can be tracked in the
COUNT_TRANSACTIONS_ROWS_VALIDATING column of the
performance_schema.replication_group_member_stats table. The value can be
thought of as the current size of the conflict detection database against which
each transaction is certified.
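
For example, a minimal query to watch this value per member:

  -- current size of the conflict detection database (certification info)
  SELECT MEMBER_ID, COUNT_TRANSACTIONS_ROWS_VALIDATING
    FROM performance_schema.replication_group_member_stats;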

In this case we detected that the secondaries were not able to process
transactions as fast as the primary, so the group generated a small stable set.

This was observed as an increase in the values of the columns
COUNT_TRANSACTIONS_IN_QUEUE (the number of transactions in the queue pending
conflict detection checks) and COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE (the
number of transactions received from the replication group which are waiting to
be applied) in the performance_schema.replication_group_member_stats table.
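
For example, both queues can be monitored per member with a query sketch like:

  -- transactions waiting for certification and waiting to be applied
  SELECT MEMBER_ID,
         COUNT_TRANSACTIONS_IN_QUEUE,
         COUNT_TRANSACTIONS_REMOTE_IN_APPLIER_QUEUE
    FROM performance_schema.replication_group_member_stats;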

## Solution

Secondaries should keep only a small gap of transactions behind the primary;
flow control should be configured to restrain the number of transactions
waiting for certification and application, using the variables
**group_replication_flow_control_certifier_threshold** and
**group_replication_flow_control_applier_threshold**.

This will allow the group to have a larger stable set, which in turn allows
more write sets to be removed from the certification info on each iteration,
making the garbage collector more efficient.
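
For example (a sketch only; the threshold values below are illustrative and must be tuned to the workload, typically applied on every member):

  -- make flow control kick in earlier so the queues, and therefore the
  -- certification info, stay small
  SET GLOBAL group_replication_flow_control_mode = 'QUOTA';
  SET GLOBAL group_replication_flow_control_certifier_threshold = 1000;
  SET GLOBAL group_replication_flow_control_applier_threshold = 1000;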
[12 Sep 2022 20:56] Yves Trudeau
I understand these performance drops are a normal "behavior" of group replication, but it is a very annoying behavior for applications that have strict SLAs. Furthermore, it is a behavior that can easily be fixed. The garbage collector in the certifier has to clean up the certification info list while holding a lock on the list and a write lock on the gtids. With millions of trx in the certification info, no wonder there is a latency spike reaching a second or more.

I wrote a simple patch (against 8.0.30) that breaks this process into chunks and releases the locks during a short sleep time between chunks to allow some trx to interleave, lowering the latency and avoiding missed SLAs. Both the chunk size and the sleep time are dynamic MySQL variables, and the default values perform well with sysbench. The patch also increases the chunk size over time if it was initially set too low. With the default values, the top latencies are more than one order of magnitude smaller.
[12 Sep 2022 20:57] Yves Trudeau
Patch to the Certifier garbage collector to process in chunks

Attachment: patch_gr_cert_loop_on_8_0_30.diff (text/x-patch), 7.44 KiB.

[13 Sep 2022 17:20] Yves Trudeau
Just to further illustrate, I used a modified sysbench version that allows specifying multiple latency percentiles, including ones closer to 100% (sysbench is normally limited to 99%). Here's an example of a latency spike caused by the garbage collector:

[ 56s ] thds: 14 tps: 6629.23 qps: 6629.23 (r/w/o: 0.00/6629.23/0.00) lat (ms,99.50%): 4.74 lat (ms,99.90%): 7.04 lat (ms,99.99%): 11.24  err/s: 0.00 reconn/s: 0.00
[ 57s ] thds: 14 tps: 6506.56 qps: 6506.56 (r/w/o: 0.00/6506.56/0.00) lat (ms,99.50%): 5.67 lat (ms,99.90%): 7.43 lat (ms,99.99%): 9.22  err/s: 0.00 reconn/s: 0.00
[ 58s ] thds: 14 tps: 4806.52 qps: 4806.52 (r/w/o: 0.00/4806.52/0.00) lat (ms,99.50%): 6.43 lat (ms,99.90%): 248.83 lat (ms,99.99%): 248.83  err/s: 0.00 reconn/s: 0.00
[ 59s ] thds: 14 tps: 6389.67 qps: 6389.67 (r/w/o: 0.00/6389.67/0.00) lat (ms,99.50%): 5.09 lat (ms,99.90%): 5.77 lat (ms,99.99%): 9.06  err/s: 0.00 reconn/s: 0.00
[ 60s ] thds: 14 tps: 6587.52 qps: 6587.52 (r/w/o: 0.00/6587.52/0.00) lat (ms,99.50%): 4.82 lat (ms,99.90%): 6.21 lat (ms,99.99%): 6.55  err/s: 0.00 reconn/s: 0.00

It happens exactly every minute, when the garbage collection is running. As one can see, the 99.5% value is not really affected, but the 99.9% and 99.99% values clearly show the problem, with latencies around 30x larger.