Bug #85034 Memory leak in GTID replication (crash bug)
Submitted: 17 Feb 2017 1:02 Modified: 2 Mar 2017 13:13
Reporter: Trey Raymond Email Updates:
Status: Duplicate Impact on me:
Category:MySQL Server: Replication Severity:S2 (Serious)
Version:5.7.17 OS:Red Hat (6.8)
Assigned to: CPU Architecture:Any

[17 Feb 2017 1:02] Trey Raymond
The leak is in the processing of GTIDs by the slave sql thread.  When it comes across a GTID that has already been executed, it fails to release memory.

Consider (well, build) a test cluster with four hosts, all using GTID replication.
A and B are masters.  A is active (single writer, read_only=0) and B is the backup master (read_only=1).  They replicate from each other.  Log_slave_updates is on.
C and D are slaves.  C is an msr slave, replicating over two channels from both A and B.  D is a traditional slave, only replicating from A - that's our control.
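For clarity, the msr slave's channel setup can be sketched roughly like this (host names, the replication user, and passwords are placeholders, not details from this report):

```sql
-- On slave C: replicate from both masters over named channels (MySQL 5.7
-- multi-source replication). Requires master_info_repository=TABLE and
-- relay_log_info_repository=TABLE, plus GTID auto-positioning.
CHANGE MASTER TO
  MASTER_HOST = 'hostA', MASTER_USER = 'repl', MASTER_PASSWORD = '...',
  MASTER_AUTO_POSITION = 1
  FOR CHANNEL 'from_a';

CHANGE MASTER TO
  MASTER_HOST = 'hostB', MASTER_USER = 'repl', MASTER_PASSWORD = '...',
  MASTER_AUTO_POSITION = 1
  FOR CHANNEL 'from_b';

START SLAVE FOR CHANNEL 'from_a';
START SLAVE FOR CHANNEL 'from_b';
```

Slave D would use a single unnamed channel pointed at A only.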
Keep any custom configs pretty minimal.  This was tested with both single- and multi-threaded slaves and happens with both (at the coordinator side in mts, I'd assume); use single-threaded to keep it simple.
Start significant write load on A.  I'm using some old retired hosts as a testbed here: 12-core/24-thread CPU, 48G RAM, 4x15k SAS array.  For a write load on that hardware, just use sysbench: one 1M-row table, 16 threads in r/w mode, default transaction size.  No traffic is hitting any other hosts - all connections go to A.  That ends up being about 1100 tx/sec and 4400 row writes/sec.  In terms of binary log size, about 1.6 GB/min.

Lifetime of a transaction in this environment:
- initially committed on A, and sent to B, C, and D through replication.
- C and D commit and do nothing.  B commits and sends it back to A, as well as to our multi-source slave C.
- A and C see that the transaction is already in the executed set and skip redoing it (i.e. gtid serverA:500 where the slave's gtid executed set is serverA:1-550)
note: the order to C doesn't matter, if it gets to C from B first, it executes that, and skips the one from A once that comes up.
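The skip decision described above can be observed from SQL with the built-in GTID_SUBSET() function (the UUID below is a made-up stand-in for serverA's server_uuid):

```sql
-- A transaction is skipped when its GTID is already contained in the
-- slave's executed set. GTID_SUBSET(set1, set2) returns 1 if every GTID
-- in set1 is also in set2.
SELECT GTID_SUBSET('11111111-1111-1111-1111-111111111111:500',
                   '11111111-1111-1111-1111-111111111111:1-550')
       AS already_executed;
-- already_executed = 1: serverA:500 lies inside serverA:1-550, so the
-- applier discards the event instead of re-executing it.
```

In practice the second argument would be @@GLOBAL.gtid_executed on the receiving server.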

So far so good - the transaction gets everywhere it needs to be.  The problem is that A and C will both grow in memory use until they fail!  B and D are fine.
This bug rears its head in two different forms based on the server setup.  A isn't the same as C, but both will grow to failure.
On A, the active memory growth rate under that transaction load, for just mysqld (excluding other procs and fs cache), is around 300MB/min.  It keeps growing until all free memory and swap space are filled and the server dies.  Taking periodic samples from performance schema, I see "memory/sql/thd::main_mem_root" total bytes growing by about 265M/minute.  If I stop the sql thread on this host (which isn't executing any changes anyway - it's just the backup channel for the circular setup), the memory is all successfully freed, and the server continues to execute its changes in good health.
On C, the memory growth rate is about half that of A - around 150MB/min.  It will still grow indefinitely, albeit more slowly.  On this host, in performance schema, the allocation is to "memory/sql/Log_event" - and samples over time show it growing at about 135MB/min, again correlating with what the OS reports.  Stopping the sql threads on this host, however, does not free the memory - usage stops growing, but what was already lost is never reclaimed.
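The periodic performance schema sampling above can be done with a query along these lines (note that in 5.7 most memory/% instruments are disabled by default and may need enabling in setup_instruments first; enabling at runtime only counts allocations made after that point):

```sql
-- Sample the two instruments implicated in this report; run this
-- periodically and diff the byte counts to get a growth rate.
SELECT event_name,
       current_number_of_bytes_used,
       high_number_of_bytes_used
FROM performance_schema.memory_summary_global_by_event_name
WHERE event_name IN ('memory/sql/thd::main_mem_root',
                     'memory/sql/Log_event')
ORDER BY current_number_of_bytes_used DESC;
```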

Another test I did was adding a third master to the msr setup - the growth rate on hosts receiving each transaction three times (and thus skipping two) was double that of hosts which only received the transaction twice (and skipped one).  So it directly correlates with the already-executed gtid skips.  I'd recommend trying this setup as well: 3 masters, each replicating from both of the others, and two slaves also using msr (they can be pointed around for testing, e.g. one at all 3 and one at only two).

MySQL: 5.7.17
OS: rhel 6.8
Kernel: 2.6.32

How to repeat:
Reproduce the test case explained in the bug description above.

Suggested fix:
Free memory properly!
[17 Feb 2017 1:17] Trey Raymond
FYI, you can get replication running on the described setup with gtid and log_slave_updates both off, and memory will not be an issue - but that drops functionality and a level of redundancy.

We could do more interesting test cases around log_slave_updates to control how many copies of a tx went to each place, but that's blocked by https://bugs.mysql.com/bug.php?id=84973
[20 Feb 2017 21:01] Trey Raymond
If we get rid of the slaves and msr and look at the problem on the master only: setting relay_log_info_repository=FILE (with sync_relay_log_info=1) lets replication run without runaway memory use, though of course more slowly - and features such as msr, as well as monitoring and many other things dependent on TABLE, become unusable.  Hopefully that will help figure out where to look - something in the population of the slave_relay_log_info table seems to be causing the issue.
[21 Feb 2017 19:40] Trey Raymond
This has been determined to be two separate but related issues.

On the master (server A in the example), the problem occurs when sync_relay_log_info=1, even though relay_log_info_repository=TABLE.  That variable should be ignored unless the repository is FILE; somewhere in the code, the paths get crossed.

On the msr slave (server C), the memory usage comes from processing logs created with binlog_rows_query_log_events enabled.  Disabling that on the masters stops the leak on the slaves.  It's apparently loading the statement text into memory and not releasing it when the gtid is skipped.
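One way to express the two mitigations described above as server settings (a sketch; these can also go in my.cnf so they survive a restart):

```sql
-- First issue (master A): stop driving the FILE-repository sync path
-- while relay_log_info_repository=TABLE is in use.
SET GLOBAL sync_relay_log_info = 0;

-- Second issue (masters, affecting msr slave C): stop writing
-- Rows_query log events into the binary log.
SET GLOBAL binlog_rows_query_log_events = OFF;
```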

Big thanks to Xiang Rao and Przemek Malkowski for their help testing this over the weekend.
[23 Feb 2017 18:51] Trey Raymond
Downgrading this to S2 - while it's serious, the workaround mentioned above carries only a minor loss of functionality.
[2 Mar 2017 13:13] Umesh Shastry
Hello Trey Raymond,

Thank you for the report.
It looks like I'll have to mark this older bug as a duplicate of Przemyslaw's newer Bug #85251, which I confirmed today and which has the same issue as reported here.