Bug #75570 semi-sync replication performance degrades with a high number of threads
Submitted: 21 Jan 2015 7:52  Modified: 27 Apr 2015 8:27
Reporter: Rene' Cannao'
Status: Closed
Category: MySQL Server: Replication  Severity: S2 (Serious)
Version: 5.6.20  OS: Linux
Assigned to:  CPU Architecture: Any

[21 Jan 2015 7:52] Rene' Cannao'
Description:

Based on our experience (issues in production, easily reproducible in a testing environment), when semi-sync replication is enabled mysqld is able to handle around 6k TRX/s while the number of threads running is relatively low (100-200 connections), but throughput degrades when the number of threads running goes beyond a certain threshold. With 3000 threads running, throughput is no more than 300 TRX/s.

While at a low number of threads running the bottleneck seems to be the network RTT, at a high number of threads network activity drops and we also notice a constantly high number of context switches.

Combining the output of pt-pmp (the relevant part is listed below, I think) with the semi-sync source code, we believe there is high contention on LOCK_binlog_, which is taken to compare the binlog coordinates of the ACK from the slave(s) with the binlog coordinates of each thread that issued a commit, together with what seems to be an inefficient way of waking up threads.

   2459 pthread_cond_timedwait,ReplSemiSyncMaster::cond_timewait(semisync_master.so),ReplSemiSyncMaster::commitTrx(semisync_master.so),Trans_delegate::after_commit,MYSQL_BIN_LOG::process_after_commit_stage_queue(binlog.cc:6833),MYSQL_BIN_LOG::ordered_commit(binlog.cc:7217),MYSQL_BIN_LOG::commit(binlog.cc:6609),ha_commit_trans(handler.cc:1500),trans_commit,mysql_execute_command(sql_parse.cc:4711),mysql_parse(sql_parse.cc:6744),dispatch_command(sql_parse.cc:1432),do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6),??
    541 pthread_cond_wait,Stage_manager::enroll_for(mysql_thread.h:1162),MYSQL_BIN_LOG::ordered_commit(binlog.cc:6897),MYSQL_BIN_LOG::commit(binlog.cc:6609),ha_commit_trans(handler.cc:1500),trans_commit,mysql_execute_command(sql_parse.cc:4711),mysql_parse(sql_parse.cc:6744),dispatch_command(sql_parse.cc:1432),do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6),??
...
      1 __lll_lock_wait(libpthread.so.0),_L_lock_1006(libpthread.so.0),pthread_mutex_lock(libpthread.so.0),ReplSemiSyncMaster::lock(semisync_master.so),ReplSemiSyncMaster::updateSyncHeader(semisync_master.so),Binlog_transmit_delegate::before_send_event,mysql_binlog_send(rpl_master.cc:1325),com_binlog_dump(rpl_master.cc:767),dispatch_command(sql_parse.cc:1644),do_handle_one_connection,handle_one_connection,pfs_spawn_thread,start_thread(libpthread.so.0),clone(libc.so.6),??

Each transaction thread waiting for an ACK does the following (simplified):
- hold a lock;
- in a loop ( while (is_on()) ):
  - compare the ACK position with its own position;
  - if the ACK is still behind, wait on a condition variable

In reportReplyBinlog, if there is at least one thread waiting for an ACK, it sends a broadcast to all the waiting threads (see the sketch below).
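
To make the pattern concrete, here is a minimal standalone C++/pthreads sketch of the wait/broadcast scheme described above; the variable and function names (acked_pos, wait_for_ack, report_reply, ...) are illustrative stand-ins for the plugin's LOCK_binlog_, COND_binlog_send_, commitTrx and reportReplyBinlog, not the actual code:

#include <pthread.h>
#include <time.h>

/* Illustrative stand-ins for the plugin's internal state. */
static pthread_mutex_t LOCK_binlog      = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  COND_binlog_send = PTHREAD_COND_INITIALIZER;
static unsigned long long acked_pos = 0;   /* highest binlog position ACKed by a slave */
static bool semi_sync_on = true;

/* Committing session thread (cf. ReplSemiSyncMaster::commitTrx). */
void wait_for_ack(unsigned long long my_pos, const struct timespec *deadline)
{
  pthread_mutex_lock(&LOCK_binlog);
  while (semi_sync_on && acked_pos < my_pos) {
    /* Every waiter sleeps on the same, single condition variable. */
    if (pthread_cond_timedwait(&COND_binlog_send, &LOCK_binlog, deadline) != 0)
      break;                               /* timeout: give up waiting */
  }
  pthread_mutex_unlock(&LOCK_binlog);
}

/* ACK receiver thread (cf. ReplSemiSyncMaster::reportReplyBinlog). */
void report_reply(unsigned long long pos)
{
  pthread_mutex_lock(&LOCK_binlog);
  if (pos > acked_pos)
    acked_pos = pos;
  /* Wakes *all* waiters, including those whose position is still ahead of pos
     and which will simply go back to sleep. */
  pthread_cond_broadcast(&COND_binlog_send);
  pthread_mutex_unlock(&LOCK_binlog);
}

With a single condition variable, every ACK wakes every waiter, which is the behavior visible in the pt-pmp output above.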

This also creates contention on LOCK_log in MYSQL_BIN_LOG::ordered_commit.

What seems to be a design flaw is that all the threads wake up: while with few threads running it is likely that they will all return to the application, with a lot of threads running perhaps only a few will return to the application and most of them will go back to waiting on the same condition variable. This creates a lot of context switches, and the CPU gets so busy performing them that it is not able to make much progress with replication.

Attached is the output of pt-pmp resulting from the following command:
pt-pmp --iterations=1 --save-samples=/tmp/samples.pmp > /tmp/output.pmp
When this was executed, mysqld was processing the traffic generated by sysbench in a write-intensive workload running 3000 threads.

How to repeat:
Setup semi-sync replication.

Run a write-intensive workload against the master with a few threads. Ex:
sysbench --max-requests=0 --max-time=7200 --test=oltp --mysql-user=rcannao --mysql-password=rcannao --mysql-db=rcannao --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=30 run

Run a write-intensive workload against the master with a lot of threads. Ex:
sysbench --max-requests=0 --max-time=7200 --test=oltp --mysql-user=rcannao --mysql-password=rcannao --mysql-db=rcannao --db-driver=mysql --db-ps-mode=disable --oltp-table-size=10000000 --oltp-point-selects=1 --oltp-index-updates=0 --oltp-simple-ranges=0 --oltp-sum-ranges=0 --oltp-order-ranges=0 --oltp-distinct-ranges=0 --num-threads=3000 run

As a comparison, run the above workloads with semi-sync disabled, to verify that the bottleneck is indeed semi-sync.

Suggested fix:
A few spin loops before suspending the thread with pthread_cond_timedwait() (see the sketch below).
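
As a rough illustration of that suggestion (not a proposed patch), a spin-then-wait variant of the earlier sketch could look like this; SPIN_ROUNDS is an arbitrary example value, and the globals (LOCK_binlog, COND_binlog_send, acked_pos, semi_sync_on) are the illustrative ones defined in the sketch above:

/* Reuses the illustrative globals from the earlier sketch. */
enum { SPIN_ROUNDS = 100 };                /* arbitrary example value */

bool wait_for_ack_with_spin(unsigned long long my_pos, const struct timespec *deadline)
{
  /* Optimistic phase: re-check the ACK position a few times without sleeping. */
  for (int i = 0; i < SPIN_ROUNDS; i++) {
    pthread_mutex_lock(&LOCK_binlog);
    bool done = (acked_pos >= my_pos);
    pthread_mutex_unlock(&LOCK_binlog);
    if (done)
      return true;
#if defined(__x86_64__) || defined(__i386__)
    __builtin_ia32_pause();                /* be polite to the sibling hyper-thread */
#endif
  }
  /* Pessimistic phase: fall back to the existing timed wait. */
  pthread_mutex_lock(&LOCK_binlog);
  while (semi_sync_on && acked_pos < my_pos) {
    if (pthread_cond_timedwait(&COND_binlog_send, &LOCK_binlog, deadline) != 0)
      break;                               /* timeout */
  }
  bool acked = (acked_pos >= my_pos);
  pthread_mutex_unlock(&LOCK_binlog);
  return acked;
}

The idea is that when ACKs arrive quickly, most threads finish in the optimistic phase and never pay for a sleep/wakeup cycle.
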
[21 Jan 2015 7:54] Rene' Cannao'
output of pt-pmp

Attachment: semisync_3000threads.tar.gz (application/x-gzip, text), 106.04 KiB.

[21 Jan 2015 16:28] MySQL Verification Team
Hello Rene,

Thank you for the report and test case.
Could you please provide config files from both master/slave used during benchmarks?

Thanks,
Umesh
[21 Jan 2015 19:52] Rene' Cannao'
master config file

Attachment: bug75570_master_config.txt (text/plain), 15.66 KiB.

[21 Jan 2015 19:52] Rene' Cannao'
slave config file

Attachment: bug75570_slave_config.txt (text/plain), 15.66 KiB.

[21 Jan 2015 20:02] Rene' Cannao'
Hi Umesh,
config files attached.

Few notes:
- the servers are running Percona Server, but I don't think that is relevant in this context;
- the way we hit this bug is not that we were deliberately running a high number of threads: any sudden spike in the application that led to more threads executing in parallel also caused mysqld to reduce throughput, and in turn the application opened more connections, slowing down mysqld even more; in this way threads_running jumped from a few hundred to thousands in a short time;
- when such a scenario happens the system is busy doing context switches and throughput decreases drastically: as such, we believe that running spin loops would be a better use of the CPU.
[22 Jan 2015 4:23] MySQL Verification Team
Thank you Rene for providing the requested details.

Thanks,
Umesh
[22 Jan 2015 8:36] Yoshinori Matsunobu
I encountered a similar issue before. Using 5.7 (lossless semisync) improved throughput a lot in our case. You may also try setting sync_binlog=0 and innodb_flush_log_at_trx_commit=0|2 on the master.
[6 Feb 2015 11:24] Libing Song
Hi Rene,
It seems the slave's configuration file you uploaded is not correct.
In the file I saw:
rpl_semi_sync_slave_enabled     OFF

which means the semi-sync slave plugin is not enabled.

Could you please check it?
[16 Feb 2015 13:39] Rene' Cannao'
Hi Libing,

I was running other tests on the same boxes after submitting the bug report, and indeed I had disabled semi-sync.
Sorry for this. I am attaching new global variables for both master and slave.
The only difference is rpl_semi_sync_slave_enabled on the slave.
I re-ran the test with this change to confirm that the behavior is the same as initially reported in the bug.
Again, sorry for the wrong variables file.
[16 Feb 2015 13:39] Rene' Cannao'
master config file

Attachment: glovar_master.txt (text/plain), 13.94 KiB.

[16 Feb 2015 13:40] Rene' Cannao'
slave config

Attachment: glovar_slave.txt (text/plain), 13.95 KiB.

[19 Feb 2015 12:58] MySQL Verification Team
My apologies for the delay in getting back to you on this.
Could you please share exact sysbench stats with semi-sync and with async?
Also, could you please provide the exact my.cnf from both master and slave (please make it private after uploading) instead of SHOW GLOBAL VARIABLES?

I will attach my test details for your reference just in case I'm missing anything (I had to scale down the test case to fit the test environment: innodb_buffer_pool_size adjusted to 150G instead of 283G, sysbench max time changed to 5 minutes instead of 2 hours).

Regards,
Umesh
[19 Feb 2015 12:59] MySQL Verification Team
test results

Attachment: 75570_results.txt (text/plain), 6.76 KiB.

[19 Feb 2015 14:14] MySQL Verification Team
Also, could you please try out official MySQL builds instead of the Percona Server build and share sysbench results with the exact semi-sync replication setup (conf file, and whatever steps were followed afterwards)?

Thanks in advance!

Regards,
Umesh
[19 Feb 2015 15:19] Rene' Cannao'
sysbench results

Attachment: 75570_sysbench.txt (text/plain), 10.31 KiB.

[19 Feb 2015 15:46] Rene' Cannao'
Hi Umesh,

I attached config files and sysbench results with details.

Sorry for not attaching config files from the very beginning, but when I used to work in MySQL Support, config files and global variables were often not identical, so I tend to ask for and provide global variables rather than config files.
In fact, in our setup there is no reference to semi-sync in the config files: semi-sync is only enabled when the server joins production.

With regard to innodb_buffer_pool_size, this should not be relevant. I reproduced the issue with innodb_buffer_pool_size=16G (see attached config files), and I am confident I could reproduce it with an even smaller buffer pool.

sysbench max time is not relevant either, and 5 minutes is more than enough. In fact, the sysbench results I attached now ran for 5 minutes only. In my initial test I ran it for a longer time (2 hrs) because I was trying to understand whether performance just degrades or whether transactions/sec stalls completely in the long run.

Something relevant that needs attention is rpl_semi_sync_master_timeout: we increased it to a very high value so as not to revert to asynchronous replication.
With rpl_semi_sync_master_timeout=10000 (the default), performance is worse than without semi-sync, yet a lot better than with a very high value of rpl_semi_sync_master_timeout, because threads revert to asynchronous replication (so no semi-sync anymore).

In your results I see that with semi-sync enabled you have a max response time of 1890.85ms, therefore in your test rpl_semi_sync_master_timeout doesn't seem to be an issue.

I will try out official MySQL builds instead of the Percona Server build and share new sysbench results.
[20 Feb 2015 4:52] MySQL Verification Team
Thank you Rene for the sysbench results and conf files.
I agree that global variables etc. are enough in most cases, but the list you sent had extra (Percona Server) variables, and I had to filter out those unavailable variables.

We (including the developers) are working on this issue and will keep you posted. Also, please do share the official MySQL build's results when you are done with the sysbench tests.
[20 Feb 2015 11:25] MySQL Verification Team
Hello Rene,

This is just to update you that the bug has been verified on our end, but we would still be interested in seeing your benchmark stats on our official MySQL builds. Please share the details once you are done with the tests.

Regards,
Umesh
[9 Mar 2015 10:35] Venkatesh Duggirala
Post from the Developer:
------------------------
The attachment below is a preview of the patch. Feel free to give it a spin and
let us know if you find any issues in your environment.

Regards,
Venkatesh
[9 Mar 2015 10:37] Venkatesh Duggirala
MySQL 5.6 Preview Patch

Attachment: bug75570-mysql-5.6-patch.diff (text/x-patch), 9.60 KiB.

[27 Apr 2015 8:27] David Moss
Thank you for your feedback. This has been fixed in upcoming versions and the following was added to the changelog:

When using semi-sync replication performance was degrading when the number of threads increased beyond a certain threshold. To improve performance, now only the thread which is committing is responsible for deleting the active transaction node. All other operations do not touch this active transaction list.
[23 Jun 2015 15:41] Laurynas Biveinis
commit a6eed3a7fe61effa7f49cc3990adc44fba9c56a4
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date:   Tue Mar 10 12:32:15 2015 +0530

    Bug#20574628: SEMI-SYNC REPLICATION PERFORMANCE DEGRADES WITH A HIGH NUMBER OF THREADS
    Problem:
    When semi-sync replication is enabled and the number of threads
    running increases (going beyond a threshold), the performance of
    the system drops drastically.
    
    Analysis:
    A transaction in semi-sync replication waits for the ACK
    inside ReplSemiSyncMaster::commitTrx on a condition variable
    COND_binlog_send_. In reportReplyBinlog function, when server
    receives an ACK, it sends condition broadcast to all the
    threads that are waiting on COND_binlog_send_.
    
    Each transaction thread waiting for an ACK does the following (simplified):
    - hold a lock LOCK_binlog_
    - in a loop ( while (is_on()) ) :
      - compare the ACK position with its own position;
         - if the ACK is still behind, it waits on a condition variable COND_binlog_send_
         - if the ACK is ahead or equal, it exits this loop and
            proceeds with the next step
    - unlock the lock LOCK_binlog_
    
    The problem with this design is that all the threads wake up, while it
    is likely that only a few will return to the application and most of them
    will go back to waiting on the same condition variable. This creates
    unnecessary context switches and a lot of LLC cache thrashing on
    multi-CPU systems, which reduces the throughput of the system.
    
    Fix: The approach splits the *one* condvar used by all threads
    to wait into *several* condvars. Each of the condvars is associated
    with a position range. Hence only those threads that are waiting
    for a given acknowledged range are woken when the ACK comes back
    (as opposed to waking up all transactions, regardless of whether
    the ACK affects them or not).
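
As an illustration only, a minimal sketch of the bucketed-condvar idea described in this commit message might look as follows; the bucket count, the fixed position-range mapping and all names are assumptions for the sketch, not what the actual patch implements (the patch keys waiters off entries in the active transaction list):

#include <pthread.h>
#include <time.h>

enum { N_BUCKETS = 16, RANGE = 4096 };   /* example values only */

static pthread_mutex_t LOCK_binlog = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  cond_bucket[N_BUCKETS];   /* one condvar per position range */
static unsigned long long acked = 0;

static int bucket_of(unsigned long long pos)
{
  return (int)((pos / RANGE) % N_BUCKETS);
}

void init_buckets(void)                  /* call once before use */
{
  for (int i = 0; i < N_BUCKETS; i++)
    pthread_cond_init(&cond_bucket[i], NULL);
}

/* Committing thread: waits only on the condvar covering its own position. */
void wait_for_ack(unsigned long long my_pos, const struct timespec *deadline)
{
  pthread_mutex_lock(&LOCK_binlog);
  while (acked < my_pos)
    if (pthread_cond_timedwait(&cond_bucket[bucket_of(my_pos)],
                               &LOCK_binlog, deadline) != 0)
      break;                             /* timeout */
  pthread_mutex_unlock(&LOCK_binlog);
}

/* ACK receiver: wakes only the buckets covering the newly acknowledged span,
   instead of broadcasting to every waiting transaction. */
void report_reply(unsigned long long pos)
{
  pthread_mutex_lock(&LOCK_binlog);
  unsigned long long old = acked;
  if (pos > acked)
    acked = pos;
  for (unsigned long long r = old / RANGE;
       r <= acked / RANGE && r < old / RANGE + N_BUCKETS; r++)
    pthread_cond_broadcast(&cond_bucket[r % N_BUCKETS]);
  pthread_mutex_unlock(&LOCK_binlog);
}

Because each waiter sleeps on the condvar that covers its own position, an ACK only wakes threads whose positions it can actually satisfy.
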
[23 Jun 2015 16:53] Laurynas Biveinis
commit d17d5f47d9f8f11fb0677cbb7620c2d0bf7ca36c
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date:   Mon Mar 30 13:32:57 2015 +0530

    Bug#20574628: SEMI-SYNC REPLICATION PERFORMANCE DEGRADES WITH A HIGH NUMBER OF THREADS
    
    Fixing a post push pb2 valgrind failure:
    ========================================
    
    Analysis: In commitTrx(), when a thread is waiting for the signal
    on a condition variable (entry->cond) after releasing LOCK_binlog_,
    it can happen that this 'entry' node in the active transaction list
    is deleted by another parallel thread which is switching semi-sync off
    (::disableMaster or ::switch_off or resetMaster). This leads to an
    invalid memory read.
    
    Fix: Now with the new design, only the thread that is doing the commit
    is responsible for deleting the active transaction node. All other
    operations (disableMaster, switch_off and resetMaster) will not
    delete nodes from the active transaction list. If disableMaster
    or resetMaster finds that there are no nodes in the list,
    it will delete the active_tranxs_ node.
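
A hedged sketch of the ownership rule described in this commit message is below; the data structure and names (TranxNode, enqueue_tranx, ...) are illustrative assumptions, not the plugin's actual code. The point is that only the committing thread frees its own node, so the switch-off/disable/reset paths can never invalidate memory another thread is still waiting on:

#include <pthread.h>
#include <time.h>

struct TranxNode {
  unsigned long long pos;     /* binlog position this transaction waits for */
  pthread_cond_t     cond;    /* per-transaction wait condvar               */
  TranxNode         *next;
};

static pthread_mutex_t    LOCK_binlog   = PTHREAD_MUTEX_INITIALIZER;
static TranxNode         *active_tranxs = nullptr;
static bool               semi_sync_on  = true;
static unsigned long long acked         = 0;

/* Called by the committing thread before it starts waiting. */
TranxNode *enqueue_tranx(unsigned long long pos)
{
  TranxNode *n = new TranxNode;
  n->pos = pos;
  pthread_cond_init(&n->cond, NULL);
  pthread_mutex_lock(&LOCK_binlog);
  n->next = active_tranxs;
  active_tranxs = n;
  pthread_mutex_unlock(&LOCK_binlog);
  return n;
}

/* Commit path: the committing thread is the only one that unlinks and frees
   its own node, so no other thread can invalidate 'entry' while it waits. */
void commit_trx(TranxNode *entry, const struct timespec *deadline)
{
  pthread_mutex_lock(&LOCK_binlog);
  while (semi_sync_on && acked < entry->pos)
    if (pthread_cond_timedwait(&entry->cond, &LOCK_binlog, deadline) != 0)
      break;                                      /* timeout */
  for (TranxNode **p = &active_tranxs; *p; p = &(*p)->next)
    if (*p == entry) { *p = entry->next; break; } /* unlink own node only */
  pthread_mutex_unlock(&LOCK_binlog);
  pthread_cond_destroy(&entry->cond);
  delete entry;
}

/* Switch-off / disable / reset path: flips the flag and wakes the waiters,
   but never frees nodes that a committing thread may still be waiting on. */
void switch_off()
{
  pthread_mutex_lock(&LOCK_binlog);
  semi_sync_on = false;
  for (TranxNode *n = active_tranxs; n; n = n->next)
    pthread_cond_broadcast(&n->cond);
  pthread_mutex_unlock(&LOCK_binlog);
}
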
[23 Jun 2015 16:54] Laurynas Biveinis
commit a0e941947efe90588d86b39b4d3ec347ee9c7597
Author: Venkatesh Duggirala <venkatesh.duggirala@oracle.com>
Date:   Tue Mar 31 11:18:30 2015 +0530

    Bug#20574628: SEMI-SYNC REPLICATION PERFORMANCE DEGRADES WITH A HIGH NUMBER OF THREADS
    
    Fixing a post push pb2 valgrind failure:
    ========================================
    Rpl_semi_sync::resetMaster does not have to do active_tranxs_ cleanup
[20 Feb 2016 17:43] MySQL Verification Team
This was fixed in 5.6.25.

https://dev.mysql.com/doc/relnotes/mysql/5.6/en/news-5-6-25.html
"Replication: When using semisynchronous replication performance was degrading when the number of threads increased beyond a certain threshold. To improve performance, now only the thread which is committing is responsible for deleting the active transaction node. All other operations do not touch this active transaction list. (Bug #75570, Bug #20574628)"