Bug #70218 Semisync master plugin with many slaves causes plugin_lock mutex contentions
Submitted: 2 Sep 2013 21:42 Modified: 27 Jan 2014 13:56
Reporter: Yoshinori Matsunobu (OCA)
Status: Closed
Category: MySQL Server: Replication
Severity: S5 (Performance)
Version: 5.6.13
OS: Any
Assigned to:

[2 Sep 2013 21:42] Yoshinori Matsunobu
Description:
If the semi-synchronous master plugin is installed on a master and many slaves (binlog dump threads) are running, contention on the plugin_lock mutex becomes very hot. This happens even when semi-synchronous replication is not configured. Here are example vmstat and Linux perf statistics.

procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
40  0      0 71491008 1113192 61081048    0    0     0 11128 44503 199225 55 42  4  0  0
43  0      0 71486336 1113192 61082492    0    0     0 10012 44339 200647 55 41  4  0  0

    43.62%      46440   mysqld  mysqld                [.] mutex_delay
             |
             --- mutex_delay
                |
                |--49.37%-- plugin_lock(THD*, st_plugin_int**)
                |          |
                |          |--33.73%-- Binlog_transmit_delegate::after_send_event(THD*, unsigned short, String*, char const*, unsigned long long)
                |          |          mysql_binlog_send(THD*, char*, unsigned long long, Gtid_set const*)
                |          |          |
                |          |          |--80.13%-- com_binlog_dump(THD*, char*, unsigned int)
                |          |          |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                |--33.81%-- plugin_unlock_list(THD*, st_plugin_int**, unsigned int)
                |          |
                |          |--50.40%-- Binlog_transmit_delegate::after_send_event(THD*, unsigned short, String*, char const*, unsigned long long)
                |          |          mysql_binlog_send(THD*, char*, unsigned long long, Gtid_set const*)
                |          |          |
                |          |          |--80.99%-- com_binlog_dump(THD*, char*, unsigned int)
                |          |          |          dispatch_command(enum_server_command, THD*, char*, unsigned int)
                |--16.73%-- plugin_unlock(THD*, st_plugin_int*)
                |          Binlog_transmit_delegate::reserve_header(THD*, unsigned short, String*)
                |          _ZL21reset_transmit_packetP3THDtPmPPKcP6StringPcm.constprop.76
                |          mysql_binlog_send(THD*, char*, unsigned long long, Gtid_set const*)
                |          |
                |          |--80.62%-- com_binlog_dump(THD*, char*, unsigned int)

    12.70%      13476   mysqld  libpthread-2.14.1.so  [.] pthread_mutex_trylock
             |
             --- pthread_mutex_trylock
                |
                |--99.80%-- my_pthread_fastmutex_lock
                |          |
                |          |--49.35%-- plugin_lock(THD*, st_plugin_int**)
                |          |          |
                |          |          |--34.01%-- Binlog_transmit_delegate::before_send_event(THD*, unsigned short, String*, char const*, unsigned long long)
                |          |          |          mysql_binlog_send(THD*, char*, unsigned long long, Gtid_set const*)
                |          |          |          |

plugin_lock is taken via the FOREACH_OBSERVER macro.

From rpl_handler.cc:
-----------------------
/*
  This macro is used by almost all the Delegate methods to iterate
  over all the observers running given callback function of the
  delegate .

  Add observer plugins to the thd->lex list, after each statement, all
  plugins add to thd->lex will be automatically unlocked.
 */
#define FOREACH_OBSERVER(r, f, thd, args)                               \
  param.server_id= thd->server_id;                                      \
  /*
     Use a struct to make sure that they are allocated adjacent, check
     delete_dynamic().
  */                                                                    \
  struct {                                                              \
    DYNAMIC_ARRAY plugins;                                              \
    /* preallocate 8 slots */                                           \
    plugin_ref plugins_buffer[8];                                       \
  } s;                                                                  \
  DYNAMIC_ARRAY *plugins= &s.plugins;                                   \
  plugin_ref *plugins_buffer= s.plugins_buffer;                         \
  my_init_dynamic_array2(plugins, sizeof(plugin_ref),                   \
                         plugins_buffer, 8, 8);                         \
  read_lock();                                                          \
  Observer_info_iterator iter= observer_info_iter();                    \
  Observer_info *info= iter++;                                          \
  for (; info; info= iter++)                                            \
  {                                                                     \
    plugin_ref plugin=                                                  \
      my_plugin_lock(0, &info->plugin); 
---------------

How to repeat:
1. Setup normal master/slave replication. On master, set less durable settings (innodb_flush_log_at_trx_commit=2, sync_binlog=0)

2. Run many slaves or binlog readers
* Running many streaming mysqlbinlog readers is an easy way to test
for i in `seq 1 40`
do
mysqlbinlog --no-defaults --host=master_host --raw --read-from-remote-server --to-last-log --stop-never  master-binlog.000001 &
done

In the above case, make sure that 40 Binlog Dump threads are running on the master

3. Run concurrent inserts on master, and see how many inserts can be done per second on master
* mysqlslap is an easy way to test
mysqlslap --concurrency=100 --iterations=1 --engine=innodb --auto-generate-sql --auto-generate-sql-load-type=write --number-of-queries=1000000 --host=master_host

mysqladmin extended-status -i 1 -r  | grep Innodb_rows_inserted

4. Install the semisync master plugin on the master (without configuring semisync replication), and run the same benchmark
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';

5. Configure semisync slave, and run the same benchmark

I ran concurrent insert benchmarks (100 clients) and got the results below.

* 40 binlog dump threads, semisync master plugin not installed:
40000 inserts/s (on master)

* 40 binlog dump threads, semisync master plugin installed, semisync replication not configured:
10000 inserts/s

* N binlog dump threads, semisync plugin installed, semisync replication configured:
N=1: 28000/s
N=5: 18000/s
N=10: 5622/s
N=20: 1543/s
N=40:  657/s

Suggested fix:
The FOREACH_OBSERVER macro should be refactored.
[10 Sep 2013 10:29] Libing Song
Yoshinori,

Thanks for reporting the bug. Verified as described.
[4 Oct 2013 0:49] Yoshinori Matsunobu
We tried a read/write lock in place of the plugin_lock mutex, but throughput still dropped with many binlog readers. I tried it on both 5.6.12 and 5.7.2 (with the rw plugin_lock patch), and the performance drop was similar.

When I tested with 6 semisync readers, Performance Schema said that most of the time was spent on MYSQL_BIN_LOG::COND_done and MYSQL_BIN_LOG::update_cond.
+-------------------------------------------------+------------+-------------------+--------------------+-------------------------------+
| event_name                                      | count_star | cycles_per_use    | nsecs_per_use      |   sum_timer_wait / 1000000000 |
+-------------------------------------------------+------------+-------------------+--------------------+-------------------------------+
| wait/synch/cond/sql/MYSQL_BIN_LOG::COND_done    |    1623042 |   2321702458.9891 |   2321702.45898915 |                  3768220.6024 |
| wait/synch/cond/sql/MYSQL_BIN_LOG::update_cond  |       1294 | 178488587776.0935 | 178488587.77609351 |                   230964.2326 |
| wait/synch/mutex/sql/LOCK_plugin_ref_count      |  100551558 |       578373.9633 |       578.37396331 |                    58156.4031 |
| wait/synch/cond/semisync/COND_binlog_send_      |      10144 |   3655792573.2009 |   3655792.57320091 |                    37084.3599 |
| wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_commit |      54176 |    574114357.4060 |    574114.35740595 |                    31103.2194 |
| wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_done   |     829026 |     31978718.4707 |     31978.71847071 |                    26511.1891 |
| wait/synch/mutex/innodb/log_sys_mutex           |    5759790 |      2535633.8909 |      2535.63389092 |                    14604.7187 |
| wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_sync   |     136661 |     94068061.4575 |     94068.06145751 |                    12855.4353 |
| wait/synch/mutex/innodb/trx_pool_mutex          |        360 |  21291551169.7778 |  21291551.16977778 |                     7664.9584 |
| wait/synch/mutex/sql/MYSQL_BIN_LOG::LOCK_log    |     183493 |     37549067.4537 |     37549.06745374 |                     6889.9910 |
| wait/synch/mutex/sql/Gtid_state                 |    2498645 |      1875735.4205 |      1875.73542052 |                     4686.7969 |
| wait/synch/rwlock/sql/MDL_lock::rwlock          |    5040401 |       589001.2695 |       589.00126947 |                     2968.8026 |
| wait/synch/mutex/innodb/lock_mutex              |    1716887 |      1368366.7427 |      1368.36674272 |                     2349.3311 |
| wait/synch/mutex/semisync/LOCK_binlog_          |   10059514 |       207615.7957 |       207.61579566 |                     2088.5140 |
| wait/synch/mutex/innodb/trx_sys_mutex           |    1680427 |       812003.3383 |       812.00333828 |                     1364.5123 |
| wait/synch/rwlock/innodb/hash_table_locks       |    5490628 |       133412.9837 |       133.41298369 |                      732.5211 |
| wait/synch/mutex/sql/LOCK_table_cache           |    1668519 |       395099.0052 |       395.09900521 |                      659.2302 |
| wait/synch/mutex/sql/THD::LOCK_thd_data         |    4156949 |       134511.7526 |       134.51175257 |                      559.1585 |
| wait/synch/mutex/sql/MDL_map::mutex             |     834473 |       610109.2292 |       610.10922917 |                      509.1197 |
| wait/synch/rwlock/sql/LOCK_grant                |    1673772 |       271363.6023 |       271.36360229 |                      454.2008 |
+-------------------------------------------------+------------+-------------------+--------------------+-------------------------------+
[7 Oct 2013 17:17] Luis Soares
Libing reported something similar:

"""
I verify the performance regression when setting up semisync with 40 dump
threads.

But I would say contention on LOCK_plugin is only one minor cause.
I tried to remove LOCK_plugin from the hook code,
but there is still a remarkable performance regression.

I suspected LOCK_log would be a hot contention too, so I tested on
mysql-trunk in which LOCK_log was removed from dump thread.
Yes, mysql-trunk has better performance, but there is still remarkable
performance regression.

below the test detail:
Test on 5.6:

mysqlslap -S var/tmp/mysqld.1.sock -u root --auto-generate-sql
          --engine=innodb --auto-generate-sql-load-type=write
          -i 3 --number-of-queries=100000 -c 100

        Running for engine innodb
        Average number of seconds to run all queries: 10.360 seconds
        Minimum number of seconds to run all queries: 10.338 seconds
        Maximum number of seconds to run all queries: 10.393 seconds
        Number of clients running queries: 100
        Average number of queries per client: 1000

semisync with 40 dump threads:

        Running for engine innodb
        Average number of seconds to run all queries: 52.056 seconds
        Minimum number of seconds to run all queries: 51.586 seconds
        Maximum number of seconds to run all queries: 52.309 seconds
        Number of clients running queries: 100
        Average number of queries per client: 1000

    after LOCK_plugin was removed

        Running for engine innodb
        Average number of seconds to run all queries: 42.026 seconds
        Minimum number of seconds to run all queries: 41.358 seconds
        Maximum number of seconds to run all queries: 42.661 seconds
        Number of clients running queries: 100
        Average number of queries per client: 1000

Test on trunk:
On trunk, dump thread doesn't use LOCK_log[1]. so it has better performance
than 5.6.

mysqlslap -S var/tmp/mysqld.1.sock -u root --auto-generate-sql
          --engine=innodb --auto-generate-sql-load-type=write
          -i 3 --number-of-queries=100000 -c 100

semisync with no dump thread:
        Running for engine innodb
        Average number of seconds to run all queries: 8.024 seconds
        Minimum number of seconds to run all queries: 8.005 seconds
        Maximum number of seconds to run all queries: 8.059 seconds
        Number of clients running queries: 100
        Average number of queries per client: 1000

semisync with 40 dump threads:

        Running for engine innodb
        Average number of seconds to run all queries: 33.594 seconds
        Minimum number of seconds to run all queries: 33.444 seconds
        Maximum number of seconds to run all queries: 33.680 seconds
        Number of clients running queries: 100
        Average number of queries per client: 1000

    after LOCK_plugin was removed

        Running for engine innodb
        Average number of seconds to run all queries: 21.415 seconds
        Minimum number of seconds to run all queries: 21.068 seconds
        Maximum number of seconds to run all queries: 21.995 seconds
        Number of clients running queries: 100
        Average number of queries per client: 1000

"""

[1] http://my-replication-life.blogspot.com/2013/09/dump-thread-enhancement.html
[11 Oct 2013 8:48] Libing Song
Posted by developer:
 
While analyzing the semisync code for a bug, I found the log information below (I added the debug code):

[Note] report commit, is_real_trans 1, log_pos 0, sql SET GLOBAL rpl_semi_sync_master_enabled = 1
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 0, sql SELECT UNIX_TIMESTAMP()
[Note] report commit, is_real_trans 1, log_pos 0, sql SHOW VARIABLES LIKE 'SERVER_ID'
[Note] report commit, is_real_trans 1, log_pos 0, sql SET @master_binlog_checksum= @@global.binlog_checksum

[Note] report commit, is_real_trans 1, log_pos 0, sql CREATE TABLE t1(c1 INT) ENGINE=InnoDB
[Note] report commit, is_real_trans 1, log_pos 231, sql CREATE TABLE t1(c1 INT) ENGINE=InnoDB
[Note] report commit, is_real_trans 1, log_pos 231, sql CREATE TABLE t1(c1 INT) ENGINE=InnoDB

[Note] report commit, is_real_trans 0, log_pos 0, sql INSERT INTO t1 VALUES(1)
[Note] report commit, is_real_trans 0, log_pos 0, sql INSERT INTO t1 VALUES(3)
[Note] report commit, is_real_trans 0, log_pos 0, sql INSERT INTO t1 VALUES(5)
[Note] report commit, is_real_trans 1, log_pos 635, sql COMMIT

[Note] report commit, is_real_trans 1, log_pos 843, sql INSERT INTO t1 VALUES(9)
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 843, sql (null)
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 0, sql (null)
[Note] report commit, is_real_trans 1, log_pos 843, sql (null)

To conclude: the semisync after_commit hook is called for almost every command and statement.
It is called for SELECT, SET, and other statements and commands that don't generate binlog events.
It is called for each statement in a transaction (excluding BEGIN).
It is even called three times for a single DDL.

It may not be the major cause of this bug, but I think it is one of the causes that make LOCK_plugin hot, and it should be fixed.
[15 Oct 2013 4:25] Yoshinori Matsunobu
We recently investigated how to fix this problem, and here are our conclusions.

First we tried replacing the plugin_lock mutex with a read/write lock, but it did not help much. Linux profiling tools showed that plugin_lock still caused contention.

During profiling, we learned that most (if not all) glibc rw-locks have an internal lock (a mutex-like structure) on which threads can stall. The pattern is: acquire the internal lock, get exclusive access to the cache line to modify the data, release the lock.

So switching plugin_lock to a read/write lock was actually a bad idea. Instead, the plugin-related lock calls below needed to be removed wherever possible.
- plugin_lock
- plugin_lock_list
- plugin_unlock
- plugin_unlock_list

We also noticed that the Delegate classes had read/write locks that caused very hot contention (especially Binlog_transmit_delegate::lock). The read/write lock protects a list, so switching to a lock-free list is probably possible. BTW, we noticed that Performance Schema did not collect statistics on the mutexes in the Delegate classes (bug#70577).

The real problem was that all of the above locks were taken not only by semisync binlog readers, but also by non-semisync binlog readers.

Based on the above, we changed the code so that only semisync binlog readers take these locks; non-semisync binlog readers skip them. Below is a benchmark result.

<total binlog readers>: <number of semisync>: <inserts/s>
1 1:  30358.23
5 1:  29468.97
10 1: 27918.92
15 1: 27159.15
30 1: 27493.68

30 3: 27129.68

30 30: 1072.23

With many (~30) semisync binlog readers, performance dropped significantly and plugin_* mutex contention became very hot. But as long as only a few (2-3) semisync binlog readers and some (3-10) non-semisync binlog readers were running, the performance penalty was very low.
[6 Dec 2013 12:02] Libing Song
Posted by developer:
 
Hi Yoshinori,

I prepared a patch to remove the rpl hook call from non-semisync dump threads. My test shows there is still a big performance regression, caused by LOCK_log. So I wonder how you got your test results. Could you please provide more test information, e.g. configuration, hardware, and where the data is stored (memory, SSD, or hard disk)?
[8 Dec 2013 1:25] Yoshinori Matsunobu
I tested semisync throughput on the following conditions.
- One master and 1~3 semisync slaves, ~20 non-semisync slaves
- Measuring throughput (innodb_rows_inserted via mysqlslap) on the master, not slaves
- The SQL thread was stopped on all slaves; only the I/O threads were running. Actually, we extended mysqlbinlog to speak the semisync protocol and used that instead of semisync slaves. Either way, the bottlenecks on the master should be the same.
- Master and slaves ran on different machines
- Less durable settings on master (innodb_flush_log_at_trx_commit=0, sync_binlog=0)
- CPU bound benchmark (working data set size is much smaller than RAM). So storage didn't matter

I assume you measured performance on the semisync slave and saw LOCK_log contention, is that correct? I know that the SQL thread and I/O thread conflict on LOCK_log and that semisync performance drops happen on the slave. What if you stop the SQL thread on the semisync slave?
[10 Dec 2013 6:04] Libing Song
Posted by developer:
 
Hi Yoshinori,

Thanks for your helpful information.

I tested semisync master performance, but I didn't stop the SQL thread on the slave, so semisync performance was fairly low and could not reflect the real effect of the patch. I also stored mysqlbinlog's output in memory, which made the master's LOCK_log fairly hot and affected the test result too.

After stopping the SQL thread and storing mysqlbinlog's output to hard disk, I got the results below:
Without the patch:
only semisync:         14000 inserts/s
semisync + 40 dumps:   1000 inserts/s

After fixing it:
only semisync:         14000 inserts/s
semisync + 40 dumps:   10000 inserts/s
[27 Jan 2014 13:56] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

Fix documented in the 5.6.17 and 5.7.4 changelogs as follows:

        Semisynchronous replication became very slow if there were many
        dump threads (such as from mysqlbinlog or slave I/O connections)
        working at the same time. It was also found that semisynchronous
        master plugin functions were called even when the dump
        connections did not support semisynchronous replication, which
        led to contention on the plugin lock as well as time wasted on
        unnecessary code.

        After this fix, non-semisynchronous dump threads no longer call
        semisynchronous master functions to observe binary log events.

Closed.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html
[14 Feb 2014 23:40] Ben Krug
diff for fix against 5.6

Attachment: 5.6_diff.diff (text/x-patch), 15.64 KiB.

[28 Mar 2014 19:21] Laurynas Biveinis
5.6$ bzr log -r 5773 -n0
------------------------------------------------------------
revno: 5773
committer: Libing Song <libing.song@oracle.com>
branch nick: mysql-5.6
timestamp: Sun 2014-01-26 11:23:00 +0800
message:
  Bug#17434690 SEMISYNC MASTER PLUGIN WITH MANY SLAVES CAUSES
               PLUGIN_LOCK MUTEX CONTENTIONS
  
  Semisync became very slow if there were many dump
  threads (mysqlbinlog or slave I/O connections) working at the
  same time. Even though the dump connections didn't support
  semisync, semisync master plugin functions were still
  called. That made the plugin lock very hot and also wasted
  some time on unnecessary code.
  
  After this fix, non-semisync dump threads don't call
  semisync master functions to observe binary log events anymore.
  The implementation is:
  
  * As you know, every dump thread calls the transmit_start hook
    to tell the semisync master that a dump thread has started.
  
    Now, semisync master plugin's transmit_start() returns
    either F_OBSERVE or F_DONT_OBSERVE flag to the dump thread.
  
    F_OBSERVE is returned if the dump connection is a semisync
    connection, otherwise F_DONT_OBSERVE is returned.
  
  * The dump thread will call the hooks to observe every event if
    it got the F_OBSERVE flag. Otherwise it will not call the hooks.