| Bug #38501 | hold times on prepare_commit_mutex limit read/write scalability | ||
|---|---|---|---|
| Submitted: | 31 Jul 2008 17:45 | Modified: | 10 Oct 2008 4:06 |
| Reporter: | David Lutz | Email Updates: | |
| Status: | Duplicate | Impact on me: | |
| Category: | MySQL Server: InnoDB storage engine | Severity: | S4 (Feature request) |
| Version: | 5.0, 5.1, 6.0 | OS: | Any |
| Assigned to: | Assigned Account | CPU Architecture: | Any |
| Tags: | Contribution | ||
[31 Jul 2008 17:50]
Sergei Golubchik
also there's possible to skip the loking of the mutex when this binlog-commit ordering is not required. it could be controlled with a command-line option, that a user can set.
[5 Aug 2008 16:18]
David Lutz
I've done some more testing to determine the full impact of the prepare_commit_mutex, comparing read/write throughput of MySQL 6.0.5 without binlog enabled, with binlog enabled, with binlog enabled and the proposed prepare_commit_mutex improvement applied, and with the acquire and release of prepare_commit_mutex commented out. With the 6.0.5 baseline, throughput without binlog enabled scaled moderately to 48 threads, peaking at roughly 1500 TPS. With binlog enabled, scaling stops at 12 threads, peaking at roughly 545 TPS, or 36% of throughput without binlog. By applying the proposed prepare_commit_mutex improvement, scaling improves somewhat, with very moderate scaling to 32 threads at roughly 690 TPS. This is roughly a 25% improvement, but is still only 46% of throughput without binlog enabled. By commenting out the acquire and release of prepare_commit_mutex, moderate scaling is achieved to 48 threads at roughly 1390 TPS. This is a 250% improvement over the initial binlog throughput, and is roughly 90% of the throughput without binlog. This demonstrates the need to break up the serialization of transactions that is currently enforced by the prepare_commit_mutex lock. The proposal to enabled/disable the lock with an option is tempting, but I think will lead to problems. Someone may enable the feature unnecessarily and take a huge performance hit with no benefit, or worse, disable it when they need it, breaking hot backups. It also leaves a very heavy cost for hot backups, even when configured correctly, and the cost would be easily demonstrated by turning the option on and off. I currently plan to go back to the 5.0 source code to see how things were done there, to see if this presents another option.
[7 Aug 2008 13:07]
Heikki Tuuri
Assigning this as a feature request to Inaam. The bug "MySQL/InnoDB group commit is broken in 5.1" is essentially the same bug report.
[13 Aug 2008 19:00]
David Lutz
I think I may have found a solution that eliminates contention on the prepare_commit_mutex and restores the concurrent commit behavior that was lost in the 4.1 to 5.0 upgrade (see Bug#13669). I have a new prototype that is inspired by the old 4.1 behavior, and it is showing really great performance. The change is to do a write to the redo log without a flush, to secure a place in the redo log that is in the same order as the binlog, then release the prepare_commit_mutex, then flush the redo log. This preserves binlog/redo log ordering while reducing serialization of InnoDB transactions and allowing for concurrent commits, and is similar to the old behavior from MYSQL_LOG::write(Log_event* event_info) in mysql-4.1.22/sql/log.cc. My testing on a multi-core system shows as much as 2.5X throughput increase on a 48 thread, read/write, sysbench OLTP test. At the same time, I see a 75% reduction in disk writes per transaction due to coalescing of concurrent commits. I will upload a diff showing changes to storage/innobase/handler/ha_innodb.cc from mysql-6.0.5-alpha-pb87. This diff includes the previously suggested changes to innobase_xa_prepare(), as well as additional changes to innobase_commit(). Caveats: Includes the same caveats as the patch submitted at [31 Jul 19:45] as well as: In this prototype, the initial call to innobase_commit_low() with trx->flush_log_later == true is made without checking for srv_commit_concurrency > 0 and without acquiring the commit_cond_m lock or commit_cond condition variable. This was done to avoid intermixing acquire and release of prepare_commit_mutex and commit_cond_m, which might lead to deadlock. However, srv_commit_concurrency is checked after prepare_commit_mutex has been released and before performing the flush to disk. If it is necessary to check srv_commit_concurrency before even performing a buffered commit, the prototype could be modified to perform the check before both the buffered commit and the flush to disk.
[13 Aug 2008 19:03]
David Lutz
Diff showing changes to storage/innobase/handler/ha_innodb.cc as described in the comment of [13 Aug 21:00]
Attachment: prepare_commit_mutex.patch3 (application/octet-stream, text), 4.80 KiB.
[20 Aug 2008 22:37]
Inaam Rana
Duplicate of http://bugs.mysql.com/bug.php?id=13669
[10 Oct 2008 4:06]
David Lutz
Please see Bug#13669 for follow up comments and questions about this issue and the proposed fix.
[7 Aug 2009 16:53]
Mark Callaghan
This is a serious performance regression from 4.1. Why was this changed to a feature request?

Description: Due to long hold times, prepare_commit_mutex becomes very hot for multi-threaded workloads when binlog logging is enabled, limiting scalability. For example, with a read-write, oltp sysbench test when binlog logging is enabled, scaling begins to fall rapidly from 4 to 8 threads, and there is essentially no scaling beyond 12 threads. Comments in storage/innobase/handler/ha_innodb.cc:innobase_xa_prepare() indicate that this lock is used to ensure that the ordering of entries in the binlog file are the same as those in the InnoDB logs. The exsmple in the comments implies that the main concern is the ordering of binlog and redo log entries. The lock is acquired very early in innobase_xa_prepare(), and the lock is held for the entire prepare phase, before anything is written to the binlog. The lock is then held until after innobase_commit_low() is called in storage/innobase/handler/ha_innodb.cc:innobase_commit(). It may be possible to acquire the lock later in innobase_xa_prepare(), reducing hold times and increasing scalability. How to repeat: Create a 10,000,000 row sysbench database, enable binlog logging, then execute the equivalent of "sysbench --max-time=300 --max-requests=0 --test=oltp --oltp-dist-type=special --oltp-table-size=10000000 --num-threads=$THREADS run" scaling the number of threads. Scaling begins to fall rapidly from 4 to 8 threads, and there is essentially no scaling beyond 12 threads. Lock profiling will show that prepare_commit_mutex is among the hottest locks in mysqld as the number of sysbench threads is scaled. Suggested fix: Move the lock acquire as close as possible to the actual update of the binlog file. Code inspection and Dtrace tracing suggests that moving the acquire to a point immediately after the call to trx_prepare_for_mysql() in innobase_xa_prepare() allows prepares to be completed in parallel, but forces serialization of the binlog file and redo log updates. Initial testing of a proposed patch, applied to 6.0.5, shows more than 20% throughput gain on a 32 thread sysbench test. Caveats: Updates to the undo log occur during the call to trx_prepare_for_mysql(). Acquiring prepare_commit_mutex after calling that function may allow updates to the undo log to appear in a different order than updates to the redo log (and the binlog file). It is unclear to me at this time whether that is a problem. In the proposed patch, prepare_commit_mutex is acquired regardless of the outcome of trx_prepare_for_mysql() (ie even on error). That matches the current behavior, since prepare_commit_mutex is currently acquired before trx_prepare_for_mysql() is called, and continues to be held regardless of its outcome. prepare_commit_mutex also relates to bugid 37091 "Binlog filename and position saved in InnoDB might be inconsistent" which is under review. The fix for that bug may alter the usage of prepare_commit_mutex. Proposed Patch: *** storage/innobase/handler/ha_innodb.cc.orig Wed Apr 16 06:05:10 2008 --- storage/innobase/handler/ha_innodb.cc Tue Jul 29 19:01:45 2008 *************** *** 7699,7731 **** int error = 0; trx_t* trx = check_trx_exists(thd); - if (thd_sql_command(thd) != SQLCOM_XA_PREPARE && - (all || !thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN))) - { - - /* For ibbackup to work the order of transactions in binlog - and InnoDB must be the same. Consider the situation - - thread1> prepare; write to binlog; ... - <context switch> - thread2> prepare; write to binlog; commit - thread1> ... commit - - To ensure this will not happen we're taking the mutex on - prepare, and releasing it on commit. - - Note: only do it for normal commits, done via ha_commit_trans. - If 2pc protocol is executed by external transaction - coordinator, it will be just a regular MySQL client - executing XA PREPARE and XA COMMIT commands. - In this case we cannot know how many minutes or hours - will be between XA PREPARE and XA COMMIT, and we don't want - to block for undefined period of time. - */ - pthread_mutex_lock(&prepare_commit_mutex); - trx->active_trans = 2; - } - if (!THDVAR(thd, support_xa)) { return(0); --- 7699,7704 ---- *************** *** 7754,7759 **** --- 7727,7756 ---- ut_ad(trx->active_trans); error = (int) trx_prepare_for_mysql(trx); + if (thd_sql_command(thd) != SQLCOM_XA_PREPARE) { + + /* For ibbackup to work the order of transactions in binlog + and InnoDB must be the same. Consider the situation + + thread1> prepare; write to binlog; ... + <context switch> + thread2> prepare; write to binlog; commit + thread1> ... commit + + To ensure this will not happen we're taking the mutex on + prepare, and releasing it on commit. + + Note: only do it for normal commits, done via ha_commit_trans. + If 2pc protocol is executed by external transaction + coordinator, it will be just a regular MySQL client + executing XA PREPARE and XA COMMIT commands. + In this case we cannot know how many minutes or hours + will be between XA PREPARE and XA COMMIT, and we don't want + to block for undefined period of time. + */ + pthread_mutex_lock(&prepare_commit_mutex); + trx->active_trans = 2; + } } else { /* We just mark the SQL statement ended and do not do a transaction prepare */