MySQL Bugs: #38501: hold times on prepare_commit_mutex limit read/write scalability

Bug #38501	hold times on prepare_commit_mutex limit read/write scalability
Submitted:	31 Jul 2008 17:45	Modified:	10 Oct 2008 4:06
Reporter:	David Lutz	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	5.0, 5.1, 6.0	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	Contribution

Description:
Due to long hold times, prepare_commit_mutex becomes very hot for
multi-threaded workloads when binlog logging is enabled, limiting
scalability.  For example, with a read-write, oltp sysbench test
when binlog logging is enabled, scaling begins to fall rapidly
from 4 to 8 threads, and there is essentially no scaling beyond
12 threads.

Comments in storage/innobase/handler/ha_innodb.cc:innobase_xa_prepare()
indicate that this lock is used to ensure that the ordering of entries in
the binlog file are the same as those in the InnoDB logs.  The exsmple in
the comments implies that the main concern is the ordering of binlog and
redo log entries.  The lock is acquired very early in innobase_xa_prepare(),
and the lock is held for the entire prepare phase, before anything is
written to the binlog. The lock is then held until after innobase_commit_low()
is called in storage/innobase/handler/ha_innodb.cc:innobase_commit().

It may be possible to acquire the lock later in innobase_xa_prepare(),
reducing hold times and increasing scalability.

How to repeat:
Create a 10,000,000 row sysbench database, enable binlog logging, then execute
the equivalent of "sysbench --max-time=300 --max-requests=0 --test=oltp --oltp-dist-type=special --oltp-table-size=10000000 --num-threads=$THREADS run" scaling the number of threads.  Scaling begins to fall rapidly from 4 to 8
threads, and there is essentially no scaling beyond 12 threads.  Lock
profiling will show that prepare_commit_mutex is among the hottest locks
in mysqld as the number of sysbench threads is scaled.

Suggested fix:
Move the lock acquire as close as possible to the actual update of the binlog
file.  Code inspection and Dtrace tracing suggests that moving the acquire
to a point immediately after the call to trx_prepare_for_mysql() in
innobase_xa_prepare() allows prepares to be completed in parallel, but forces
serialization of the binlog file and redo log updates.  Initial testing of
a proposed patch, applied to 6.0.5, shows more than 20% throughput gain on
a 32 thread sysbench test.

Caveats:

Updates to the undo log occur during the call to trx_prepare_for_mysql().
Acquiring prepare_commit_mutex after calling that function may allow
updates to the undo log to appear in a different order than updates
to the redo log (and the binlog file).  It is unclear to me at this time
whether that is a problem.

In the proposed patch, prepare_commit_mutex is acquired regardless of the
outcome of trx_prepare_for_mysql() (ie even on error).  That matches the
current behavior, since prepare_commit_mutex is currently acquired before
trx_prepare_for_mysql() is called, and continues to be held regardless
of its outcome.

prepare_commit_mutex also relates to bugid 37091 "Binlog filename and
position saved in InnoDB might be inconsistent" which is under review.
The fix for that bug may alter the usage of prepare_commit_mutex.

Proposed Patch:

*** storage/innobase/handler/ha_innodb.cc.orig	Wed Apr 16 06:05:10 2008
--- storage/innobase/handler/ha_innodb.cc	Tue Jul 29 19:01:45 2008
***************
*** 7699,7731 ****
  	int error = 0;
  	trx_t* trx = check_trx_exists(thd);
  
- 	if (thd_sql_command(thd) != SQLCOM_XA_PREPARE &&
- 	    (all || !thd_test_options(thd, OPTION_NOT_AUTOCOMMIT | OPTION_BEGIN)))
- 	{
- 
- 		/* For ibbackup to work the order of transactions in binlog
- 		and InnoDB must be the same. Consider the situation
- 
- 		  thread1> prepare; write to binlog; ...
- 			  <context switch>
- 		  thread2> prepare; write to binlog; commit
- 		  thread1>			     ... commit
- 
- 		To ensure this will not happen we're taking the mutex on
- 		prepare, and releasing it on commit.
- 
- 		Note: only do it for normal commits, done via ha_commit_trans.
- 		If 2pc protocol is executed by external transaction
- 		coordinator, it will be just a regular MySQL client
- 		executing XA PREPARE and XA COMMIT commands.
- 		In this case we cannot know how many minutes or hours
- 		will be between XA PREPARE and XA COMMIT, and we don't want
- 		to block for undefined period of time.
- 		*/
- 		pthread_mutex_lock(&prepare_commit_mutex);
- 		trx->active_trans = 2;
- 	}
- 
  	if (!THDVAR(thd, support_xa)) {
  
  		return(0);
--- 7699,7704 ----
***************
*** 7754,7759 ****
--- 7727,7756 ----
  		ut_ad(trx->active_trans);
  
  		error = (int) trx_prepare_for_mysql(trx);
+ 		if (thd_sql_command(thd) != SQLCOM_XA_PREPARE) {
+ 
+ 			/* For ibbackup to work the order of transactions in binlog
+ 			and InnoDB must be the same. Consider the situation
+ 
+ 			  thread1> prepare; write to binlog; ...
+ 				  <context switch>
+ 			  thread2> prepare; write to binlog; commit
+ 			  thread1>			     ... commit
+ 
+ 			To ensure this will not happen we're taking the mutex on
+ 			prepare, and releasing it on commit.
+ 
+ 			Note: only do it for normal commits, done via ha_commit_trans.
+ 			If 2pc protocol is executed by external transaction
+ 			coordinator, it will be just a regular MySQL client
+ 			executing XA PREPARE and XA COMMIT commands.
+ 			In this case we cannot know how many minutes or hours
+ 			will be between XA PREPARE and XA COMMIT, and we don't want
+ 			to block for undefined period of time.
+ 			*/
+ 			pthread_mutex_lock(&prepare_commit_mutex);
+ 			trx->active_trans = 2;
+ 		}
  	} else {
  		/* We just mark the SQL statement ended and do not do a
  		transaction prepare */

also there's possible to skip the loking of the mutex when this binlog-commit ordering is not required. it could be controlled with a command-line option, that a user can set.

I've done some more testing to determine the full impact of the
prepare_commit_mutex, comparing read/write throughput of MySQL 6.0.5
without binlog enabled, with binlog enabled, with binlog enabled and
the proposed prepare_commit_mutex improvement applied, and with the
acquire and release of prepare_commit_mutex commented out.

With the 6.0.5 baseline, throughput without binlog enabled scaled
moderately to 48 threads, peaking at roughly 1500 TPS.  With binlog
enabled, scaling stops at 12 threads, peaking at roughly 545 TPS, or
36% of throughput without binlog.

By applying the proposed prepare_commit_mutex improvement, scaling
improves somewhat, with very moderate scaling to 32 threads at roughly
690 TPS.  This is roughly a 25% improvement, but is still only 46% of
throughput without binlog enabled.

By commenting out the acquire and release of prepare_commit_mutex,
moderate scaling is achieved to 48 threads at roughly 1390 TPS.  This
is a 250% improvement over the initial binlog throughput, and is
roughly 90% of the throughput without binlog.

This demonstrates the need to break up the serialization of
transactions that is currently enforced by the prepare_commit_mutex
lock.  The proposal to enabled/disable the lock with an option is
tempting, but I think will lead to problems.  Someone may enable the
feature unnecessarily and take a huge performance hit with no benefit,
or worse, disable it when they need it, breaking hot backups.  It also
leaves a very heavy cost for hot backups, even when configured
correctly, and the cost would be easily demonstrated by turning the
option on and off.

I currently plan to go back to the 5.0 source code to see how things
were done there, to see if this presents another option.

Assigning this as a feature request to Inaam. The bug "MySQL/InnoDB group commit is broken in 5.1" is essentially the same bug report.

I think I may have found a solution that eliminates contention on
the prepare_commit_mutex and restores the concurrent commit behavior
that was lost in the 4.1 to 5.0 upgrade (see Bug#13669).  I have a new
prototype that is inspired by the old 4.1 behavior, and it is showing
really great performance.  The change is to do a write to the redo
log without a flush, to secure a place in the redo log that is in
the same order as the binlog, then release the prepare_commit_mutex,
then flush the redo log.  This preserves binlog/redo log ordering
while reducing serialization of InnoDB transactions and allowing
for concurrent commits, and is similar to the old behavior from
MYSQL_LOG::write(Log_event* event_info) in mysql-4.1.22/sql/log.cc.

My testing on a multi-core system shows as much as 2.5X throughput
increase on a 48 thread, read/write, sysbench OLTP test.  At the same
time, I see a 75% reduction in disk writes per transaction due to
coalescing of concurrent commits.

I will upload a diff showing changes to
storage/innobase/handler/ha_innodb.cc from mysql-6.0.5-alpha-pb87.  This
diff includes the previously suggested changes to innobase_xa_prepare(),
as well as additional changes to innobase_commit().

Caveats:

Includes the same caveats as the patch submitted at [31 Jul 19:45]
as well as:

In this prototype, the initial call to innobase_commit_low()
with trx->flush_log_later == true is made without checking for
srv_commit_concurrency > 0 and without acquiring the commit_cond_m lock
or commit_cond condition variable.  This was done to avoid intermixing
acquire and release of prepare_commit_mutex and commit_cond_m, which
might lead to deadlock.  However, srv_commit_concurrency is checked
after prepare_commit_mutex has been released and before performing the
flush to disk.  If it is necessary to check srv_commit_concurrency
before even performing a buffered commit, the prototype could be
modified to perform the check before both the buffered commit and the
flush to disk.

Diff showing changes to storage/innobase/handler/ha_innodb.cc as described in the comment of [13 Aug 21:00]

Attachment: prepare_commit_mutex.patch3 (application/octet-stream, text), 4.80 KiB.

Duplicate of http://bugs.mysql.com/bug.php?id=13669

Please see Bug#13669 for follow up comments and questions about this issue and the proposed fix.

This is a serious performance regression from 4.1. Why was this changed to a feature request?