MySQL Bugs: #13669: Group commit is broken in 5.0

Bug #13669	Group commit is broken in 5.0
Submitted:	30 Sep 2005 19:18	Modified:	12 Aug 2009 17:03
Reporter:	Peter Zaitsev (Basic Quality Contributor)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S4 (Feature request)
Version:	5.0	OS:	Any (all)
Assigned to:	Inaam Rana	CPU Architecture:	Any
Tags:	bfsm_2007_06_21 performance, bfsm_2007_06_28, qc

Description:
Group commit is broken in 5.0 after XA was enabled.    This means if binary log is enbabled (normal 
use case even without replication to be able to be recover from backup)  performance may fall significantly.     It is also serious regression compared to MySQL 4.1 

When it happens ? 

It happens if one uses many connections  at the same time using short transactions - so several of them commit at the same time.   It typically shows itself on lower end systems when no write back RAID cache presents.

One of big customers reported regression from 135 to 35 transactions due to this issue. 

How to repeat:
DBT2 or SysBench benchmarks illustrate it.

---

Assigning this fundamental task to Osku.

Hi!

The fix is probably relatively easy: we need to move the slow part of the commit, that is, the InnoDB log flush to disk, to a 'second phase' where we no longer hold any global mutex oe lock. That is how it worked in 4.1.

Though, I have not thought about the XA implications in detail. If XA flushes its own binlog file to disk without a group commit, then we would need to implement a group commit to the binlog, too, so that the XA algorithm would be fundamentally right.

The reason for the current bottleneck in 5.0 is that the transaction serialization order in ib_logfiles and in the binlog must be the same, so that a point-in-time recovery from an InnoDB hot backup works.

Regards,

Heikki

Unfortunately, you cannot flush the logs only in the second phase, it must happen in _every_ phase. When prepare() returns a success it is a guarantee that data was written to a persistent storage and can be recovered and committed or rolled back after a crash. For example:

Innodb: prepare, no flush to disk
binlog: log the transaction, flush to disk (explicit, or implicit by OS)
--- crash

on recovery InnoDB cannot commit the transaction, because it's not in the logs :(

What I don't understand is why people not using XA have to pay for the inevitable performance losses that come from additional guarantees made due to XA. Surely if XA is not used, group commit could work as it did in 4.1. If XA is used and it needs additional syncs or whatever, then so be it, but that's a price that XA users have to pay.

Notes from meeting with Heikki and others:

No changes needed in MySQL, Heikki says. When he gets around to it,
he'll find out for sure--it's a resource issue. If he hits something
where MySQL needs to do something, he'll let us know.

Agreed Osku.... 

This seems like a bug that really needs to get fixed.

Changing severity to Feature Request.

Please see Bug#38501 "hold times on prepare_commit_mutex limit read/write scalability" for a proposed patch that addresses group commit and substantially improves scalability for write workloads.

Bug#38501 was determined to be a duplicate of this bug, so I will follow up here with additional comments and questions.

I've done quite a bit of additional review of the MySQL/Innodb code to
try to verify whether my prototype patch for bugs 38501 and 13669
maintains the required ordering between the binlog file and the Innodb
logs.  Comments in the code indicate that Innodb hot backup has a
requirement for the ordering of calls to binlog_commit() and
innobase_commit().  After the additional code review, I am curious
whether this requirement still exists.

Do we have a good description of the specific ordering requirements for
Innodb hot backup and why they exist?  For example, I suspect that one
reason for the ordering requirement was that in MySQL 4.1, on startup
with --innodb-safe-binlog, a check was performed to determine if any
transactions had been committed to the binlog that were not also
committed to Innodb.  If any were found, MYSQL_LOG::cut_spurious_tail()
truncated the binlog file to match Innodb.  If ordering between the
binlog and Innodb was not maintained, we could have the following:

   thread1> write to binlog; ...
            <context switch>
   thread2> write to binlog; commit
            <crash>

On startup, the last committed transaction in Innodb would appear in the
binlog file after an uncommitted transaction, and a simple truncation of
the binlog file could not be used to remove that entry.  This behavior
has been completely changed in 5.x, and the --innodb-safe-binlog is no
longer needed.  This raises the question of whether the ordering
requirement between the binlog file and Innodb logs still exists, and if
so, whether it extends to the prepare phase.

If I understand the 4.1 commit behavior correctly, ha_commit_trans() had
no prepare phase and the binlog LOCK_log lock was used to protect
updates to both the binlog file and the Innodb logs.
mysql_bin_log.write() acquired LOCK_log, then wrote the transaction's
binlog cache to the binlog file, wrapped in a BEGIN/COMMIT pair, then
called ha_report_binlog_offset_and_commit() while still holding
LOCK_log.  This ultimately called trx_commit_off_kernel() which set
undo->state to one of TRX_UNDO_CACHED, TRX_UNDO_TO_FREE, or
TRX_UNDO_TO_PURGE, then called trx_sys_update_mysql_binlog_offset() and
mtr_commit().  Prior to the call to trx_commit_off_kernel(),
trx->flush_log_later was set to TRUE so the Innodb log would not be
flushed to disk.  Then trx->flush_log_later was set to FALSE, LOCK_log
was released, and ha_commit_complete() was called.  This ultimately
called trx_commit_complete_for_mysql() which flushed the Innodb log to
disk.

The key in 4.1 is that the update to the Innodb logs was not flushed to
disk until after LOCK_log was released.  This allowed for concurrent
commits and also reduced hold time and contention on LOCK_log, allowing
for better scalability while still maintaining the write ordering that
was required at that time.

The XA transaction support that was added for 5.x introduced a prepare
phase to ha_commit_trans().  From what I can see, the prepare phase for
binlog does nothing, but the prepare phase for Innodb now acquires the
prepare_commit_mutex lock and then calls trx_prepare_for_mysql().  This
ultimately calls trx_prepare_off_kernel() which sets undo->state to
TRX_UNDO_PREPARED and undo->xid to trx->xid, then calls mtr_commit().
The Innodb log is then flushed to disk and ha_commit_one_phase() is
called, which then calls binlog_commit() and innobase_commit() to
perform the commit phase.  This phase looks essentially the same as the
4.1 commit phase, writing the binlog file then setting undo->state to
one of the same three states mentioned for 4.1, then calling
trx_sys_update_mysql_binlog_offset() and mtr_commit().  We then flush
the Innodb log to disk while still holding the prepare_commit_mutex
lock.  The binlog LOCK_log is now only used to serialize updates to the
binlog file, and the prepare_mutex_lock now enforces ordering between
the binlog and the Innodb log.  Unfortunately, this may include two
synchronous disk flushes while holding the lock, one for
trx_prepare_off_kernel() and one for trx_commit_off_kernel().  The lock
prevents concurrent commits and limits scalability.

The changes I introduced in my prototype patch include moving the
acquisition of prepare_commit_mutex to a point after
trx_prepare_off_kernel(), and restoring the steps in 4.x of setting and
restoring trx->flush_log_later to avoid a disk flush in
trx_commit_off_kernel(). This is then followed by the release of
prepare_commit_mutex and a call to trx_commit_complete_for_mysql(), much
like what was done in 4.x.

I am fairly confident that the changes related to the commit phase
maintain the ordering of binlog_commit() and innobase_commit() as was
done in 4.x.  As I already mentioned, I am curious whether that is till
required, and whether it extends to the prepare phase.

The 5.x XA support changed the behavior of database recovery on startup,
and the call to MYSQL_LOG::cut_spurious_tail() is no longer used.  My
understanding is that on startup, MySQL calls TC_LOG_BINLOG::recover to
scan the binlog for events with type XID_EVENT, then call the storage
engine xarecover code to look for events that are still in a prepared
state.  MySQL then calls the storage engine commit_by_xid code to commit
these transactions.  When TC_LOG_BINLOG::recover is complete, MySQL
calls the storage engine xarecover code a second time to look for
transactions that are still in the prepared state, which means that they
made it through the prepare phase but did not get committed to the
binlog file.  MySQL then calls the storage engine rollback_by_xid code
to roll back these transactions.

In essence, the binlog file appears to be the source of record in MySQL
5.x for transactions that must be committed to the database.  If a
transaction is prepared but not committed to the binlog, it will be
rolled back during recovery.  If it is prepared and committed to the
binlog, but not committed to the storage engine, it will be committed
during recovery.  If it was prepared and committed to both the binlog
and the storage engine, it will remain committed.

Given all of the above, I have the following basic questions:

Are there any fundamental flaws in my understanding of the 4.x and
5.1 commit and startup/recovery processing?

What are the specific binlog/Innodb ordering requirements for Innodb hot
backup and why do they exist?

Are they still relevant given the changes to commit processing and
startup recovery?

If not, do we need the prepare_commit_mutex lock at all?

If the ordering is still required, do the changes I have proposed
satisfy the requirements, or do we need to maintain ordering with the
prepare phase as well?

Thanks!

David,

It is not only question of Innodb Hot Backup.

Generally (assuming sync-binlog=1 is defined) we must  have replication log in sync with transactional log.  Otherwise in case of master crash slave can be potentially inconsistent,  LVM backup may not work etc.

It is great to see this bug is somethat fixed but it is very important to ensure fix is safe.

Peter,

Thanks for the feedback.  I'm not sure yet if there is a replication
related requirement for the ordering of the binlog and storage engine
commits, but I don't currently think that it extends to the prepare
phase.

A prepare is invisible to other transactions, even on the master, until
it is committed.  Since the prepare is effectively invisible, I'm not
sure why the order of prepares relative to commits would matter.  Once
the transaction is committed to the binlog file it becomes durable on
the master and visible to the slaves.  When the transaction is committed
to the storage engine, it becomes visible to other transactions on the
master.  It may be necessary to maintain the order of binlog and storage
engine commits to make sure they are applied in the same order on the
master and on the slaves, but again I don't see an ordering requirement
for the prepare.

I believe that the current prototype patch maintains the ordering of
binlog and innodb commits to the same degree that they were maintained
in MySQL 4.x.  It does not maintain the ordering of prepares and
commits.

Example A: (allowed by prototype)

   thread1> prepare; ...
            <context switch>
   thread2> prepare; write to binlog; commit;
   thread1>          write to binlog; commit;

Example B: (not allowed by prototype)

   thread1> ... write to binlog; ...
            <context switch>
   thread2> ... write to binlog; commit;
   thread1>                      commit;

If Example A is allowed by both replication and hot backups, we have
a potentially large performance gain with the current prototype patch.

If Example A is not allowed by either or both replication or hot
backups, we will need a more complicated solution than the current
proposal.

If Example A or Example B is not allowed by replication, that needs to
be documented because it affects all transactional storage engines, not
just Innodb.

If Example A or Example B is not allowed by Innodb hot backups, it would
be beneficial to have a better description of the reason for the
constraint, so we can determine in the future if the constraint is
lifted.

David, 

Makes sense. Though MySQL 4.x did not have concept of prepare so it only maintained the full order.

The challenge it seems to add there could be many transactions left in prepared state at the time of crash.. though I assume recovery will be smart enough to roll them back if there is no commit in the binary log file.

This also should not cause the problems with LVM backup the backup I'm concerned about.

I would suggest you getting quick feedback from Heikki and if Innodb Hot Backup does not support it make an option to disable this optimization.

99% of people out there do not use Innodb Hot Backup so there is no reason everybody must suffer.

Peter - from what I can see, it appears that there can be an arbitrary number of prepared transactions at the time of a crash, and any that do not have a matching commit in the binary log will be rolled back.

Inaam - can you comment on the dependencies in hot backup and ask Heikki to take a look at this if needed?

David,

We are presently focused on the stability of V5.1 and the next release of the InnoDB Plugin and do not have time right now to look at this.  We cannot at this stage of the release of either product  undertake tasks that are essentially about feature requests that most likely cannot be incorporated in 5.0 or 5.1, according to MySQL and Innobase product quality standards.

Sorry.

David,

Do you have stable version of the patch right now ?
We think it is important performance issue for number of users and we would be very happy to include it as an option to Percona MySQL patch set. 

I assume it should be possible to have an option to keep old safe logic for hot backup and other compatibility one may require.

Peter,

The current version of the patch is available as an attachment to Bug#38501.  Note that the version pasted in as a comment early in the discussion is not the latest.

I have discussed this patch with a number of people within MySQL, including some members of the MySQL replication team, but it has not gone through formal review or QA at this time.  It is currently considered a prototype patch from the Performance Technologies team at Sun, which is where I work.

If you decide to review and/or test the patch, please provide feedback here.

Thanks

This is a serious performance regression from 4.1. Why was this changed to a feature request?

Fixed in plugin 1.0.4.

Documentation and source/binaries available at innodb.com

How has this been fixed in the 1.0.4 plugin? Group commit is not done for InnoDB when the binlog is enabled as prepare_commit_mutex is locked during the time at which the binlog write/flush/fsync is done.

For proof that this is not done:
1) Mats has written blog articles on how to fix it
2) Kristian has done the same for MariaDB
3) Facebook patch has patches to fix it

group commit as implemented in 1.0.4 only works when sync_binlog == 0.
The case of sync_binlog == 1 is still outstanding.

If you think that is sufficient to claim it has been fixed then the docs should include that disclaimer and explain the implication of running in that mode -- no crash safety.

I think that Postgres fans will have a field day with this.

not sure how well the implications of sync_binlog == 0 are documented.
It is I believe crash safe as far as mysqld crash is concerned, by virtue
of OS file sync on process death. OS or hardware crash is a different matter.
The history of why it was decided it was sufficient to cover only 
sync_binlog == 1 at 1.0.4 time I can't recall.
I assume 'if you think' in the previous note referred to the organization,
I was merely stating a fact.
Either way I have no comments to share here on what I personally think, sorry.
I'll leave the bug in the most capable hands of the bug owner.

Mark - agree with you that the limitation should be clearly documented. Thanks for pointing it out!

There have been a lot of discussion among Mats, Inaam, and others on how to fix the GC in the case of sync_binlog == 1. But it will take some time.

Mark,

The fact that this bug is reported as 'group commit is broken in 5.0' implies that there was a 'group commit' before XA was introduced. From what I have gathered from the code the only way to have group commit was when either binlogging was completely disabled or sync_binlog == 0. The feedback from Heikki was also that there are a lot of mysql installations with sync_binlog = 0 (which is also the default value and is probably rarely touched by many users). So for all such installations GC is back as a feature with plugin 1.0.4.

For such installations (like the ones at FB) where sync_binlog = 1 more work is needed.

I do agree that we should mention this in docs.

Relate to http://bugs.mysql.com/bug.php?id=49326

Also note that the choice is not only sync_binlog = 0 or sync_binlog = 1. Sync_binlog = 10 or 50 or more are values that can be and are used in a way somewhat similar to innodb_flush_log_at_trx_commit = 2, which is a popular setting.