Bug #13669 | Group commit is broken in 5.0 | ||
---|---|---|---|
Submitted: | 30 Sep 2005 19:18 | Modified: | 12 Aug 2009 17:03 |
Reporter: | Peter Zaitsev (Basic Quality Contributor) | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: InnoDB storage engine | Severity: | S4 (Feature request) |
Version: | 5.0 | OS: | Any (all) |
Assigned to: | Inaam Rana | CPU Architecture: | Any |
Tags: | bfsm_2007_06_21 performance, bfsm_2007_06_28, qc |
[30 Sep 2005 19:18]
Peter Zaitsev
[8 Oct 2005 11:44]
Heikki Tuuri
---
[9 Feb 2006 15:37]
Heikki Tuuri
Assigning this fundamental task to Osku.
[2 Oct 2006 16:56]
Heikki Tuuri
Hi! The fix is probably relatively easy: we need to move the slow part of the commit, that is, the InnoDB log flush to disk, to a 'second phase' where we no longer hold any global mutex oe lock. That is how it worked in 4.1. Though, I have not thought about the XA implications in detail. If XA flushes its own binlog file to disk without a group commit, then we would need to implement a group commit to the binlog, too, so that the XA algorithm would be fundamentally right. The reason for the current bottleneck in 5.0 is that the transaction serialization order in ib_logfiles and in the binlog must be the same, so that a point-in-time recovery from an InnoDB hot backup works. Regards, Heikki
[2 Oct 2006 17:26]
Sergei Golubchik
Unfortunately, you cannot flush the logs only in the second phase, it must happen in _every_ phase. When prepare() returns a success it is a guarantee that data was written to a persistent storage and can be recovered and committed or rolled back after a crash. For example: Innodb: prepare, no flush to disk binlog: log the transaction, flush to disk (explicit, or implicit by OS) --- crash on recovery InnoDB cannot commit the transaction, because it's not in the logs :(
[3 Oct 2006 10:22]
Osku Salerma
What I don't understand is why people not using XA have to pay for the inevitable performance losses that come from additional guarantees made due to XA. Surely if XA is not used, group commit could work as it did in 4.1. If XA is used and it needs additional syncs or whatever, then so be it, but that's a price that XA users have to pay.
[23 Oct 2006 10:54]
Mads Martin Joergensen
Notes from meeting with Heikki and others: No changes needed in MySQL, Heikki says. When he gets around to it, he'll find out for sure--it's a resource issue. If he hits something where MySQL needs to do something, he'll let us know.
[21 Jun 2007 10:22]
Kevin Burton
Agreed Osku.... This seems like a bug that really needs to get fixed.
[11 Sep 2007 14:01]
Heikki Tuuri
Changing severity to Feature Request.
[10 Oct 2008 3:59]
David Lutz
Please see Bug#38501 "hold times on prepare_commit_mutex limit read/write scalability" for a proposed patch that addresses group commit and substantially improves scalability for write workloads. Bug#38501 was determined to be a duplicate of this bug, so I will follow up here with additional comments and questions.
[10 Oct 2008 4:02]
David Lutz
I've done quite a bit of additional review of the MySQL/Innodb code to try to verify whether my prototype patch for bugs 38501 and 13669 maintains the required ordering between the binlog file and the Innodb logs. Comments in the code indicate that Innodb hot backup has a requirement for the ordering of calls to binlog_commit() and innobase_commit(). After the additional code review, I am curious whether this requirement still exists. Do we have a good description of the specific ordering requirements for Innodb hot backup and why they exist? For example, I suspect that one reason for the ordering requirement was that in MySQL 4.1, on startup with --innodb-safe-binlog, a check was performed to determine if any transactions had been committed to the binlog that were not also committed to Innodb. If any were found, MYSQL_LOG::cut_spurious_tail() truncated the binlog file to match Innodb. If ordering between the binlog and Innodb was not maintained, we could have the following: thread1> write to binlog; ... <context switch> thread2> write to binlog; commit <crash> On startup, the last committed transaction in Innodb would appear in the binlog file after an uncommitted transaction, and a simple truncation of the binlog file could not be used to remove that entry. This behavior has been completely changed in 5.x, and the --innodb-safe-binlog is no longer needed. This raises the question of whether the ordering requirement between the binlog file and Innodb logs still exists, and if so, whether it extends to the prepare phase. If I understand the 4.1 commit behavior correctly, ha_commit_trans() had no prepare phase and the binlog LOCK_log lock was used to protect updates to both the binlog file and the Innodb logs. mysql_bin_log.write() acquired LOCK_log, then wrote the transaction's binlog cache to the binlog file, wrapped in a BEGIN/COMMIT pair, then called ha_report_binlog_offset_and_commit() while still holding LOCK_log. This ultimately called trx_commit_off_kernel() which set undo->state to one of TRX_UNDO_CACHED, TRX_UNDO_TO_FREE, or TRX_UNDO_TO_PURGE, then called trx_sys_update_mysql_binlog_offset() and mtr_commit(). Prior to the call to trx_commit_off_kernel(), trx->flush_log_later was set to TRUE so the Innodb log would not be flushed to disk. Then trx->flush_log_later was set to FALSE, LOCK_log was released, and ha_commit_complete() was called. This ultimately called trx_commit_complete_for_mysql() which flushed the Innodb log to disk. The key in 4.1 is that the update to the Innodb logs was not flushed to disk until after LOCK_log was released. This allowed for concurrent commits and also reduced hold time and contention on LOCK_log, allowing for better scalability while still maintaining the write ordering that was required at that time. The XA transaction support that was added for 5.x introduced a prepare phase to ha_commit_trans(). From what I can see, the prepare phase for binlog does nothing, but the prepare phase for Innodb now acquires the prepare_commit_mutex lock and then calls trx_prepare_for_mysql(). This ultimately calls trx_prepare_off_kernel() which sets undo->state to TRX_UNDO_PREPARED and undo->xid to trx->xid, then calls mtr_commit(). The Innodb log is then flushed to disk and ha_commit_one_phase() is called, which then calls binlog_commit() and innobase_commit() to perform the commit phase. This phase looks essentially the same as the 4.1 commit phase, writing the binlog file then setting undo->state to one of the same three states mentioned for 4.1, then calling trx_sys_update_mysql_binlog_offset() and mtr_commit(). We then flush the Innodb log to disk while still holding the prepare_commit_mutex lock. The binlog LOCK_log is now only used to serialize updates to the binlog file, and the prepare_mutex_lock now enforces ordering between the binlog and the Innodb log. Unfortunately, this may include two synchronous disk flushes while holding the lock, one for trx_prepare_off_kernel() and one for trx_commit_off_kernel(). The lock prevents concurrent commits and limits scalability. The changes I introduced in my prototype patch include moving the acquisition of prepare_commit_mutex to a point after trx_prepare_off_kernel(), and restoring the steps in 4.x of setting and restoring trx->flush_log_later to avoid a disk flush in trx_commit_off_kernel(). This is then followed by the release of prepare_commit_mutex and a call to trx_commit_complete_for_mysql(), much like what was done in 4.x. I am fairly confident that the changes related to the commit phase maintain the ordering of binlog_commit() and innobase_commit() as was done in 4.x. As I already mentioned, I am curious whether that is till required, and whether it extends to the prepare phase. The 5.x XA support changed the behavior of database recovery on startup, and the call to MYSQL_LOG::cut_spurious_tail() is no longer used. My understanding is that on startup, MySQL calls TC_LOG_BINLOG::recover to scan the binlog for events with type XID_EVENT, then call the storage engine xarecover code to look for events that are still in a prepared state. MySQL then calls the storage engine commit_by_xid code to commit these transactions. When TC_LOG_BINLOG::recover is complete, MySQL calls the storage engine xarecover code a second time to look for transactions that are still in the prepared state, which means that they made it through the prepare phase but did not get committed to the binlog file. MySQL then calls the storage engine rollback_by_xid code to roll back these transactions. In essence, the binlog file appears to be the source of record in MySQL 5.x for transactions that must be committed to the database. If a transaction is prepared but not committed to the binlog, it will be rolled back during recovery. If it is prepared and committed to the binlog, but not committed to the storage engine, it will be committed during recovery. If it was prepared and committed to both the binlog and the storage engine, it will remain committed. Given all of the above, I have the following basic questions: Are there any fundamental flaws in my understanding of the 4.x and 5.1 commit and startup/recovery processing? What are the specific binlog/Innodb ordering requirements for Innodb hot backup and why do they exist? Are they still relevant given the changes to commit processing and startup recovery? If not, do we need the prepare_commit_mutex lock at all? If the ordering is still required, do the changes I have proposed satisfy the requirements, or do we need to maintain ordering with the prepare phase as well? Thanks!
[10 Oct 2008 22:39]
Peter Zaitsev
David, It is not only question of Innodb Hot Backup. Generally (assuming sync-binlog=1 is defined) we must have replication log in sync with transactional log. Otherwise in case of master crash slave can be potentially inconsistent, LVM backup may not work etc. It is great to see this bug is somethat fixed but it is very important to ensure fix is safe.
[11 Oct 2008 1:43]
David Lutz
Peter, Thanks for the feedback. I'm not sure yet if there is a replication related requirement for the ordering of the binlog and storage engine commits, but I don't currently think that it extends to the prepare phase. A prepare is invisible to other transactions, even on the master, until it is committed. Since the prepare is effectively invisible, I'm not sure why the order of prepares relative to commits would matter. Once the transaction is committed to the binlog file it becomes durable on the master and visible to the slaves. When the transaction is committed to the storage engine, it becomes visible to other transactions on the master. It may be necessary to maintain the order of binlog and storage engine commits to make sure they are applied in the same order on the master and on the slaves, but again I don't see an ordering requirement for the prepare. I believe that the current prototype patch maintains the ordering of binlog and innodb commits to the same degree that they were maintained in MySQL 4.x. It does not maintain the ordering of prepares and commits. Example A: (allowed by prototype) thread1> prepare; ... <context switch> thread2> prepare; write to binlog; commit; thread1> write to binlog; commit; Example B: (not allowed by prototype) thread1> ... write to binlog; ... <context switch> thread2> ... write to binlog; commit; thread1> commit; If Example A is allowed by both replication and hot backups, we have a potentially large performance gain with the current prototype patch. If Example A is not allowed by either or both replication or hot backups, we will need a more complicated solution than the current proposal. If Example A or Example B is not allowed by replication, that needs to be documented because it affects all transactional storage engines, not just Innodb. If Example A or Example B is not allowed by Innodb hot backups, it would be beneficial to have a better description of the reason for the constraint, so we can determine in the future if the constraint is lifted.
[11 Oct 2008 3:06]
Peter Zaitsev
David, Makes sense. Though MySQL 4.x did not have concept of prepare so it only maintained the full order. The challenge it seems to add there could be many transactions left in prepared state at the time of crash.. though I assume recovery will be smart enough to roll them back if there is no commit in the binary log file. This also should not cause the problems with LVM backup the backup I'm concerned about. I would suggest you getting quick feedback from Heikki and if Innodb Hot Backup does not support it make an option to disable this optimization. 99% of people out there do not use Innodb Hot Backup so there is no reason everybody must suffer.
[15 Oct 2008 19:24]
David Lutz
Peter - from what I can see, it appears that there can be an arbitrary number of prepared transactions at the time of a crash, and any that do not have a matching commit in the binary log will be rolled back. Inaam - can you comment on the dependencies in hot backup and ask Heikki to take a look at this if needed?
[15 Oct 2008 22:33]
Inaam Rana
David, We are presently focused on the stability of V5.1 and the next release of the InnoDB Plugin and do not have time right now to look at this. We cannot at this stage of the release of either product undertake tasks that are essentially about feature requests that most likely cannot be incorporated in 5.0 or 5.1, according to MySQL and Innobase product quality standards. Sorry.
[16 Oct 2008 20:52]
Peter Zaitsev
David, Do you have stable version of the patch right now ? We think it is important performance issue for number of users and we would be very happy to include it as an option to Percona MySQL patch set. I assume it should be possible to have an option to keep old safe logic for hot backup and other compatibility one may require.
[17 Oct 2008 20:19]
David Lutz
Peter, The current version of the patch is available as an attachment to Bug#38501. Note that the version pasted in as a comment early in the discussion is not the latest. I have discussed this patch with a number of people within MySQL, including some members of the MySQL replication team, but it has not gone through formal review or QA at this time. It is currently considered a prototype patch from the Performance Technologies team at Sun, which is where I work. If you decide to review and/or test the patch, please provide feedback here. Thanks
[7 Aug 2009 16:53]
Mark Callaghan
This is a serious performance regression from 4.1. Why was this changed to a feature request?
[12 Aug 2009 17:03]
Inaam Rana
Fixed in plugin 1.0.4. Documentation and source/binaries available at innodb.com
[7 Oct 2010 7:39]
Mark Callaghan
How has this been fixed in the 1.0.4 plugin? Group commit is not done for InnoDB when the binlog is enabled as prepare_commit_mutex is locked during the time at which the binlog write/flush/fsync is done. For proof that this is not done: 1) Mats has written blog articles on how to fix it 2) Kristian has done the same for MariaDB 3) Facebook patch has patches to fix it
[7 Oct 2010 13:10]
Mikhail Izioumtchenko
group commit as implemented in 1.0.4 only works when sync_binlog == 0. The case of sync_binlog == 1 is still outstanding.
[7 Oct 2010 14:11]
Mark Callaghan
If you think that is sufficient to claim it has been fixed then the docs should include that disclaimer and explain the implication of running in that mode -- no crash safety. I think that Postgres fans will have a field day with this.
[7 Oct 2010 14:27]
Mikhail Izioumtchenko
not sure how well the implications of sync_binlog == 0 are documented. It is I believe crash safe as far as mysqld crash is concerned, by virtue of OS file sync on process death. OS or hardware crash is a different matter. The history of why it was decided it was sufficient to cover only sync_binlog == 1 at 1.0.4 time I can't recall. I assume 'if you think' in the previous note referred to the organization, I was merely stating a fact. Either way I have no comments to share here on what I personally think, sorry. I'll leave the bug in the most capable hands of the bug owner.
[7 Oct 2010 14:41]
Calvin Sun
Mark - agree with you that the limitation should be clearly documented. Thanks for pointing it out! There have been a lot of discussion among Mats, Inaam, and others on how to fix the GC in the case of sync_binlog == 1. But it will take some time.
[7 Oct 2010 15:14]
Inaam Rana
Mark, The fact that this bug is reported as 'group commit is broken in 5.0' implies that there was a 'group commit' before XA was introduced. From what I have gathered from the code the only way to have group commit was when either binlogging was completely disabled or sync_binlog == 0. The feedback from Heikki was also that there are a lot of mysql installations with sync_binlog = 0 (which is also the default value and is probably rarely touched by many users). So for all such installations GC is back as a feature with plugin 1.0.4. For such installations (like the ones at FB) where sync_binlog = 1 more work is needed. I do agree that we should mention this in docs.
[7 Oct 2010 16:21]
Calvin Sun
Relate to http://bugs.mysql.com/bug.php?id=49326
[8 Oct 2010 2:49]
James Day
Also note that the choice is not only sync_binlog = 0 or sync_binlog = 1. Sync_binlog = 10 or 50 or more are values that can be and are used in a way somewhat similar to innodb_flush_log_at_trx_commit = 2, which is a popular setting.