| Bug #58787 | InnoDB enforcing order of DDL and DML in binary log? | ||
|---|---|---|---|
| Submitted: | 7 Dec 2010 15:18 | Modified: | 18 Dec 2010 13:50 | 
| Reporter: | Matthew Lord | Email Updates: | |
| Status: | Verified | Impact on me: | |
| Category: | MySQL Server: Locking | Severity: | S1 (Critical) | 
| Version: | OS: | Any | |
| Assigned to: | CPU Architecture: | Any | |
   [7 Dec 2010 15:18]
   Matthew Lord        
  
 
   [7 Dec 2010 17:01]
   Inaam Rana        
  (11:55:10 AM) inaam1: Heikki: question about xa and DDL (11:55:27 AM) Heikki: Please ask (11:55:42 AM) inaam1: when running a DDL we don't do innobase_xa_prepare() (11:56:10 AM) Heikki: Hmm... I think we should do (11:56:14 AM) inaam1: how do we then ensure ordering of commit between binlog and redo (or does it not matter in case of DDL) (11:56:20 AM) inaam1: http://bugs.mysql.com/bug.php?id=58787 (11:56:38 AM) Heikki: It does matter, of course (11:57:09 AM) Heikki: Since MySQL commits the DDL quickly, the window for errors is short (11:57:47 AM) inaam1: but it is a bug neverthless (11:57:48 AM) Heikki: It should enforce the ordering (11:58:19 AM) inaam1: hmm...so that is something that needs to be fixed in mysql code (11:58:48 AM) Heikki: Yes, I think MySQL should issue the PREPARE (12:00:05 PM) inaam1: ok. I'll ask Matt to assign the bug to mysql team that deals with xa trxs
   [20 Dec 2010 8:53]
   Mats Kindahl        
  I agree with Heikki that this is something that needs to be fixed at the server level not at the storage engine level.
   [7 Jun 2011 19:32]
   Dmitry Lenev        
  Hello! Further investigation has shown that we can view ALTER TABLE as consisting of two phases: During the first phase we open table being altered, create temporary table representing new version of the table and copy data to this table from the old version. This phase ends with enabling back transactions (which were disabled for duration of copying) which also does commit. Both stacktraces which you see in the original report represent exactly this commit. The first call to ha_trans_commit() is for statement transaction and the second is for the global transaction. The latter in this case is probably empty due to autocommit mode being on. Note that since nothing is written to binary log during this phase there is only one engine participating in this transaction - InnoDB (hence rw_ha_count == 1). Because of this there is no need for two-phase commit and server takes a shortcut by performing one phase commit. Since nothing is written to the binary log during this phase/in this transaction there should be no problem with ordering of binary-log/ redo-log records for other transactions. After all the situation is no different from use-case when row-based binary logging is used and we are committing transaction which has changed only temporary tables. During the second phase we swap new and old versions of the table, drop old version and, finally, write ALTER TABLE statement to the binary log. Swapping of old and new versions of table is achieved by two consecutive calls to ha_innobase::rename_table() method. Each of these calls creates its own transaction on InnoDB data dictionary which is committed within the same call. Similar thing happens in ha_innobase::delete_table() call which drops the old version of table. Finally ALTER TABLE is written to the binary log but directly and not through transactional cache. Indeed, the fact that changes to data-dictionary and write to binary log are non-atomic can cause problems. For example, crash between update to data-dictionary and write to binary log can create discrepancy between database state and binary log. But MySQL's DDL is known not to be crash-safe. So there is nothing really new here (also I think that this bug is not about this issue). Another scenario in which problem due to this non-atomicity can arise (and in which order of binary and redo logs is important) is when someone takes a backup using something like innodbhotbackup concurrently to a DDL statement, and then later uses this backup + binary log for Point-in-Time-Recovery or for slave bootstrapping. I have discussed possible issues with Backup Team and came to the following conclusions: 1) MySQL Enterprise Backup's mysqlbackup tool should not be affected by this problem in default mode. In this mode it tries to create a validity point while holding FLUSH TABLES WITH READ LOCK lock, so no concurrent DDL is possible at the time when critical step of backup is done. The same should be true for 3rd-party tools/scripts which acquire FTWRL for creation of backup's validity point. 2) MySQL Enterprise Backup's ibbackup tool, mysqlbackup tool in --innodb-only mode and possible 3rd-party tools that don't acquire FTWRL might be affected by this issue (although probability of bad things happenning is fairly small). Please note that MEB's manual clearly lists inability to guarantee correct backups while concurrent DDL is running as a known limitation and situation to avoid. See: http://dev.mysql.com/doc/mysql-enterprise-backup/3.5/en/bugs.backup.html So the latter is known and documented issue. Since proper fix for this problem is likely to require changing of our DDL statements to be performed and logged within one transaction and might involve transition from .FRM-files to real in-table data-dictionary it will, in my opinion, take significant amount of time to implement and can't be done in stable versions of server.

