Bug #4124 Replication Optimization
Submitted: 14 Jun 2004 7:16 Modified: 21 Jun 2004 10:28
Reporter: Sergei Kulakov (Candidate Quality Contributor) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Replication Severity:S4 (Feature request)
Version:All OS:Any (All)
Assigned to: CPU Architecture:Any

[14 Jun 2004 7:16] Sergei Kulakov
Description:
binlogs: I studied binlogs in MySQL and have a couple of thoughts on them. 
1. Excessive format. Every query has an extra command SET TIMESTAMP=N to syncronize the time between the replicating client and the server. I don't think this is necessary with every command. In order to replicate timestamp and auto_increment values, queries like 

Insert Into T Values(Now(), ...) 

should be changed to 

Insert Into T Values(YYYYMMDDHHMMSS, ...)

and queries like 

Insert Into T Values(Null, ...)

should be changed to 

Insert Into T Values(N, ...)

where N is the inserted value. In other words, whenever possible, functions like Now() and special values like Null should be changed to the real values used for executing the queries. Although most updating queries use literal values, this also could be done for all functions, not for the special cases like these only, in order to make the binlog smaller and to decrease the slave load. If a query has a complicated expression resulting in a simple value, the expression should be changed for it. Though this is probably only possible for expressions not including table columns.

2. Transactions - there is an opinion in the internet (http://www.sitepoint.com/forums/showthread.php?s=&postid=109868) that MySQL can't provide safe transactions even with BDB and InnoDB tables as it uses operating system buffering for writing data to tables. In fact, this is not a problem, as in the case of a failure it is the binlog that is supposed to have all the data lost since the last moment the operating system flushed data to disk. One is supposed to apply the binlog to the database and get its final state it had at the moment of the failure. But this is only possible if the binlog itself is flushed after every query. If it isn't, then it won't contain the last queries either. The question is whether the binlog is flushed or not. Playing with MySQL 4.0.18 for Windows, I found it is not. I edit data in a table, open the binlog, see the query that changed the data, then press "Reset", and after the reboot I don't see either the new data or the query in the binlog. 

How to repeat:
See description

Suggested fix:
See description
[14 Jun 2004 10:08] Guilhem Bichot
Hi,

> 1. Excessive format. Every query has an extra command SET TIMESTAMP=N to
> syncronize the time between the replicating client and the server.

Not exactly. Every query is stored with 4 bytes which contain the timestamp value. mysqlbinlog converts those 4 bytes to a SET TIMESTAMP=N value.
By the way, the timestamp is, for every query, a vital information for many of our customers when then need to do point-in-time recovery.

> whenever possible,
> functions like Now() and special values like Null should be changed to the
> real values used for executing the queries. Although most updating queries
> use literal values, this also could be done for all functions, not for the
> special cases like these only, in order to make the binlog smaller and to
> decrease the slave load. If a query has a complicated expression resulting
> in a simple value, the expression should be changed for it. Though this is
> probably only possible for expressions not including table columns.

Suggestion makes sense, we call that row-level format. Whereas the current binary log format is query-level. With row-level format, we will store the exact values of each column. But we will also continue featuring query-level format, because it saves bandwidth in some cases.
And... substitution is not always possible, even if expressions don't include table columns. Like
INSERT INTO T SELECT UUID() FROM U;
if U has more than one row, this is not equivalent to a substituted query; you would need to write one substituted query per row.

> 2. Transactions - there is an opinion in the internet
> (http://www.sitepoint.com/forums/showthread.php?s=&postid=109868) that
> MySQL can't provide safe transactions even with BDB and InnoDB tables as
> it uses operating system buffering for writing data to tables. In fact,
> this is not a problem, as in the case of a failure it is the binlog that
> is supposed to have all the data lost since the last moment the operating
> system flushed data to disk.

No, it's not supposed to. InnoDB for example has a more robust failure handling than the binlog.
In MySQL 4.1 we are now adding a consistency feature between InnoDB and the binary log. So that when a transaction has been rolled back by InnoDB at crash recovery, it is automatically removed from the binary log.

> But
> this is only possible if the binlog itself is flushed after every query.

In MySQL 4.1.3 (to be released soon) we have added --sync-binlog=N which will cause the binlog to be fsync()ed every Nth query.
[14 Jun 2004 10:13] Guilhem Bichot
Complement to my answer on the timestamp: for an overview of how many bytes are used for what, you can check the internals.texi document which is included in our source code. Timestamp uses only 4 bytes; the string "SET TIMESTAMP" is not in the binary log; like "SET INSERT_ID" and some other SETs.
[16 Jun 2004 6:16] Sergei Kulakov
Thanks for the answer. Of course, timestamp is necessary for every query, no matter if SET TIMESTAMP=N is executed or not. As for the transactions topic, I just meant that binlog could be used not only for replication, but also for recovery after a failure. Even though its initial purpose is replication, it almost resolves this problem as well. At least, I know another DBMS which uses a kind of a journal, too for this very purpose. And there it flushes data to disk every time it is changed, but it does it not at files level, but at disk level, that is it does it very efficiently (writes sectors, not clusters). The MySQL binlog needs only one thing to serve this purpose too - it is flushing the log after every query. Then a server could be able to recover itself after a failure even without another slave.
[16 Jun 2004 15:40] Guilhem Bichot
Hi,
about your valid concern "binary log is not flushed", isn't it solved by what I already wrote:
"In MySQL 4.1.3 (to be released soon) we have added --sync-binlog=N
which will cause the binlog to be fsync()ed every Nth query."
--sync-binlog=1 ...
?
[18 Jun 2004 5:59] Sergei Kulakov
Sure it does and that's great. One now can use --sync-binlog=1 and some simple script that would determine, at MySQL startup (probably in mysqld_safe or what it is), that a failure took place and then automatically "replay" the binlog (its last part, the time is determined as a maximum time between OS flushes).
[21 Jun 2004 10:28] Guilhem Bichot
Hello,

Yes, I have a patch ready (will be into 4.1.3 or 4.1.4) where the binary log will always be consistent with the content of the InnoDB tables.
It will be slightly different from what you propose: instead of replaying what is in the binary log and not in the InnoDB tables (imagine there was a crash while a transaction was early in committing and so the transaction has been rolled back at restart), it will remove the transaction from the binary log. That means, the reference is the content of InnoDB, not the content of the binary log. What we call "the commit" is the commit into InnoDB; not the write to the binary log.
From a theoretical point of view, it will not be different from what you propose (which we consider also, for the future): as the crash occurred during COMMIT, the client hasn't received the ok, so he cannot expect that the transaction will be committed, neither can he expect that will be rolled back; it all depends on the millisecond where the crash happened. He can only check at restart. With your proposal, at restart he will see it has been committed; with what we are implementing he will see it has been rolled back. Both are correct from an ACID / standards point of view.

Regards,
Guilhem