Bug #47899 CREATE TABLE...SELECT binlogged wrongly if binlog_format=row
Submitted: 7 Oct 2009 16:16 Modified: 30 Aug 2010 3:17
Reporter: Sven Sandberg Email Updates:
Status: In progress Impact on me:
None 
Category:MySQL Server: Row Based Replication ( RBR ) Severity:S2 (Serious)
Version:5.1+ OS:Any
Assigned to: Assigned Account
Tags: binlog, CREATE...SELECT
Triage: Triaged: D2 (Serious) / R3 (Medium) / E3 (Medium)

[7 Oct 2009 16:16] Sven Sandberg
Description:
When binlog_format=ROW, CREATE...SELECT is written as a CREATE statement without SELECT, followed by row events. This entire sequence of binlog events is wrapped in BEGIN...COMMIT. Like this:

  BEGIN
  CREATE TABLE t1 (a INT)
  row event
  row event
  row event
  COMMIT

However, CREATE table has an implicit commit. So the row events will be written outside of transactional context. The slave will have intermediate states that never existed on the master (namely, t1 exists with only some of the rows from the SELECT).

How to repeat:
--source include/have_binlog_format_row.inc
CREATE TABLE t1 (a INT);
INSERT INTO t1 VALUES (1);
CREATE TABLE t2 SELECT * FROM t1;
SHOW BINLOG EVENTS;

Suggested fix:
(1) Quick but incomplete fix: Write BEGIN after CREATE TABLE instead of before. Like this:

  CREATE TABLE t1 (a INT)
  BEGIN
  row event
  row event
  row event
  COMMIT

This will at least group the row events together, so it is slightly better. However, the slave will still have an intermediate state that never existed on the master (namely, t1 exists and is empty).

(2) Complete fix: One way that may work is like this. When logging CREATE TABLE that comes from CREATE TABLE...SELECT, set a special flag in the query_log_event. When the SQL thread sees the special flag, it does not execute the CREATE TABLE directly. Instead, it saves the CREATE TABLE and, for all row events up to the next COMMIT, parses the event and generates rows in the SELECT query. At COMMIT, execute the generated query. This isn't pretty, there may be better ways.
[13 Oct 2009 9:00] Mats Kindahl
The suggested approach will seriously hog memory for large sets of rows, so I would suggest the following scheme that will not require caching of rows.

As Sven suggests, add a special flag to the CREATE statement to see that it is part of a CREATE-SELECT. Keep the order with the BEGIN before the CREATE.

When the CREATE is executed on the slave, do the following:

- Since we have a flag, we can avoid committing the transaction before starting
  execution of the CREATE statement.

- Before releasing LOCK_open, lock the table for insert. This will prevent any
  other thread from being able to open the table for reading or writing in the
  brief moment between the CREATE statement and the following rows to insert.

When the row events now start to arrive the table will already be locked for insert, so this might require special handling. Otherwise, the rest of the procedure for executing the insert is as before.

This will not work when the binary log is dumped using mysqlbinlog unless the flag is added to the statement. However, I do not think this will be necessary.
[13 Oct 2009 10:30] Sven Sandberg
I agree, Mats suggestion seems more practical than the one I had. Just one comment: note that the rows may not have arrived to the relay log when the slave SQL thread reads the CREATE TABLE. If someone cuts the connection between master and slave after the IO thread has received CREATE TABLE and before all the rows have been received, the table stays locked indefinitely (or until some lock timeout).

Should we make the SQL thread wait for the IO thread to receive the COMMIT event? I think it's doable - the IO thread can store the relay log position of the last received COMMIT event in an integer variable, and the SQL thread can wait until the value of this variable is bigger than the relay log position of the CREATE TABLE.
[14 Oct 2009 15:24] Sven Sandberg
On second thought, please ignore my previous comment. Since the table is new on the slave, it's not problematic to have it locked. Also, this isn't different from any other applications of row events.
[31 Dec 2009 2:28] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/95830

3294 Alfranio Correia	2009-12-31
      BUG#47899 CREATE TABLE...SELECT binlogged wrongly if binlog_format=row
      
      Sketch of a patch.
[30 Jul 2010 10:44] Luis Soares
See also: BUG#55625.
[11 Aug 2010 11:09] Libing Song
See Bug#55877.
[11 Aug 2010 13:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/115496

3478 Li-Bing.Song@sun.com	2010-08-11
      Bug#47899 CREATE TABLE...SELECT binlogged wrongly if binlog_format=row
      
      When binlog_format=ROW, CREATE...SELECT is written as a CREATE statement without SELECT,
      followed by row events. This entire sequence of binlog events is wrapped in
      BEGIN...COMMIT. Like this:
        BEGIN
        row event
        row event
        CREATE TABLE t1 (a INT)
        row event
        row event
        row event
        COMMIT/ROLLBACK
      However, CREATE table has an implicit commit. So the row events will be
      written outside of transactional context.
      
      After this patch, on slave SQL thread, CREATE TABLE statement in a
      explicit transaction never ends the transaction automatically and
      also is binlogged in the transaction(same with master's binlog).
      As there is no 'CREATE TABLE' statement binlogged into a transaction
      except 'CREATE TABLE SELECT' statement.
[9 Nov 2010 19:44] Bugs System
Pushed into mysql-5.5 5.5.7-rc (revid:sunanda.menon@sun.com-20101109182959-otkxq8vo2dcd13la) (version source revid:jimmy.yang@oracle.com-20100804103744-vbpeghipkz6pyc9z) (merge vers: 5.1.51) (pib:21)
[13 Nov 2010 16:05] Bugs System
Pushed into mysql-trunk 5.6.99-m5 (revid:alexander.nozdrin@oracle.com-20101113155825-czmva9kg4n31anmu) (version source revid:jimmy.yang@oracle.com-20100804103744-vbpeghipkz6pyc9z) (merge vers: 5.1.51) (pib:21)
[13 Nov 2010 16:31] Bugs System
Pushed into mysql-next-mr (revid:alexander.nozdrin@oracle.com-20101113160336-atmtmfb3mzm4pz4i) (version source revid:jimmy.yang@oracle.com-20100804103744-vbpeghipkz6pyc9z) (pib:21)
[30 Apr 2014 11:36] Daniƫl van Eeden
It looks like this bug should have been closed?