Bug #47899 CREATE TABLE...SELECT binlogged wrongly if binlog_format=row
Submitted: 7 Oct 18:16
Reporter: Sven Sandberg
Status: Verified
Category:Server: RBR Severity:S2 (Serious)
Version:5.1+ OS:Any
Assigned to: Alfranio Correia Target Version:5.1+
Tags: binlog, CREATE...SELECT
Triage: Triaged: D2 (Serious)

[7 Oct 18:16] Sven Sandberg
Description:
When binlog_format=ROW, CREATE...SELECT is written as a CREATE statement without SELECT,
followed by row events. This entire sequence of binlog events is wrapped in
BEGIN...COMMIT. Like this:

  BEGIN
  CREATE TABLE t1 (a INT)
  row event
  row event
  row event
  COMMIT

However, CREATE table has an implicit commit. So the row events will be written outside
of transactional context. The slave will have intermediate states that never existed on
the master (namely, t1 exists with only some of the rows from the SELECT).

How to repeat:
--source include/have_binlog_format_row.inc
CREATE TABLE t1 (a INT);
INSERT INTO t1 VALUES (1);
CREATE TABLE t2 SELECT * FROM t1;
SHOW BINLOG EVENTS;

Suggested fix:
(1) Quick but incomplete fix: Write BEGIN after CREATE TABLE instead of before. Like
this:

  CREATE TABLE t1 (a INT)
  BEGIN
  row event
  row event
  row event
  COMMIT

This will at least group the row events together, so it is slightly better. However, the
slave will still have an intermediate state that never existed on the master (namely, t1
exists and is empty).

(2) Complete fix: One way that may work is like this. When logging CREATE TABLE that
comes from CREATE TABLE...SELECT, set a special flag in the query_log_event. When the SQL
thread sees the special flag, it does not execute the CREATE TABLE directly. Instead, it
saves the CREATE TABLE and, for all row events up to the next COMMIT, parses the event
and generates rows in the SELECT query. At COMMIT, execute the generated query. This
isn't pretty, there may be better ways.
[13 Oct 11:00] Mats Kindahl
The suggested approach will seriously hog memory for large sets of rows, so I would
suggest the following scheme that will not require caching of rows.

As Sven suggests, add a special flag to the CREATE statement to see that it is part of a
CREATE-SELECT. Keep the order with the BEGIN before the CREATE.

When the CREATE is executed on the slave, do the following:

- Since we have a flag, we can avoid committing the transaction before starting
  execution of the CREATE statement.

- Before releasing LOCK_open, lock the table for insert. This will prevent any
  other thread from being able to open the table for reading or writing in the
  brief moment between the CREATE statement and the following rows to insert.

When the row events now start to arrive the table will already be locked for insert, so
this might require special handling. Otherwise, the rest of the procedure for executing
the insert is as before.

This will not work when the binary log is dumped using mysqlbinlog unless the flag is
added to the statement. However, I do not think this will be necessary.
[13 Oct 12:30] Sven Sandberg
I agree, Mats suggestion seems more practical than the one I had. Just one comment: note
that the rows may not have arrived to the relay log when the slave SQL thread reads the
CREATE TABLE. If someone cuts the connection between master and slave after the IO thread
has received CREATE TABLE and before all the rows have been received, the table stays
locked indefinitely (or until some lock timeout).

Should we make the SQL thread wait for the IO thread to receive the COMMIT event? I think
it's doable - the IO thread can store the relay log position of the last received COMMIT
event in an integer variable, and the SQL thread can wait until the value of this
variable is bigger than the relay log position of the CREATE TABLE.
[14 Oct 17:24] Sven Sandberg
On second thought, please ignore my previous comment. Since the table is new on the slave,
it's not problematic to have it locked. Also, this isn't different from any other
applications of row events.