MySQL Bugs: #36197: flush tables (or little table cache) can cause crash on slave

Bug #36197	flush tables (or little table cache) can cause crash on slave
Submitted:	18 Apr 2008 8:34	Modified:	24 May 2008 17:11
Reporter:	Jonas Oreland
Status:	Closed
Category:	MySQL Server: Row Based Replication ( RBR )	Severity:	S3 (Non-critical)
Version:	5.1	OS:	Any
Assigned to:	Mats Kindahl	CPU Architecture:	Any

Description:
Start replication.
On the slave: while true; do mysql -uroot test -e "flush tables" > /dev/null; done
Run transactions on the master.
Wait up to 1 minute and watch the slave crash in mysterious ways.

The problem is in the handling of ndb_apply_status,
and it can cause crashes all over due to incorrect memory access
(both reads and writes, it seems).

How to repeat:
.

Suggested fix:
.

Run row-based replication using either ndb or innodb.

For innodb you need to use a multi-table update, e.g. update t1,t2 set t1.b=t1.b+1,t2.b=t2.b+1;

In parallel, run flush tables repeatedly, or set table_open_cache low to force open tables to be released.

The slave will eventually crash.

This is easy to reproduce using ndb.

Slave:

while true ; do mysql -e "flush tables" > /dev/null ; done

Master:

mysql -e "create table t1 (a int key, b int) engine ndb; insert into t1 values (1,1); create table t2 (a int key, b int) engine ndb; insert into t2 values (1,1);"
while true ; do mysql -e "update t1 set b=b+1; update t2 set b=b+1;" ; done

Can easily be reproduced right away by inserting a sleep in the slave apply code:

int Table_map_log_event::do_apply_event(Relay_log_info const *rli)
...
+    sql_print_information("opening table %s in 10 seconds", table_list->alias);
+    sleep(10);

    if ((error= open_tables(thd, &tmp_table_list, &count, 0)))
    {
...
    }
    sql_print_information("opened table %s", table_list->alias);

Wait for the printout saying that one table has been opened,

then run "flush tables" on the slave, which will hang until all open tables have been released.

The slave will either crash or print very worrying warnings about corrupt links to mysql.err.
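
To make the window concrete, here is a toy model of the race, not server code. ToyTableCache, ToyOpenTable, all_valid and open_one_by_one are hypothetical names invented for this sketch, and the single version counter is an assumed simplification of how a flush invalidates handles that are already open.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: these types are hypothetical and do not exist in the MySQL server.
struct ToyTableCache {
    std::uint64_t version = 1;  // bumped by every simulated FLUSH TABLES
    void flush() { ++version; }
};

struct ToyOpenTable {
    std::string name;
    std::uint64_t opened_at_version;  // cache version seen when the handle was opened
};

// A set of handles is usable only if no flush happened since each one was opened.
bool all_valid(const ToyTableCache& cache, const std::vector<ToyOpenTable>& tables) {
    for (const ToyOpenTable& t : tables) {
        if (t.opened_at_version != cache.version) return false;
    }
    return true;
}

// Opens the tables one at a time, the way one table map event per table would.
// If flush_between is true, a flush is simulated after each open except the last,
// which plays the role of the sleep(10) window in the snippet above.
std::vector<ToyOpenTable> open_one_by_one(ToyTableCache& cache,
                                          const std::vector<std::string>& names,
                                          bool flush_between) {
    assert(!names.empty());
    std::vector<ToyOpenTable> opened;
    for (std::size_t i = 0; i < names.size(); ++i) {
        opened.push_back(ToyOpenTable{names[i], cache.version});
        if (flush_between && i + 1 < names.size()) cache.flush();
    }
    return opened;
}
```

With two tables and flush_between set, the first handle carries a stale version by the time the second is opened, so all_valid() returns false. Without the interleaved flush it returns true.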

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46384

ChangeSet@1.2572, 2008-05-06 15:03:30+02:00, mats@mats-laptop.(none) +8 -0
  BUG#36197: flush tables (or little table cache) can cause crash on slave
  
  When flushing tables, there was a slight chance that the flush was occurring
  between processing of two table map events. Since the tables are opened
  one by one, it might result in that the tables were not valid and that
  subsequent locking of tables would cause the slave to crash.
  
  The problem is solved by opening and locking all tables at once using
  simple_open_n_lock_tables(). Also, the patch contains a change to open_tables()
  so that pre-locking only takes place when the trg_event_map is zero, which
  was not the case before (this caused the lock to be placed in thd->locked_tables
  instead of thd->lock since the assumption was that triggers would be called
  later and therefore the tables should be pre-locked).

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46563

ChangeSet@1.2559, 2008-05-09 15:30:54+02:00, mats@mats-laptop.(none) +8 -0
  BUG#36197: flush tables (or little table cache) can cause crash on slave
  
  When flushing tables, there was a slight chance that the flush was occurring
  between processing of two table map events. Since the tables are opened
  one by one, it might result in that the tables were not valid and that
  subsequent locking of tables would cause the slave to crash.
  
  The problem is solved by opening and locking all tables at once using
  simple_open_n_lock_tables(). Also, the patch contains a change to open_tables()
  so that pre-locking only takes place when the trg_event_map is zero, which
  was not the case before (this caused the lock to be placed in thd->locked_tables
  instead of thd->lock since the assumption was that triggers would be called
  later and therefore the tables should be pre-locked).

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46633

ChangeSet@1.2559, 2008-05-12 19:50:53+02:00, mats@mats-laptop.(none) +8 -0
  BUG#36197: flush tables (or little table cache) can cause crash on slave
  
  When flushing tables, there was a slight chance that the flush was occurring
  between processing of two table map events. Since the tables are opened
  one by one, it might result in that the tables were not valid and that
  subsequent locking of tables would cause the slave to crash.
  
  The problem is solved by opening and locking all tables at once using
  simple_open_n_lock_tables(). Also, the patch contains a change to open_tables()
  so that pre-locking only takes place when the trg_event_map is not zero, which
  was not the case before (this caused the lock to be placed in thd->locked_tables
  instead of thd->lock since the assumption was that triggers would be called
  later and therefore the tables should be pre-locked).
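
The pre-locking rule in this commit message can be sketched as a small decision function. This is a toy under stated assumptions, not the actual open_tables() logic: ToyTableRef, LockSlot, needs_prelocking and choose_lock_slot are hypothetical names, and treating trg_event_map as a per-table bitmap where non-zero means a trigger may fire is my reading of the message rather than something it states.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for a table reference; the real server structure is far richer.
struct ToyTableRef {
    std::string name;
    std::uint8_t trg_event_map;  // assumed: non-zero means a trigger event may fire
};

// Where the lock ends up in this model.
enum class LockSlot {
    Lock,         // models thd->lock
    LockedTables  // models thd->locked_tables (the pre-locked case)
};

// Pre-locking is wanted only if at least one table has a non-zero trg_event_map.
bool needs_prelocking(const std::vector<ToyTableRef>& tables) {
    for (const ToyTableRef& t : tables) {
        if (t.trg_event_map != 0) return true;
    }
    return false;
}

// Pre-locked statements place the lock in LockedTables; everything else uses Lock.
LockSlot choose_lock_slot(const std::vector<ToyTableRef>& tables) {
    assert(!tables.empty());
    return needs_prelocking(tables) ? LockSlot::LockedTables : LockSlot::Lock;
}
```

In this model, two tables that both have trg_event_map equal to 0 get LockSlot::Lock, which corresponds to the behaviour the patch describes for plain row events. Setting either map to a non-zero value selects LockSlot::LockedTables.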

Pushed into 5.1.24-ndb-6.2.15

Pushed into 5.1.24-ndb-6.3.15

triage: As this is a generic issue raise the 'I' level to I2 and as a result to a P1.

Pushed into 5.1.23-ndb-6.4.0

Pushed into 5.1.25-rc

Please verify the version numbers for this fix in the telco-6.2 and telco-6.3 trees, since 6.2.15 was cloned off (AFAICT) before this push was made and 6.3.14 has (AFAIK) not yet been cloned off or released.

Shouldn't the 6.2 and 6.3 versions be 6.2.16 and 6.3.14?

Thanks!

Okay, I figured it out... 6.2.16 and 6.3.15 are the correct NDB versions for the fix.

Documented fix in the 5.1.24-ndb-6.2.16, 5.1.24-ndb-6.3.15, and 5.1.25 changelogs as follows:

        When flushing tables, there was a slight chance that the flush occurred
        between the processing of two table map events. Since the tables were opened
        one by one, subsequent locking of tables would cause the slave to crash.
        This problem was observed when replicating NDBCLUSTER or InnoDB tables,
        when executing multi-table updates, and when a trigger or a stored
        routine performed an (additional) insert on a table so that two tables
        were effectively being inserted into in the same statement.

Left in NDI status per Joro's request pending 6.0 merge.

Pushed into 6.0.6-alpha

Also documented in the 6.0.6 changelog; closed.

Pushed into 5.1.25-rc  (revid:sp1r-mats@mats-laptop.(none)-20080516125646-20320) (version source revid:sp1r-mats@mats-laptop.(none)-20080516125646-20320) (pib:3)