Bug #36197 flush tables (or little table cache) can cause crash on slave
Submitted: 18 Apr 2008 8:34
Modified: 24 May 2008 17:11
Reporter: Jonas Oreland
Status: Closed
Category: MySQL Server: Row Based Replication (RBR)
Severity: S3 (Non-critical)
Version: 5.1
OS: Any
Assigned to: Mats Kindahl
CPU Architecture: Any
Triage: D1 (Critical) / R2 (Low) / E2 (Low)

[18 Apr 2008 8:34] Jonas Oreland
Description:
start replication
on slave: while true; do mysql -uroot test -e "flush tables" > /dev/null; done
run transactions on master
wait up to 1 minute, and watch slave crash in mysterious ways.

The problem is in the handling of ndb_apply_status,
and it can cause crashes all over due to incorrect memory access
(both reads and writes, it seems).

How to repeat:
.

Suggested fix:
.
[21 Apr 2008 12:46] Tomas Ulin
Run row-based replication using either ndb or innodb.

For innodb you need to use a multi-table update, e.g. update t1,t2 set t1.b=t1.b+1,t2.b=t2.b+1;

In parallel, run flush tables repeatedly, or set table_open_cache low to force open tables to be released.

The slave will eventually crash.
[21 Apr 2008 12:49] Tomas Ulin
Easy to reproduce using ndb

Slave:

while true ; do mysql -e "flush tables" > /dev/null ; done

Master:

mysql -e "create table t1 (a int key, b int) engine ndb; insert into t1 values (1,1); create table t2 (a int key, b int) engine ndb; insert into t2 values (1,1);"
while true ; do mysql -e "update t1 set b=b+1; update t2 set b=b+1;" ; done

Can easily be reproduced right away by inserting a sleep in the slave apply code:

int Table_map_log_event::do_apply_event(Relay_log_info const *rli)
...
+    sql_print_information("opening table %s in 10 seconds", table_list->alias);
+    sleep(10);

    if ((error= open_tables(thd, &tmp_table_list, &count, 0)))
    {
...
    }
    sql_print_information("opened table %s", table_list->alias);

Wait for the printout saying that one table has been opened,

then run "flush tables" on the slave, which will hang until all open tables have been released.

The slave will either crash or write very worrying warnings about corrupt links to mysql.err.
[6 May 2008 13:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46384

ChangeSet@1.2572, 2008-05-06 15:03:30+02:00, mats@mats-laptop.(none) +8 -0
  BUG#36197: flush tables (or little table cache) can cause crash on slave
  
  When flushing tables, there was a slight chance that the flush occurred
  between the processing of two table map events. Since the tables are opened
  one by one, some of the tables might no longer be valid, and
  subsequent locking of the tables would cause the slave to crash.
  
  The problem is solved by opening and locking all tables at once using
  simple_open_n_lock_tables(). The patch also changes open_tables()
  so that pre-locking only takes place when the trg_event_map is zero, which
  was not the case before (this caused the lock to be placed in thd->locked_tables
  instead of thd->lock, since the assumption was that triggers would be called
  later and therefore the tables should be pre-locked).
[9 May 2008 13:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46563

ChangeSet@1.2559, 2008-05-09 15:30:54+02:00, mats@mats-laptop.(none) +8 -0
  BUG#36197: flush tables (or little table cache) can cause crash on slave
  
  When flushing tables, there was a slight chance that the flush occurred
  between the processing of two table map events. Since the tables are opened
  one by one, some of the tables might no longer be valid, and
  subsequent locking of the tables would cause the slave to crash.
  
  The problem is solved by opening and locking all tables at once using
  simple_open_n_lock_tables(). The patch also changes open_tables()
  so that pre-locking only takes place when the trg_event_map is zero, which
  was not the case before (this caused the lock to be placed in thd->locked_tables
  instead of thd->lock, since the assumption was that triggers would be called
  later and therefore the tables should be pre-locked).
[12 May 2008 17:51] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/46633

ChangeSet@1.2559, 2008-05-12 19:50:53+02:00, mats@mats-laptop.(none) +8 -0
  BUG#36197: flush tables (or little table cache) can cause crash on slave
  
  When flushing tables, there was a slight chance that the flush occurred
  between the processing of two table map events. Since the tables are opened
  one by one, some of the tables might no longer be valid, and
  subsequent locking of the tables would cause the slave to crash.
  
  The problem is solved by opening and locking all tables at once using
  simple_open_n_lock_tables(). The patch also changes open_tables()
  so that pre-locking only takes place when the trg_event_map is not zero, which
  was not the case before (this caused the lock to be placed in thd->locked_tables
  instead of thd->lock, since the assumption was that triggers would be called
  later and therefore the tables should be pre-locked).
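
The race described in the commit message can be illustrated with a small, deterministic toy model. This is only a sketch under stated assumptions: ToyHandle, ToyTableCache, apply_one_by_one(), apply_all_at_once() and toy_self_check() are hypothetical names invented for illustration, not part of the MySQL source, and the model does not call simple_open_n_lock_tables() or any other server function. It assumes a flush simply invalidates every handle opened before it, and that locking a stale handle stands in for the crash.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Toy model only: none of these types exist in the MySQL source.
struct ToyHandle {
    std::string name;
    std::uint64_t version;  // cache version at the time the table was opened
};

class ToyTableCache {
public:
    // Open one table; the handle remembers the current cache version.
    ToyHandle open(const std::string& name) const { return {name, version_}; }

    // Open every table in a single step, so no flush can land in between.
    std::vector<ToyHandle> open_all(const std::vector<std::string>& names) const {
        std::vector<ToyHandle> handles;
        for (const auto& n : names) handles.push_back(open(n));
        return handles;
    }

    // Stands in for FLUSH TABLES: every handle opened earlier becomes stale.
    void flush() { ++version_; }

    // Locking succeeds only if every handle is still current.
    // A false result stands in for the slave crash.
    bool lock(const std::vector<ToyHandle>& handles) const {
        for (const auto& h : handles)
            if (h.version != version_) return false;
        return true;
    }

private:
    std::uint64_t version_ = 0;
};

// Behaviour before the fix: one open per table map event,
// so a flush may land between the two opens.
bool apply_one_by_one(ToyTableCache& cache, bool flush_between) {
    std::vector<ToyHandle> handles;
    handles.push_back(cache.open("t1"));
    if (flush_between) cache.flush();
    handles.push_back(cache.open("t2"));
    return cache.lock(handles);
}

// Behaviour after the fix: open and lock all tables in one step,
// so a flush can only happen before or after, never in between.
bool apply_all_at_once(ToyTableCache& cache, bool flush_before) {
    if (flush_before) cache.flush();
    return cache.lock(cache.open_all({"t1", "t2"}));
}

// Small self-check that exercises both code paths.
void toy_self_check() {
    ToyTableCache cache;
    assert(apply_one_by_one(cache, false));   // no flush: handles stay valid
    assert(!apply_one_by_one(cache, true));   // flush in between: t1 is stale
    assert(apply_all_at_once(cache, true));   // flush cannot split the opens
}
```

Calling toy_self_check() from a main() function runs the three cases. The only point of the model is that the window between the two opens disappears once opening and locking happen as one step; the locking and table cache machinery of the real server is considerably more involved.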
[12 May 2008 19:55] Bugs System
Pushed into 5.1.24-ndb-6.2.15
[12 May 2008 20:23] Bugs System
Pushed into 5.1.24-ndb-6.3.15
[15 May 2008 16:03] Omer Barnir
triage: As this is a generic issue raise the 'I' level to I2 and as a result to a P1.
[19 May 2008 5:52] Bugs System
Pushed into 5.1.23-ndb-6.4.0
[19 May 2008 8:25] Bugs System
Pushed into 5.1.25-rc
[19 May 2008 12:55] Jon Stephens
Please verify the version numbers for this fix in the telco-6.2 and telco-6.3 trees, since 6.2.15 was cloned off (AFAICT) before this push was made and 6.3.14 has (AFAIK) not yet been cloned off or released.

Shouldn't the 6.2 and 6.3 versions be 6.2.16 and 6.3.14?

Thanks!
[20 May 2008 3:50] Jon Stephens
Okay, I figured it out... 6.2.16 and 6.3.15 are the correct NDB versions for the fix.

Documented fix in the 5.1.24-ndb-6.2.16, 5.1.24-ndb-6.3.15, and 5.1.25 changelogs as follows:

        When flushing tables, there was a slight chance that the flush occurred
        between the processing of two table map events. Since the tables were opened
        one by one, subsequent locking of tables would cause the slave to crash.
        This problem was observed when replicating NDBCLUSTER or InnoDB tables,
        when executing multi-table updates, and when a trigger or a stored
        routine performed an (additional) insert on a table so that two tables
        were effectively being inserted into in the same statement.

Left in NDI status per Joro's request pending 6.0 merge.
[22 May 2008 9:50] Bugs System
Pushed into 6.0.6-alpha
[24 May 2008 17:11] Jon Stephens
Also documented in the 6.0.6 changelog; closed.
[28 Jul 2008 16:56] Bugs System
Pushed into 5.1.25-rc  (revid:sp1r-mats@mats-laptop.(none)-20080516125646-20320) (version source revid:sp1r-mats@mats-laptop.(none)-20080516125646-20320) (pib:3)