Bug #36578 Maria: maria-recover may fail to autorepair a table
Submitted: 7 May 2008 20:17 Modified: 29 Jun 2008 12:08
Reporter: Guilhem Bichot Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Maria storage engine Severity:S3 (Non-critical)
Version:5.1-maria OS:Any
Assigned to: Michael Widenius CPU Architecture:Any

[7 May 2008 20:17] Guilhem Bichot
Description:
Have not checked 5.0, please do.
Assume a crash of mysqld, having the effect that a MyISAM table has a positive open count and a corrupted data file.
Then after a restart, SELECT on the table will fail to autorepair the table: it will just print an error telling about corruption, but not manage to repair it. While an explicit REPAIR TABLE will repair it ok.

Reason is that in ha_myisam::check_and_repair(), there is a marked_crashed variable, whose logic is flawed:
  if ((marked_crashed= mi_is_crashed(file)) || check(thd, &check_opt))
  {
    sql_print_warning("Recovering table: '%s'",table->s->path.str);
    check_opt.flags=
      ((myisam_recover_options & HA_RECOVER_BACKUP ? T_BACKUP_DATA : 0) |
       (marked_crashed                             ? 0 : T_QUICK) |
       (myisam_recover_options & HA_RECOVER_FORCE  ? 0 : T_SAFE_REPAIR) |
       T_AUTO_REPAIR);
    if (repair(thd, &check_opt))
      error=1;
  }
when the top if() starts, table is not yet marked crashed (system only knows it has a positive open count, it has not checked the data file), so marked_crashed is false. check() finds the problem in the data file and returns true.
check_opt.flags gets T_QUICK, which cannot correct the data file, hence the repair failure (and an explicit REPAIR TABLE QUICK fails the same way).

How to repeat:
It is not obvious to forge a corrupted data file which will make the repair fail. In fact this bug was noticed in Maria when using Maria's specific rows-in-pages format. But the logics in Maria and in MyISAM are the same (marked_crashed variable) and we agree they both have to be fixed.

For these tests, always run your mysql command-line client with -A (so
that it does not open your tables behind the scenes).

Start mysqld with --myisam-recover.
create table t(a int) checksum=1;
insert into t values(1);

system killall -9 mysqld;
edit t.MAD and put some garbage which will make a quick repair fail (and I don't know how to do this! all my attempts failed; but of course this can happen in real life, if power goes away at the right moment).

restart mysqld
select * from t; # error telling about corruption
select * from t; # error telling about corruption: no autorepair was done!

Suggested fix:
Monty and I agree that the new code should just be:
  if (mi_is_crashed(file) || check(thd, &check_opt))
  {
    sql_print_warning("Recovering table: '%s'",table->s->path.str);
    check_opt.flags=
      ((myisam_recover_options & HA_RECOVER_BACKUP ? T_BACKUP_DATA : 0) |
       (myisam_recover_options & HA_RECOVER_FORCE  ? 0 : T_SAFE_REPAIR) |
       T_AUTO_REPAIR);
    if (repair(thd, &check_opt))
      error=1;
  }
i.e. remove the marked_crashed variable, never use T_QUICK.
[9 May 2008 9:01] Guilhem Bichot
I understand why I could not make it fail: ha_myisam::repair(), if failing with T_QUICK, is automatically retried without T_QUICK, so succeeds in the end.
Problem seems to be Maria-specific, where there is no retry.
I'm assigning this bug to me because I need to debug further before knowing what to fix.
[9 May 2008 9:05] Guilhem Bichot
One problem in Maria is that:
if I do:
create table t(a int, b varchar(100)) engine=maria transactional=0;
insert into t values(1, "lkjlkjlkjghgvhfhgfhgfhgf");
insert into t select * from t;
insert into t select * from t;
insert into t select * from t;
delete from t limit 3;
system killall -9 mysqld;
then edit t.MAD and modify one byte in the bitmap page, then restart mysqld and do
set global maria_recover=normal; select * from t;
(only in my tree, in vanilla tree, maria-recover is disabled),
then it just prints
ERROR 1034 (HY000): Incorrect key file for table 't'; try to repair it
This message is generated by the T_QUICK repair (maria_crashed is false), and there is no retry.
The corruption is actually seen as error 175 (file too short: data file indeed is shorter than what its bitmap page says); maybe error 175 on the data file should set retry_repair and T_RETRY_WITHOUT_QUICK??
Need to ask Monty.
[9 May 2008 9:43] Guilhem Bichot
Another problem, in ha_myisam::repair():
  while ((error=repair(thd,param,0)) && param.retry_repair)
  {
    param.retry_repair=0;
    if (test_all_bits(param.testflag,
		      (uint) (T_RETRY_WITHOUT_QUICK | T_QUICK)))
    {
      param.testflag&= ~T_RETRY_WITHOUT_QUICK;
      sql_print_information("Retrying repair of: '%s' without quick",
                            table->s->path.str);
      continue;
    }
    param.testflag&= ~T_QUICK;
isn't it strange that we "continue;" before removing T_QUICK?
[9 May 2008 9:48] Guilhem Bichot
In a previous post I wrote about 174 and file too short, but what I get is actually 175 (bad crc), in
maria_chk_data_link/pagecache_read(bitmap page 0) (ma_check.c:1760)
Question is: should we ask for a retry in this case, and also if 175 (file too short)?
[13 May 2008 8:23] Guilhem Bichot
Do we need a
param->testflag|=T_RETRY_WITHOUT_QUICK;
if _ma_scan_block_record() fails because "file too short" or "bad page crc" during repair?
[13 May 2008 20:09] Guilhem Bichot
This if() in ha_myisam.cc:
  while ((error=repair(thd,param,0)) && param.retry_repair)
  {
    param.retry_repair=0;
    if (test_all_bits(param.testflag,
		      (uint) (T_RETRY_WITHOUT_QUICK | T_QUICK)))
    {
      param.testflag&= ~T_RETRY_WITHOUT_QUICK;
      sql_print_information("Retrying repair of: '%s' without quick",
                            table->s->path.str);
      continue;
    }
is never true, because T_RETRY_WITHOUT_QUICK is set by some functions called by mi_repair*() (like sort_get_next_record()), but that is cancelled by this save-restore-testflag logic in 
ha_myisam::repair(THD *thd, HA_CHECK &param, bool do_optimize) :

    ulonglong testflag= param.testflag;
    if (mi_test_if_sort_rep(file,file->state->records,key_map,0) &&
	(local_testflag & T_REP_BY_SORT))
    {
      local_testflag|= T_STATISTICS;
      param.testflag|= T_STATISTICS;		// We get this for free
      statistics_done=1;
      if (thd->variables.myisam_repair_threads>1)
      {
        char buf[40];
        /* TODO: respect myisam_repair_threads variable */
        my_snprintf(buf, 40, "Repair with %d threads", my_count_bits(key_map));
        thd_proc_info(thd, buf);
        error = mi_repair_parallel(&param, file, fixed_name,
                                   test(param.testflag & T_QUICK));
        thd_proc_info(thd, "Repair done"); // to reset proc_info, as
                                      // it was pointing to local buffer
      }
      else
      {
        thd_proc_info(thd, "Repair by sorting");
        error = mi_repair_by_sort(&param, file, fixed_name,
                                  test(param.testflag & T_QUICK));
      }
    }
    else
    {
      thd_proc_info(thd, "Repair with keycache");
      param.testflag &= ~T_REP_BY_SORT;
      error=  mi_repair(&param, file, fixed_name,
			test(param.testflag & T_QUICK));
    }
    param.testflag=testflag;
see: param.testflag loses T_RETRY_WITHOUT_QUICK due to the line above.
This is maybe why the fact that:
    if (test_all_bits(param.testflag,
		      (uint) (T_RETRY_WITHOUT_QUICK | T_QUICK)))
    {
      param.testflag&= ~T_RETRY_WITHOUT_QUICK;
      sql_print_information("Retrying repair of: '%s' without quick",
                            table->s->path.str);
      continue;
    }
does not turn down T_QUICK, is not a bug in real life (the real-life bug is rather this T_RETRY_WITHOUT_QUICK which is never on).

My impression from testing MyISAM and Maria is that in MyISAM, a broken MYD will be fixed by REPAIR TABLE QUICK (due to an auto retry), but not in Maria (I guess param.retry_repair should be turned on, as well as T_RETRY_WITHOUT_QUICK).
[10 Jun 2008 11:21] Guilhem Bichot
To observe the failing auto-repair, just run
storage/maria/ma_test_force_start.pl -d
Unlike storage/maria/ma_test_force_start.pl -i, it will fail to auto-repair the table.
[12 Jun 2008 14:40] Guilhem Bichot
Assigning to Monty because I don't understand all of "case BLOCK_RECORD" in ma_check.c:sort_get_next_record().
[29 Jun 2008 12:08] Michael Widenius
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html
[30 Jun 2008 18:01] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:02] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:03] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:05] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:07] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:09] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:11] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:12] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:14] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:15] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:16] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:18] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:18] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:29] Bugs System
Pushed into 5.1.25-maria-alpha
[30 Jun 2008 18:37] Bugs System
Pushed into 5.1.25-maria-alpha