Bug #36993 Falcon reports Index SCHEDULE..PRIMARY_KEY in SYSTEM.SCHEDULE damaged
Submitted: 26 May 2008 23:01 Modified: 15 May 16:12
Reporter: Philip Stoev
Status: Closed
Category:Server: Falcon Severity:S2 (Serious)
Version:6.0-falcon-team OS:Any
Assigned to: Vladislav Vaintroub Target Version:6.0-beta
Tags: F_STARTUP
Triage: Triaged: D2 (Serious)

[26 May 2008 23:01] Philip Stoev
Description:
1) I started sysbench against falcon with --falcon_gopher_threads=0
--falcon_scavenge_schedule='1 1 1 1 1' --falcon_checkpoint_schedule='1 1 1 1 1'.
2) When the serial log files reached about 10 MB in size, I killed the server.
3) Recovery was successful; however, the following message was written to the log:

Exception: Can't find field 0 of index SCHEDULE..PRIMARY_KEY of table SYSTEM.SCHEDULE

Index SCHEDULE..PRIMARY_KEY in SYSTEM.SCHEDULE damaged: Can't find field 0 of index
SCHEDULE..PRIMARY_KEY of table SYSTEM.SCHEDULE

How to repeat:
I will attach the tablespace prior to recovery.
[4 Jun 2008 18:44] Philip Stoev
Philip needs to reproduce this with falcon_gopher_threads > 0
[26 Jan 11:28] Philip Stoev
This continues to happen when Falcon is killed and restarted before it has been used
much.

I am increasing the triage values of this bug because the SCHEDULE table is nothing
special; the same corruption may happen on other Falcon system or user tables.
[20 Feb 12:30] Lars-Erik Bjørk
Related to recovery, Vlad's specialty
[11 Apr 0:13] Christopher Powers
Vlad Vaintroub:

Suspected cause: Kill -9 before system tables were completely created.
Suggested fix: Won't fix (good workaround)
Workaround: Delete all falcon spaces and serial logs.
[11 Apr 0:13] Christopher Powers
Philip Stoev:
Note that this is just an error printed in the log; the database continues to run.
Therefore "delete all falcon tablespaces" is not a good workaround because a person may
not even notice the problem, since it does not reveal itself in a crash. God knows what
else is also damaged.
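
Because the problem only shows up as a log message, one way to notice it is to scan the server log for it. The following is a minimal sketch, assuming the log line has the shape quoted in the description above; the function name and the sample lines are illustrative and not part of Falcon or MySQL.

```python
import re

# Pattern modelled on the message quoted in the bug description.
# Assumption: the real log writes this report on a single line.
DAMAGED_RE = re.compile(
    r"Index (?P<index>\S+) in (?P<table>\S+) damaged: (?P<detail>.*)"
)

def find_damaged_indexes(log_lines):
    """Return (index, table, detail) for every 'damaged' report found."""
    hits = []
    for line in log_lines:
        match = DAMAGED_RE.search(line)
        if match:
            hits.append(
                (match.group("index"),
                 match.group("table"),
                 match.group("detail").strip())
            )
    return hits

# Sample lines copied from the report; a real check would read the error log file.
sample_log = [
    "Exception: Can't find field 0 of index SCHEDULE..PRIMARY_KEY of table SYSTEM.SCHEDULE",
    "Index SCHEDULE..PRIMARY_KEY in SYSTEM.SCHEDULE damaged: Can't find field 0 of index",
]

for index, table, detail in find_damaged_indexes(sample_log):
    print(f"damaged index {index} on {table}: {detail}")
```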

Also, the kill -9 did not happen while the server was starting up. The server had already
started and databases and tables were created by the time the kill -9 arrived. Therefore,
it is not about "killing before system tables were completely created", it may be about
"killing before gophers applied all serial log events related to system tables".

So, this remains a valid bug for me. I do intend to test recovery systematically with
kill -9 immediately after server startup, so a decision and a solution must be
implemented for that one. Maybe the solution is to do extra checkpoints after creating
the system tables and waiting for the gophers to write everything to disk.
[11 Apr 0:19] Christopher Powers
Vlad: And what do you do if you kill before a checkpoint has run?

Philip: It appears to me that the current behavior is as follows:
  1. Falcon starts up, system tables are created in memory
  2. Server becomes available for connections
  3. Queries start arriving
  4. A scheduled checkpoint arrives, the gophers write the system tables
     to disk, etc.

If there is a crash in Step #3, you cannot use the workaround "delete tablespaces and
start from scratch", because you would lose the transactions that were issued by the
users. So, instead, maybe this will work:
  1. Falcon starts up
  2. System tables are created and flushed to disk, force two checkpoints,
     waits for gophers to complete, whatever is needed
  3. Server becomes available for connections
  4. Queries start arriving

This way, for crashes in Step #2, the workaround can be "delete tablespaces and start
from scratch". Crashes in Step #4 should recover properly without waivers.
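
The proposed ordering can be illustrated with a toy model. This is only a sketch of the idea (make the system tables durable first, then accept queries); the class, its methods, and the in-memory "disk" are hypothetical stand-ins and do not reflect Falcon's actual code.

```python
import threading

class Engine:
    """Toy model of the proposed startup order; not Falcon's implementation."""

    def __init__(self):
        self.ready = threading.Event()  # set once system tables are durable
        self.pending = []               # stands in for unapplied serial-log entries
        self.durable = []               # stands in for on-disk state

    def create_system_tables(self):
        # System tables exist only in memory / the serial log at this point.
        self.pending.append("SYSTEM.SCHEDULE")

    def checkpoint(self):
        # Stand-in for "force checkpoints and wait for the gophers".
        self.durable.extend(self.pending)
        self.pending.clear()

    def startup(self):
        self.create_system_tables()
        self.checkpoint()
        self.ready.set()                # only now accept user queries

    def execute(self, query):
        if not self.ready.is_set():
            raise RuntimeError("engine not ready: system tables not yet on disk")
        return f"ok: {query}"
```

With this ordering, a crash before `ready` is set can never lose user transactions, because none were accepted.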

Vlad: If Step #3 took < 30 seconds, I'd think "delete tablespaces and start from scratch"
is still a reasonable workaround. We are not talking about lost terabytes of user data,
are we?

Philip: I do not think a 30-second data loss is very acceptable :-)

If two consecutive forced checkpoints or some other (simple) trick will reduce the
window, then let's go for it. Note that by default mysqld is being automatically
restarted at every crash by the safe_mysqld script.

This means that a customer could easily rack up repeated restarts and recoveries without
even noticing.
[19 Apr 22:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72480

3129 Vladislav Vaintroub	2009-04-19
      Bug #36993 Falcon reports Index SCHEDULE..PRIMARY_KEY in SYSTEM.SCHEDULE damaged
      
      The problem here is that mysqld was killed before the database was completely
      created (i.e. before the whole data dictionary was written to disk). Falcon cannot
      handle such situations gracefully yet, and recovery after such a point is not
      guaranteed to succeed.
      
      The patch improves the situation a little by disabling user queries until the
      database is fully created and written to disk.
      
      Also, this patch introduces a clean Falcon shutdown: waiting for background threads
      to complete their work, followed by flushing the page cache. This eliminates the
      need for recovery after a clean shutdown.
[19 Apr 22:18] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72481

3129 Vladislav Vaintroub	2009-04-19
      Bug #36993 Falcon reports Index SCHEDULE..PRIMARY_KEY in SYSTEM.SCHEDULE damaged
      
      The problem here is that mysqld was killed before the database was completely
      created (i.e. before the whole data dictionary was written to disk). Falcon cannot
      handle such situations gracefully yet, and recovery after such a point is not
      guaranteed to succeed.
      
      The patch improves the situation a little by disabling user queries until the
      database is fully created and written to disk.
      
      Also, this patch introduces a clean Falcon shutdown: waiting for background threads
      to complete their work, followed by flushing the page cache. This eliminates the
      need for recovery after a clean shutdown.
[19 Apr 22:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/72482

3129 Vladislav Vaintroub	2009-04-19
      Bug #36993 Falcon reports Index SCHEDULE..PRIMARY_KEY in SYSTEM.SCHEDULE 
      damaged
      
      The problem here is that mysqld was killed before the database was completely
      created (i.e. before the whole data dictionary was written to disk). Falcon cannot
      handle such situations gracefully yet, and recovery after such a point is not
      guaranteed to succeed.
      
      The patch improves the situation a little by disabling user queries until the
      database is fully created and written to disk.
      
      Also, this patch introduces a clean Falcon shutdown: waiting for background threads
      to complete their work, followed by flushing the page cache. This eliminates the
      need for recovery after a clean shutdown.
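
The clean-shutdown order described in the commit message (wait for background threads, then flush the page cache) can be sketched as follows. This is an illustration under assumptions: `PageCache` and `clean_shutdown` are hypothetical names, and a dictionary stands in for the disk; the actual patch is in the Falcon C++ sources.

```python
import threading

class PageCache:
    """Hypothetical page cache: dirty pages live in memory until flushed."""

    def __init__(self):
        self.lock = threading.Lock()
        self.dirty = {}   # pages modified but not yet written
        self.disk = {}    # stands in for the tablespace on disk

    def write(self, page, data):
        with self.lock:
            self.dirty[page] = data

    def flush(self):
        with self.lock:
            self.disk.update(self.dirty)
            self.dirty.clear()

def clean_shutdown(workers, cache):
    """Wait for background threads first, then flush the page cache."""
    for worker in workers:
        worker.join()      # no thread can dirty a page after this loop
    cache.flush()
    return not cache.dirty  # True means nothing is left for recovery to redo

cache = PageCache()
workers = [
    threading.Thread(target=cache.write, args=(i, f"page-{i}")) for i in range(3)
]
for worker in workers:
    worker.start()
fully_flushed = clean_shutdown(workers, cache)
```

Joining the threads before flushing matters: flushing first would let a still-running thread dirty a page after the flush, leaving work for recovery.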
[23 Apr 9:22] Bugs System
Pushed into 6.0.11-alpha (revid:alik@sun.com-20090423071213-afmyrzvolemph7mz) (version
source revid:hky@sun.com-20090421195958-j33v1cuo3yer9niu) (merge vers: 6.0.11-alpha)
(pib:6)
[15 May 16:12] MC Brown
An entry has been added to the 6.0.11 changelog: 

Trying to recover Falcon tables after a crash when the corresponding tables and
tablespaces have not been created before the crash could cause a recovery failure.