Bug #26498 backup id not incremented when starting backup during restore
Submitted: 20 Feb 2007 15:39 Modified: 30 Mar 2008 0:31
Reporter: Bernd Ocklin Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.0.32 OS:Linux (Debian 4.0)
Assigned to: li zhou CPU Architecture:Any

[20 Feb 2007 15:39] Bernd Ocklin
Description:
If "start backup" is issued while a restore process runs it may overwrite the backup currently restored.

How to repeat:
ndb_mgm -e "start backup"
--> BACKUP-1 created

ndb_restore BACKUP-1

while this is ongoing do 

ndb_mgm -e "start backup" 

again. It will overwrite BACKUP-1.

Suggested fix:
Block backup when restore is ongoing or increment backup counter when performing backup during restore.
[20 Feb 2007 16:03] Bernd Ocklin
Ok, the main problem is probably that old backups in the backup dir are not recognized if a cluster is restarted. The backup counter will thus be reset and backups will be overwritten.
[21 Feb 2007 12:30] Hartmut Holzgraefe
verified, not sure how to handle this though, an initial restart will also reset the backup generation number to 1, and as ndb_restore is just another ndbapi client so the ndbds can't really know that a restore is just now happening

maybe ndb_restore and backup could both try to get exclusive locks on the files though?
[21 Feb 2007 14:01] Bernd Ocklin
The ideal implementation would just read the BACKUP directories on the nodes at node startup and report the highest number to the ndb_mgmd to adjust the backup counter.
 
This would just always avoid overwriting of backups in any situation.
[22 Mar 2007 17:59] Oli Sennhauser
1. Keep a counter in a file which is not deleted after initial
2. Find the max. number of BACKUP-nnn and use the next one.
3. Refuse to "--inital" if backup is already there or add flag --purge_backup with --inital (in combination with 2.)
[3 Apr 2007 8:01] Stewart Smith
I think I prefer option 2.
[3 Apr 2007 8:10] Stewart Smith
The backup ID is a counter in the system table in NDB (SYSTAB_0). I forget the magic number... it's a define somewhere in Backup.hpp (or at least it's referenced in Backup.cpp). You can even query it using ndb_select_all. IIRC these are maintained using routines in DBUTIL.

The process should be, when we get the value of the counter, check if the directory exists (using call to NDBFS). If it does, increment counter, check again. if not, update counter, proceed with starting backup.

I think we should just plan this for 5.1
[5 Apr 2007 7:18] Stewart Smith
I think we should only change this in 5.1.

The BACKUP protocol is a MASTER/SLAVE protocol. A master sets up the backup and tells the slaves what to do. On any node failure, backup is aborted.

The storage/ndb/src/kernel/blocks/backup/Backup.txt partially documents the backup protocol and communication between user, master and slaves.

The UTIL_SEQUENCE signal is used to get a sequence number (a row in SYSTAB_0... viewable with ndb_select_all if you're so inclined). We then send this number to the slaves (who open files to write to).

We should add to the slave code when it's going to open files (around the FSOPEN ignals) to abort the backup (by replying with DEFINE_BACKUP_REF) if any of the files already exist.

This will mean the end user gets an error "could not start backup: files already existed" (or similar).

The testBackup program that's in ndb/test tests backup in a variety of failure modes. This *must* continue to pass with this patch.

Thought should also be given to adding some error inserts for this new code... especially for situations where files only exist in one of the slaves, half slaves.

We should also add an extra option for 'start backup' so the user can 'start backup <id>' to try starting a backup with that id. This ID should be subject to the same checks - the master should just set the sequence number to this ID before continuing.

This will involve:
 - adding (optional) parameter to ndb_mgm
 - adding (optional) int parameter to mgm protocol
 - adding another mgm api start_backup call with this parameter
 - possibly changing the BACKUP_REQ signal
 - changing code in the BACKUP block to (if the backup id field is present in the signal) set the sequence number. An old format signal shouldn't be a problem for the data node (it should assume normal behaviour).
 - adding tests for the above

happy to field extra questions.... hope this helps
[10 Apr 2007 14:26] Stewart Smith
Problem is if we just abort if backup with that id exists is if there is 100 or so backups... 100 backups will have to be run before one is successfully created.

I think it's quite likely that people will just have ndb_mgm -e 'start backup' in cron... so this is potentially a lot of time without a backup.
[7 Jun 2007 10:02] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28278

ChangeSet@1.2491, 2007-06-07 17:51:55+00:00, lzhou@dev3-63.(none) +32 -0
  BUG#26498 implement "start backup n" and abort backup when backup files exist
[14 Aug 2007 3:41] Stewart Smith
(comments from Jun 18 mail):

Few comments/thoughts

- would be good to have some cleanup before doing full review, some
parts are a bit hard to read.
- would also be good to be against latest 5.1 tree... it doesn't cleanly
apply to mine :(
- error to mgmclient should be temporary error, not permanent
- does testBackup still pass okay?
- you should add some error inserts for testBackup testing NF
- Jonas (i think) recently added an option to FSOPENREQ to fail if file
exists... possibly good to use it instead of the STATREQ (this would
make the number of errors to inject for testbackup fewer.. as the NF
code around there should be fairly solid)
- ndb_mgm_start_backup should be a wrapper around
ndb_mgm_start_num_backup
- semantics of backupId for NO_WAIT for ndb_mgm_start_num_backup should
be documented
- FSSTATREQ seems a lot more complicated than needed.. minimise it to
exactly what is used (unless you use Jonas' parameter to FSOPENREQ in
which case STAT isn't needed).
- I'd make the string "Backup failed for the reason that backup files
aleady existed" to "Backup failed: file already exists (use 'START
BACKUP N')"

>      Some code maybe need to rewrite because i modify so much, and may 
> miss something.
>      Two things need to discuss about this patch are:
>               1) Max backup number.

- probably should be 2^32... but that's a different TCKEYREQ signal...
and I think Jonas will have to explain the magic there... let's get
everything else right first though.

>               2) Output to mgmclient.

mostly looks okay.
[4 Sep 2007 6:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/33607

ChangeSet@1.2548, 2007-09-04 14:22:46+00:00, lzhou@dev3-63.(none) +33 -0
  BUG#26498 Implement two function:
     1: Can start a backup using "Backup N".
     2: Backup process will stop when backup files exist
[4 Sep 2007 6:41] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/33608

ChangeSet@1.2548, 2007-09-04 14:27:42+00:00, lzhou@dev3-63.(none) +35 -0
  BUG#26498 Implement two function:
     1: Can start a backup using "Backup N".
     2: Backup process will stop when backup files exist
[13 Sep 2007 13:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34175

ChangeSet@1.2548, 2007-09-13 21:17:19+08:00, lzhou@zhl.ndb.mysql.com +17 -0
  BUG#26498 Implement two function:
     1: Can start a backup using "Backup N".
     2: Backup process will stop when backup files exist
[13 Sep 2007 13:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34179

ChangeSet@1.2548, 2007-09-13 21:39:49+08:00, lzhou@zhl.ndb.mysql.com +17 -0
  BUG#26498 Implement two function:
     1: Can start a backup using "Backup N".
     2: Backup process will stop when backup files exist
[14 Sep 2007 14:13] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/34277

ChangeSet@1.2548, 2007-09-14 22:12:18+08:00, lzhou@zhl.ndb.mysql.com +17 -0
  BUG#26498 Implement two function:
     1: Can start a backup using "Backup N".
     2: Backup process will stop when backup files exist
[12 Nov 2007 3:09] Stewart Smith
comments sent to list.
[13 Nov 2007 6:04] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/37631

ChangeSet@1.2548, 2007-11-13 13:45:42+00:00, lzhou@dev3-63.(none) +17 -0
  BUG#26498 Implement two function:
     1: Can start a backup using "Backup N".
     2: Backup process will stop when backup files exist
[20 Feb 2008 11:12] Stewart Smith
I think we should only push to telco though (6.3 or 6.4)
[3 Mar 2008 3:36] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/43292

ChangeSet@1.2560, 2008-03-03 11:12:37+00:00, lzhou@dev3-63.(none) +17 -0
  BUG#26498 Implement two function:
      1: Can start a backup using "Backup N".
      2: Backup process will stop when backup files exist.
[3 Mar 2008 6:52] li zhou
pushed into mysql-5.1-telco-6.3(5.1.23-ndb-6.3.11)
[30 Mar 2008 0:31] Jon Stephens
Documented in the 5.1.23-ndb-6.3.11 changelog as follows:

        If a START BACKUP command was issued while ndb_restore was running, the 
        backup being restored could be overwritten.