Bug #17788 LCP should start on out of Redo, ndb_restore should retry more
Submitted: 28 Feb 2006 15:57 Modified: 11 Sep 2009 7:33
Reporter: Johan Andersson
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S3 (Non-critical)
Version: mysql-4.1 OS: Any (*)
Assigned to: CPU Architecture: Any
Tags: 4.1->
Triage: Triaged: D3 (Medium)

[28 Feb 2006 15:57] Johan Andersson
Description:
ndb_restore does not adapt to load and can easily cause redo-log-related errors.
ndb_restore should take the tuple size into account and adapt its parallelism so that it does not push too much data and overflow either the redo buffers or the send buffers.

Also, when restoring, it would be nice if ndb_restore did not terminate with "aborted".

How to repeat:
N/A

Suggested fix:
Adaptiveness
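
One possible reading of "adaptiveness" is sketched below: cap the number of parallel transactions so that the data in flight stays within a buffer budget that depends on the tuple size. This is only an illustration and not ndb_restore code; the function name `adapt_parallelism`, its parameters, and the idea of a fixed byte budget are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of ndb_restore): choose how many
 * transactions to run in parallel so that the bytes in flight stay
 * within buffer_budget_bytes.
 *
 *   tuple_bytes      size of one row
 *   rows_per_txn     rows written by one transaction
 *   max_parallelism  upper bound that is never exceeded
 *
 * The result is always between 1 and max_parallelism.
 */
static unsigned adapt_parallelism(size_t buffer_budget_bytes,
                                  size_t tuple_bytes,
                                  unsigned rows_per_txn,
                                  unsigned max_parallelism)
{
    size_t bytes_per_txn;
    size_t fit;

    assert(tuple_bytes > 0 && rows_per_txn > 0 && max_parallelism > 0);

    bytes_per_txn = tuple_bytes * rows_per_txn;
    fit = buffer_budget_bytes / bytes_per_txn;

    if (fit < 1)
        fit = 1;                 /* always make some progress */
    if (fit > max_parallelism)
        fit = max_parallelism;   /* never exceed the configured cap */

    return (unsigned)fit;
}
```

With an assumed budget of 1 MiB, 32 rows per transaction and a cap of 128, small 100-byte tuples keep the full parallelism of 128, while 8000-byte tuples reduce it to 4.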
[22 Jan 2007 12:14] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/18534

ChangeSet@1.2539, 2007-01-22 20:11:07+08:00, gni@dev3-221.dev.cn.tlan +7 -0
  BUG#17788 ndb_restore has more 'adaptive' functions. When the 410 temporary error occurs,
  it will immediately send a start-LCP signal.
[24 Jan 2007 10:25] Guangbao Ni
How to reproduce it:
1. Create a table and insert records into it.
   The table must be larger than the redo log. (For example, with the default values of TimeBetweenLocalCheckpoints and NoOfFragmentLogFiles, the table must be larger than 800M.)
   Alternatively, set a large value for TimeBetweenLocalCheckpoints and a small value for NoOfFragmentLogFiles; a smaller table is then sufficient.
2. Start a backup in ndb_mgm.
3. Restart the NDB cluster with the --initial option.
4. Run ndb_restore with the backup data.
   During the ndb_restore run, you will get the error message
[29 Mar 2007 7:44] Stewart Smith
What's the status with this bug?

Last conversation I can find is at the end of January (and I think we had some IRC discussion too). Basically saying that we should be able to trigger the start LCP from kernel on error instead... with me not liking the use of the dump interface here.

current status?
[24 Apr 2007 11:20] Guangbao Ni
Hi Jonas,
   Stewart suggests that I should either define a new signal to start an LCP (and, if I use the NDB API, add a new interface to the Ndb class), or trigger the start of the LCP from the kernel on error instead. Whichever method I adopt, new signals must be defined.
   What do you think? Please give your suggestion.

   Thanks!

/Guangbao Ni
[27 Jun 2007 4:00] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/29661

ChangeSet@1.2473, 2007-06-27 09:36:59+08:00, gni@dev3-221.dev.cn.tlan +9 -0
  BUG#17788 ndb_restore is too static in its behavior.
[6 Jul 2007 4:21] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/30420

ChangeSet@1.2473, 2007-07-06 11:49:16+08:00, gni@dev3-221.dev.cn.tlan +9 -0
  BUG#17788 ndb_restore is too static in its behavior.
[19 Jul 2007 5:14] Stewart Smith
Looks okay to me.

Since Jonas is on vacation, Pekka - can you have a quick look too?

I think we should only apply this to 5.1 though.

I would still like a test case.
[27 Jul 2007 5:09] Stewart Smith
Please also check what ndb_restore does in the temporary error situation. Does it retry for long enough, or does it give up at some point? If it gives up, this is a problem with a large LCP.
[27 Jul 2007 10:06] Guangbao Ni
Hi Stewart,
  Before the fix, it aborts after 10 retries of the same transaction.
  The patch solves this problem by making ndb_restore recover by itself from the temporary error.
[6 Aug 2007 2:02] Stewart Smith
I think we should continue to retry (not limit it to 10). Naturally displaying some kind of warning though.
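
The behaviour suggested here (keep retrying on a temporary error, but warn periodically) could look roughly like the sketch below. It is not taken from the patch or from ndb_restore; the result codes, the callback type and the names `retry_until_done` and `fail_n_times` are hypothetical and exist only for this example.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical outcome of one attempt at a restore transaction. */
enum attempt_result {
    ATTEMPT_OK,
    ATTEMPT_TEMPORARY_ERROR,
    ATTEMPT_PERMANENT_ERROR
};

/* Hypothetical callback that performs one attempt. */
typedef enum attempt_result (*attempt_fn)(void *ctx);

/*
 * Retry without an upper limit while the error is temporary.
 * A warning is printed every warn_every failed attempts.
 * Returns the number of attempts on success, or -1 on a permanent error.
 */
static long retry_until_done(attempt_fn attempt, void *ctx, long warn_every)
{
    long attempts = 0;

    assert(attempt != NULL && warn_every > 0);

    for (;;) {
        enum attempt_result r = attempt(ctx);
        attempts++;

        if (r == ATTEMPT_OK)
            return attempts;
        if (r == ATTEMPT_PERMANENT_ERROR)
            return -1;

        if (attempts % warn_every == 0)
            fprintf(stderr,
                    "warning: temporary error, %ld attempts so far, "
                    "still retrying\n", attempts);
        /* A real tool would sleep or back off here before retrying. */
    }
}

/* Demo callback: reports a temporary error *ctx times, then succeeds. */
static enum attempt_result fail_n_times(void *ctx)
{
    int *remaining = ctx;

    if (*remaining > 0) {
        (*remaining)--;
        return ATTEMPT_TEMPORARY_ERROR;
    }
    return ATTEMPT_OK;
}
```

Only a permanent error ends the loop early, so a long LCP delays the restore instead of aborting it.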
[14 Aug 2007 3:37] Stewart Smith
Setting back to In Progress as still something to be done.
[15 Aug 2007 8:26] Guangbao Ni
Hi Stewart,
   If a test case wants to insert an error into the ndbd kernel, should it use NdbTamper() (NDB API) and the NDB_TAMPER signal?
   Should the test case be put in the ndb/test/ndbapi directory?
[17 Aug 2007 1:44] Stewart Smith
there's an mgmapi function to do it:

  /**
   * Provoke an error.
   *
   * @param handle the NDB management handle.
   * @param nodeId the node id.
   * @param errorCode the errorCode.
   * @param reply the reply message.
   * @return 0 if successful or an error code.
   */
  int ndb_mgm_insert_error(NdbMgmHandle handle,
                           int nodeId, 
                           int errorCode,
                           struct ndb_mgm_reply* reply);
[6 Sep 2007 1:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/33774

ChangeSet@1.2473, 2007-09-06 09:36:12+08:00, gni@dev3-221.dev.cn.tlan +11 -0
  BUG#17788 LCP should start on out of Redo, ndb_restore should retry more.
[12 Nov 2007 2:55] Stewart Smith
I think the test program is missing from the patch.

also, this makes retries==100, not "infinite" in restore.
[14 Nov 2007 2:38] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/37720

ChangeSet@1.2473, 2007-11-14 10:24:52+08:00, gni@dev3-221.dev.cn.tlan +12 -0
  BUG#17788 LCP should start on out of Redo, ndb_restore should retry more.