Bug #10987 After NDBD failure, restart NDBD dies stating Unable to find restorable replica
Submitted: 31 May 2005 15:11 Modified: 10 Feb 2006 8:13
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 4.1, 5.0 OS: Linux (Linux)
Assigned to: Jonas Oreland CPU Architecture: Any

[31 May 2005 15:11] Jonathan Miller
Description:
After the node failure, I shut down the cluster and brought it back up, trying to get the failed node back. It died stating "Unable to find restorable replica".

Email trail:
-----------------------------------------------------------------------------
From: Tomas
That is a bug that occurs sometimes.  If you can find a reproducible testcase for this that would be fantastic!

Can you describe _exactly_ the steps you took to get to that situation?

Every detail you can provide is relevant...  shutting down nodes, the timing for that, in what order, what was running against the cluster... 
in what order did you bring the nodes up etc....

If you have the ndb filesystem still and logs... please save them...

T
-----------------------------------------------------------
From: Jonathan
1) I killed -9 the process I thought would be node2. And it was.
2) Logged into the ndb_mgm and did a show. Then an all status.
3) logged into master1 and did a use GOTOSLAVE;
4) SHOW TABLES;
5) select * from t1;
6) Logged out of master1;
7) Logged onto master2;
8) repeated steps 3, 4, and 5.
9) Went to slave and repeated steps 3, 4, and 5 for the slave.
10) Tried to kill a node on the slave. Killed one too many as it crashed. My fault not Jonas's.
11) Did a shutdown on the NDB_MGM for master cluster.
12) Shutdown Master1 and Master2 Mysqld
13) restarted ndb_mgmd
14) Restarted all four data nodes w/o --initial. Boom!!! Down it went.

Is there a defect # for this?

JBM
-----------------------------------------------------------------------
From: Tomas

Missing some details...

No transactions running?

11) You did a "regular shutdown" of the cluster?  And node2 was down already in step 1)?  How long between 1) and 11), do you reckon?

--------------------------------------------------------------------------------
From: Jonathan

No transaction running. Was going to try that tonight after I got through testing Mats' fix. 11) Yes and yes: regular shutdown, and node2 was already dead. Time between would be about 15 - 25 minutes. Doing some poking around through error logs and such.
JBM
---------------------------------------------------------------------------------------
From: Tomas

Were there _any_ transactions between 1) and 11)?
how long before 1) were there transactions running?

----------------------------------------------------------------------------------------
From: Jonathan

No, there was not any transaction running other than the ones in the list below. And I had been doing everything by hand before. I had just opened the defect about the drop and create on A1 not getting to B.

How to repeat:
I will be trying to get a reproducible test case for this issue this week.
[31 May 2005 15:25] Pekka Nousiainen
Could be dictionary corruption.
Compile printSchemaFile.cpp under .../dbdict and try:
printSchemaFile  ndb_*_fs/D1/DBDICT/P0.SchemaLog
[17 Jun 2005 7:03] Tomas Ulin
This is a known issue; it is already present in 4.1.

Assigning to Martin to decide what to do about it.
[17 Jun 2005 7:37] Martin Skold
Did you try printing the schema file as Pekka suggested?
[17 Jun 2005 12:34] Jonathan Miller
I had meant to close this one and one other that I cannot reproduce. If I get into this situation again, I will do as Pekka has suggested.
[31 Aug 2005 13:12] Jonas Oreland
GCI 345 Completed
LCP 17  Started
GCI 346 Started
GCI 346 Completed
LCP 17  Completed
LCP 18  Started
GCI 347 Started
LCP 18  Completed
~~~ CRASH ~~~~~~
Unable to restore GCI 346
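
For readers following the timeline above, here is a minimal sketch that replays it and records which GCIs started or completed while each LCP was running. It only restates the ordering of events; the function names are hypothetical, and the rule NDB actually uses to pick a restorable replica is not stated in this report, so nothing here should be read as the real restart logic.

```python
# Replay the GCI/LCP timeline from the comment above (events before the crash).
TIMELINE = """\
GCI 345 Completed
LCP 17  Started
GCI 346 Started
GCI 346 Completed
LCP 17  Completed
LCP 18  Started
GCI 347 Started
LCP 18  Completed
"""


def lcp_windows(timeline):
    """Map each completed LCP to the GCIs that started/completed while it ran.

    Hypothetical helper for illustration only; not part of NDB.
    """
    running = {}   # LCP id -> events seen so far while that LCP is in progress
    finished = {}  # LCP id -> events seen during the whole LCP
    for line in timeline.splitlines():
        kind, ident, event = line.split()
        ident = int(ident)
        if kind == "LCP":
            if event == "Started":
                running[ident] = {"gci_started": [], "gci_completed": []}
            else:
                finished[ident] = running.pop(ident)
        else:
            key = "gci_started" if event == "Started" else "gci_completed"
            for window in running.values():
                window[key].append(ident)
    return finished


def last_completed_gci(timeline):
    """Return the highest GCI that has a 'Completed' event in the timeline."""
    completed = [
        int(line.split()[1])
        for line in timeline.splitlines()
        if line.startswith("GCI") and line.endswith("Completed")
    ]
    return max(completed)


if __name__ == "__main__":
    for lcp, window in sorted(lcp_windows(TIMELINE).items()):
        print("LCP", lcp, window)
    print("last completed GCI:", last_completed_gci(TIMELINE))
```

Replayed this way, GCI 346 both starts and completes inside LCP 17, while LCP 18 overlaps only the start of GCI 347, which never completes before the crash. The last completed GCI is therefore 346, the one the restart reports it cannot restore.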
[2 Sep 2005 9:48] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/29232
[2 Sep 2005 10:03] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/internals/29235
[5 Sep 2005 4:48] Jonas Oreland
Pushed into 4.1.15 and 5.0.13
[8 Sep 2005 8:11] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.15 & 5.0.13 changelogs.
[25 Jan 2006 10:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1604
[29 Jan 2006 14:55] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1809
[29 Jan 2006 22:15] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1815
[30 Jan 2006 7:25] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1842
[30 Jan 2006 20:08] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1894
[31 Jan 2006 10:44] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1928
[2 Feb 2006 6:23] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/2045
[10 Feb 2006 8:39] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/2420
[16 Feb 2006 12:40] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/2725