Bug #16875 Using stale MySQLD FRM files can cause restored cluster to fail
Submitted: 28 Jan 2006 22:12 Modified: 22 May 2006 11:28
Reporter: Jonathan Miller
Status: Closed
Category:Server: Cluster Severity:S2 (Serious)
Version:4.1 -> OS:
Assigned to: Tomas Ulin Target Version:

[28 Jan 2006 22:12] Jonathan Miller
Description:
I restored a 6 Million row database using DD. I have 7 MySQLD processes that are attached
to the cluster. Through MySQLD I can see the tables and get counts off the tables, but as
soon as I start the test against the database the cluster goes down.

2006-01-28 21:46:42 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed.
Initiated by signal 0. Caused by error 2341: 'Internal program error (failed
ndbrequire)(Internal error, programming error or missing error message, please report a
bug). Temporary error, restart node'.
2006-01-28 21:46:43 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-01-28 21:46:43 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed.
Initiated by signal 0. Caused by error 2341: 'Internal program error (failed
ndbrequire)(Internal error, programming error or missing error message, please report a
bug). Temporary error, restart node'.

Time: Saturday 28 January 2006 - 21:46:41
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or
missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1273) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 28473
Trace: /space/run/ndb_3_trace.log.2

Time: Saturday 28 January 2006 - 21:46:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or
missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1273) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 30106
Trace: /space/run/ndb_2_trace.log.2
Version: Version 5.1.6 (alpha)
***EOM***

How to repeat:
Restore the saved file, then try to run a test against the cluster.
[28 Jan 2006 22:42] Jonas Oreland
I could not find any tracefiles...

BTW: Can you start using the ndb_error_reporter tool that Stewart wrote?
[29 Jan 2006 0:14] Jonathan Miller
I restored the database again, then went to each MySQLD, wiped its file system
clean, and recreated the TPCB for each of the 8 processes. The test then started and the
cluster has stayed up.

Before that, some mysqld instances worked w/o issues, but with others the cluster
would come down as soon as the test started.
[29 Jan 2006 0:33] Jonathan Miller
Every once in a while I would get:

ERROR 1412 (HY000): Table definition has changed, please retry transaction
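
The error text asks the client to retry the transaction. A minimal sketch of such a retry loop, assuming a client library that exposes the server error number on its exception; `DatabaseError` and `run_with_retry` are hypothetical names, not part of any MySQL connector:

```python
ER_TABLE_DEF_CHANGED = 1412  # error number taken from the message above


class DatabaseError(Exception):
    """Hypothetical stand-in for a client library's error type."""

    def __init__(self, errno, message):
        super().__init__(message)
        self.errno = errno


def run_with_retry(transaction, max_attempts=3):
    """Call transaction(), retrying only when error 1412 is raised."""
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except DatabaseError as exc:
            # Any other error, or running out of attempts, is re-raised.
            if exc.errno != ER_TABLE_DEF_CHANGED or attempt == max_attempts:
                raise
```

Retrying only helps when the error really is transient; it does not help when the mysqld is still holding table definitions from before a restore.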
[29 Jan 2006 2:29] Jonathan Miller
I just restored the DD Cluster database and totally recreated all the TPCB database files
for each MySQLD process. The test started w/o issue.
[30 Jan 2006 13:29] Jonathan Miller
What do you need feedback on?
[30 Jan 2006 13:33] Jonathan Miller
Sorry, did not see the question.

I think the way to reproduce this is to have several MySQLD instances, use them for a
while, restore a/the database, and attempt a transaction such as an insert. You will
get a temporary error and the cluster is gone.

If you remove all the files for the mysqld and recreate them before attaching to the
cluster with the restored database, then attach and create the new database, all is fine.
[31 Jan 2006 9:35] Jonas Oreland
Jeb,
when you say "restored", did you do an initial start before restoring?
[31 Jan 2006 12:57] Jonathan Miller
Tomas,

I will be moving to the 64 bit tests today, and will see if I can get it down to a set of
steps on my side.

Jonas,

Actually, I would do a rm -rf ndb_#_fs before attempting the restore. This ensured that
the ndb fs and the disk data and undo files were all removed before the restore, as
--initial does not remove disk data files.

Thanks
JBM
[31 Jan 2006 13:07] Jonas Oreland
OK, then this is a "known bug", also present in 4.1 and 5.0.
The problem is that the mysqld keeps a copy of a table object (tableid, tableversion),
and after an initial start/restore this table might not be the same one.
So mysqld sends data with a tableid/tableversion that ndb doesn't know is incorrect, which
  yields unpredictable results.

The solution is to close all ndb objects/ndb handlers on cluster failure
  and let mysqld retry instead.

Tomas suggested that we fix this in 5.1 but don't do it in 4.1 or 5.0.

The problem can only occur with an initial start/restore while the mysqlds are kept alive.
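
A toy model of the stale (tableid, tableversion) problem and of the proposed fix, as a sketch only. All class and method names here are hypothetical and do not correspond to NDB or mysqld internals. The toy data nodes detect the mismatch so that it is visible; per the description above, the real data nodes could not tell that the pair was wrong.

```python
class StaleTableError(Exception):
    """Raised when a cached (table id, table version) pair no longer matches."""


class DataNodes:
    """Toy stand-in for the data nodes; holds the authoritative dictionary."""

    def __init__(self):
        self.tables = {}   # name -> (table_id, table_version)
        self.next_id = 1

    def create_table(self, name):
        self.tables[name] = (self.next_id, 1)
        self.next_id += 1

    def initial_restart(self):
        # An initial start wipes the dictionary; a restore then recreates the
        # tables, possibly in a different order and so with different ids.
        self.tables = {}
        self.next_id = 1

    def lookup(self, name):
        return self.tables[name]

    def write(self, name, handle):
        # Toy-only check: reject a handle that does not match the dictionary.
        if self.tables.get(name) != handle:
            raise StaleTableError(name)
        return "written"


class SqlNode:
    """Toy stand-in for a mysqld that caches table handles."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}

    def handle_for(self, name):
        if name not in self.cache:
            self.cache[name] = self.cluster.lookup(name)
        return self.cache[name]

    def on_cluster_failure(self):
        # Proposed fix: drop every cached object so the next access re-fetches.
        self.cache.clear()

    def insert(self, name):
        return self.cluster.write(name, self.handle_for(name))
```

In this model, a `SqlNode` that cached table "b" before `initial_restart()` keeps sending the old pair afterwards; calling `on_cluster_failure()` first makes the next `insert("b")` look the table up again.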
[31 Jan 2006 13:14] Jonathan Miller
I am okay with not fixing in 4.1, but not totally sure why we would want to leave out 5.0.
But I am glad that you know what is causing the issues.
JBM
[2 Feb 2006 4:46] Stewart Smith
If we solved this by introducing a cluster unique id and sending it around when nodes join,
we could then solve the potential yucky situation where an (arguably dumb) administrator
starts swapping nodes between two different clusters.

Even if the option is to barf saying "trying to join a different cluster, aborting
connect" it would be better than now :)
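
A minimal sketch of the cluster-unique-id idea, assuming the id is generated once per cluster and remembered by each node after its first join. The names `Cluster`, `Node`, and `ClusterMismatch` are hypothetical and the id format (a random UUID) is an assumption; nothing here reflects an existing NDB protocol.

```python
import uuid


class ClusterMismatch(Exception):
    """Raised when a node tries to join a cluster other than the one it last saw."""


class Node:
    def __init__(self):
        self.known_cluster_id = None   # nothing remembered before the first join


class Cluster:
    def __init__(self):
        # Assumption: the id is generated once, when the cluster is first started.
        self.cluster_id = uuid.uuid4().hex

    def join(self, node):
        if node.known_cluster_id is None:
            node.known_cluster_id = self.cluster_id   # first join: remember the id
        elif node.known_cluster_id != self.cluster_id:
            raise ClusterMismatch(
                "trying to join a different cluster, aborting connect")
        return True
```

If a fresh id were also generated on every initial start, the same check would reject a mysqld that still holds state from before the restore, which is the case in this report.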
[15 May 2006 14:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6383
[16 May 2006 8:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6430
[16 May 2006 19:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6473
[17 May 2006 6:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6489
[22 May 2006 8:43] Tomas Ulin
pushed to 4.1.20, 5.0.22, 5.1.11
[22 May 2006 11:28] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.20/5.0.22/5.1.11 changelogs.

Documented DD limitation in 5.1 Manual Cluster Chapter DD section.