Bug #16875 Using stale MySQLD FRM files can cause restored cluster to fail
Submitted: 28 Jan 2006 21:12 Modified: 22 May 2006 9:28
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: 4.1 -> OS:
Assigned to: Tomas Ulin

[28 Jan 2006 21:12] Jonathan Miller
Description:
I restored a 6-million-row database using Disk Data (DD). I have 7 MySQLD processes attached to the cluster. Through MySQLD I can see the tables and get row counts from them, but as soon as I start the test against the database, the cluster goes down.

2006-01-28 21:46:42 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2006-01-28 21:46:43 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-01-28 21:46:43 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Saturday 28 January 2006 - 21:46:41
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1273) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 28473
Trace: /space/run/ndb_3_trace.log.2

Time: Saturday 28 January 2006 - 21:46:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1273) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 30106
Trace: /space/run/ndb_2_trace.log.2
Version: Version 5.1.6 (alpha)
***EOM***

How to repeat:
Restore a saved backup and try to run a test against the cluster.
[28 Jan 2006 21:42] Jonas Oreland
I could not find any tracefiles...

BTW: Can you start using the ndb_error_reporter tool that Stewart wrote?
[28 Jan 2006 23:14] Jonathan Miller
I restored the database again, then went to each MySQLD, wiped the file system clean, and recreated the TPCB database for each of the 8 processes. The test then started and the cluster has stayed up.

Before that, I could use some mysqld instances w/o issues, but with others the cluster would come down as soon as the test started.
[28 Jan 2006 23:33] Jonathan Miller
Every once in a while I would get:

ERROR 1412 (HY000): Table definition has changed, please retry transaction
[29 Jan 2006 1:29] Jonathan Miller
I just restored the DD Cluster database and totally recreated all the TPCB database files for each MySQLD process. Test started w/o issue.
[30 Jan 2006 12:29] Jonathan Miller
What do you need feedback on?
[30 Jan 2006 12:33] Jonathan Miller
Sorry, did not see the question.

I think the way to reproduce this is to have several MySQLD instances, use them for a while, restore the database, and attempt a transaction such as an insert. You will get a temporary error and the cluster is gone.

If you remove all the files for the mysqld and recreate them before attaching to the cluster with the restored database, then attach and create the new database, all is fine.
[31 Jan 2006 8:35] Jonas Oreland
Jeb,
when you say "restored", did you do an initial start before restoring?
[31 Jan 2006 11:57] Jonathan Miller
Tomas,

I will be moving to the 64 bit tests today, and will see if I can get it down to a set of steps on my side.

Jonas,

Actually, I would do an rm -rf ndb_#_fs before attempting the restore. This ensured that the ndb file system and the disk data and undo files were all removed before the restore, as --initial does not remove disk data files.

Thanks
JBM
[31 Jan 2006 12:07] Jonas Oreland
OK, then this is a "known bug" that is also present in 4.1 and 5.0.
The problem is that mysqld keeps a copy of a table object (tableid, tableversion),
and after an initial start/restore this table might not be the same one.
So mysqld sends data with a tableid/tableversion that ndb doesn't know is incorrect, which
  yields unpredictable results.

The solution is to close all ndb objects/ndb handlers on cluster failure
  and let mysqld retry instead.

Tomas suggested that we fix this in 5.1 but don't do it in 4.1 or 5.0.

The problem can only occur with initial start/restore + keeping mysqld's alive
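
To make the failure mode concrete, here is a minimal Python sketch of the stale-cache problem and the proposed fix. It is not NDB code: the classes `Cluster` and `SqlNode`, their methods, and the table names are all hypothetical stand-ins, and the sketch tracks only a table id (no version). It assumes the restore recreates tables in a different order, so the same id ends up pointing at a different table.

```python
class Cluster:
    """Toy stand-in for the data nodes. Ids are handed out in creation order."""

    def __init__(self):
        self.name_by_id = {}
        self.id_by_name = {}

    def create_table(self, name):
        table_id = len(self.name_by_id) + 1
        self.name_by_id[table_id] = name
        self.id_by_name[name] = table_id

    def initial_start(self):
        # Wipes the dictionary; a following restore recreates the tables.
        self.name_by_id.clear()
        self.id_by_name.clear()

    def write(self, table_id):
        # The data nodes trust the id they are sent and report which table
        # was actually touched.
        return self.name_by_id[table_id]


class SqlNode:
    """Toy stand-in for one mysqld that caches table ids by name."""

    def __init__(self, cluster, invalidate_on_failure):
        self.cluster = cluster
        self.invalidate_on_failure = invalidate_on_failure
        self.cache = {}

    def on_cluster_failure(self):
        # The proposed fix: drop every cached object when the cluster fails.
        if self.invalidate_on_failure:
            self.cache.clear()

    def write(self, name):
        # Look the id up only on a cache miss, then reuse it.
        if name not in self.cache:
            self.cache[name] = self.cluster.id_by_name[name]
        return self.cluster.write(self.cache[name])


def demo():
    cluster = Cluster()
    cluster.create_table("accounts")
    cluster.create_table("history")
    stale = SqlNode(cluster, invalidate_on_failure=False)
    fixed = SqlNode(cluster, invalidate_on_failure=True)
    before = (stale.write("accounts"), fixed.write("accounts"))

    # Initial start followed by a restore that creates the tables in a
    # different order (an assumption made for this illustration).
    cluster.initial_start()
    for node in (stale, fixed):
        node.on_cluster_failure()
    cluster.create_table("history")
    cluster.create_table("accounts")
    after = (stale.write("accounts"), fixed.write("accounts"))
    return before, after
```

In this sketch, the node that keeps its cache writes to the wrong table after the restore, while the node that clears its cache on cluster failure looks the id up again and hits the intended table.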
[31 Jan 2006 12:14] Jonathan Miller
I am okay with not fixing it in 4.1, but I am not totally sure why we would want to leave out 5.0. But I am glad that you know what is causing the issues.
JBM
[2 Feb 2006 3:46] Stewart Smith
If we solved this by introducing a cluster unique id and sending it around when nodes join, we could also solve the potentially yucky situation where an (arguably dumb) administrator starts swapping nodes between two different clusters.

Even if the option is to barf saying "trying to join a different cluster, aborting connect" it would be better than now :)
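
A rough Python sketch of that idea follows. Everything in it is an assumption made for illustration: `ManagementServer`, `Node`, `ClusterIdMismatch`, and the choice to generate the id with `uuid.uuid4()` and to regenerate it on an initial start are hypothetical and do not describe any existing NDB mechanism.

```python
import uuid


class ClusterIdMismatch(Exception):
    """Raised when a node carries the id of a different cluster."""


class Node:
    """Hypothetical node (data node or mysqld) that remembers its cluster."""

    def __init__(self):
        self.cluster_id = None  # unknown until the first successful join


class ManagementServer:
    """Hypothetical owner of the cluster identity."""

    def __init__(self):
        self.cluster_id = uuid.uuid4().hex
        self.members = []

    def initial_start(self):
        # Assumption: an initial start gives the cluster a fresh identity,
        # so nodes holding state from the old incarnation are rejected.
        self.cluster_id = uuid.uuid4().hex
        self.members = []

    def join(self, node):
        if node.cluster_id is None:
            # First contact: the node adopts this cluster's id.
            node.cluster_id = self.cluster_id
        elif node.cluster_id != self.cluster_id:
            raise ClusterIdMismatch(
                "trying to join a different cluster, aborting connect"
            )
        if node not in self.members:
            self.members.append(node)
```

With such a check, a node moved to another cluster, or a mysqld left running across an initial start, would fail loudly at connect time instead of sending requests based on old state.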
[15 May 2006 12:32] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6383
[16 May 2006 6:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6430
[16 May 2006 17:22] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6473
[17 May 2006 4:42] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6489
[22 May 2006 6:43] Tomas Ulin
Pushed to 4.1.20, 5.0.22, 5.1.11.
[22 May 2006 9:28] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.20/5.0.22/5.1.11 changelogs.

Documented DD limitation in 5.1 Manual Cluster Chapter DD section.