MySQL Bugs: #16875: Using stale MySQLD FRM files can cause restored cluster to fail

Bug #16875	Using stale MySQLD FRM files can cause restored cluster to fail
Submitted:	28 Jan 2006 21:12	Modified:	22 May 2006 9:28
Reporter:	Jonathan Miller
Status:	Closed
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	4.1 ->	OS:
Assigned to:	Tomas Ulin	CPU Architecture:	Any

Description:
I restored a 6 Million row database using DD. I have 7 MySQLD processes that are attached to the cluster. Through MySQLD I can see the tables and get counts off the tables, but as soon as I start the test against the database the cluster goes down.

2006-01-28 21:46:42 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2006-01-28 21:46:43 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2006-01-28 21:46:43 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Time: Saturday 28 January 2006 - 21:46:41
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1273) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 28473
Trace: /space/run/ndb_3_trace.log.2

Time: Saturday 28 January 2006 - 21:46:42
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: DbaccMain.cpp
Error object: DBACC (Line: 1273) 0x0000000a
Program: /home/ndbdev/jmiller/builds/libexec/ndbd
Pid: 30106
Trace: /space/run/ndb_2_trace.log.2
Version: Version 5.1.6 (alpha)
***EOM***

How to repeat:
Restore the saved file and try to run a test against the cluster.

I could not find any tracefiles...

BTW: Can you start using the ndb_error_reporter tool that Stewart wrote?

I restored the database again, and then went to each MySQLD and wiped the file system clean and recreated the TPCB for each of the 8 Processes. The test then started and the cluster has stayed up.

Before that, some mysqld instances I could use w/o issues, but with others the cluster would come down as soon as the test started.

Every once in a while I would get:

ERROR 1412 (HY000): Table definition has changed, please retry transaction

I just restored the DD Cluster database and totally recreated all the TPCB database files for each MySQLD process. Test started w/o issue.

What do you need feedback on?

Sorry, did not see the question.

I think the way to reproduce this is to have several MySQLD instances, use them for a while, restore a/the database and attempt to do a transaction such as an insert. You will get a temp error and the cluster is gone.

If you remove all the files for the mysqld and recreate them before attaching to the cluster with the restored database, then attach and create the new database, all is fine.
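A rough sketch of how the leftover files could be listed before cleaning up a mysqld. It assumes (this is not stated in the report) that each NDB table t leaves both t.frm and t.ndb in the database directory under the mysqld data directory. The function name and layout are hypothetical, and it only lists candidates; it deletes nothing.

```python
from pathlib import Path


def find_stale_ndb_files(datadir, database):
    """Return local .ndb/.frm files for tables that look like NDB tables.

    Assumption: an NDB table named t has both t.frm and t.ndb in
    <datadir>/<database>. Nothing is deleted here; the caller decides.
    """
    db_dir = Path(datadir) / database
    stale = []
    # Use the .ndb marker to tell NDB tables apart from local-only tables.
    for marker in sorted(db_dir.glob("*.ndb")):
        stale.append(marker)
        frm = marker.with_suffix(".frm")
        if frm.exists():
            stale.append(frm)
    return stale
```

Run it against each mysqld's data directory while that mysqld is stopped and review the list before removing anything.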

Jeb,
when you say "restored", did you do an initial start before restoring?

Tomas,

I will be moving to the 64 bit tests today, and will see if I can get it down to a set of steps on my side.

Jonas,

Actually I would do a rm -rf ndb_#_fs before attempting the restore. This ensured that the ndb fs and the disk data and undo files were removed before the restore, as --initial does not remove disk data files.

Thanks
JBM

Ok, then this is a "known bug" also present in 4.1,5.0
The problem is that the mysqld keeps a copy of a table object (tableid, tableversion)
And after initial start/restore this table might not be the same one.
So mysqld sends data with a tableid/tableversion that ndb doesn't know is incorrect, which
  yields unpredictable results.

The solution is to close all ndb objects/ndb handler on cluster failure
  And let mysqld retry instead.
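A toy model of the above, as a sketch only. FakeCluster and SqlNode are hypothetical stand-ins, not NDB API classes, and unlike the real cluster the fake one detects a stale (tableid, tableversion) and raises instead of misbehaving.

```python
class FakeCluster:
    """Stand-in for the data nodes: maps table name -> (id, version)."""

    def __init__(self):
        self.tables = {}
        self.next_id = 1

    def create_table(self, name):
        self.tables[name] = (self.next_id, 1)
        self.next_id += 1

    def initial_restart(self):
        # An initial start/restore throws away the old dictionary,
        # so table ids get handed out again from the beginning.
        self.tables = {}
        self.next_id = 1

    def lookup(self, name):
        return self.tables[name]

    def write(self, name, table_ref):
        # Simplification: the fake cluster notices the stale reference.
        if self.tables.get(name) != table_ref:
            raise RuntimeError("stale table reference")
        return "ok"


class SqlNode:
    """Stand-in for a mysqld that caches table references."""

    def __init__(self, cluster):
        self.cluster = cluster
        self.cache = {}

    def on_cluster_failure(self):
        # The proposed fix: drop every cached object when the cluster goes away.
        self.cache.clear()

    def insert(self, name):
        if name not in self.cache:
            self.cache[name] = self.cluster.lookup(name)
        return self.cluster.write(name, self.cache[name])
```

If the tables come back in a different order after initial_restart(), a node that skipped on_cluster_failure() sends the old id and the write fails. After clearing the cache, the same insert looks the table up again and succeeds.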

Tomas suggested that we fix this in 5.1 but don't do it in 4.1,5.0.

The problem can only occur with initial start/restore + keeping mysqld's alive

I am okay with not fixing in 4.1, but not totally sure why we would want to leave out 5.0. But I am glad that you know what is causing the issues.
JBM

If we solved this by introducing a cluster unique id and sending it around when nodes join, we could then also solve the potential yucky situation where an (arguably dumb) administrator starts swapping nodes between two different clusters.

Even if the option is to barf saying "trying to join a different cluster, aborting connect" it would be better than now :)
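A minimal sketch of that check, assuming each cluster generates a random id once and every node presents it when connecting. The names new_cluster_id, check_join and ClusterIdMismatch are made up for illustration; nothing here reflects how the committed patches work.

```python
import uuid


class ClusterIdMismatch(Exception):
    """Raised when a joining node carries another cluster's id."""


def new_cluster_id():
    # Hypothetical: generated once, at the cluster's initial start.
    return uuid.uuid4().hex


def check_join(local_cluster_id, remote_cluster_id):
    """Refuse the connection when the joining node belongs to another cluster."""
    if local_cluster_id != remote_cluster_id:
        raise ClusterIdMismatch(
            "trying to join a different cluster, aborting connect"
        )
    return True
```

If an initial start also produced a new id, the same comparison could catch a mysqld that is still holding objects from before the restore.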

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6383

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6430

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6473

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6489

pushed to 4.1.20, 5.0.22, 5.1.11

Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix in 4.1.20/5.0.22/5.1.11 changelogs.

Documented DD limitation in 5.1 Manual Cluster Chapter DD section.