MySQL Bugs: #45385: Online upgrade 6.3.24 -> 7.0.x fails

Bug #45385	Online upgrade 6.3.24 -> 7.0.x fails
Submitted:	9 Jun 2009 3:05	Modified:	15 Oct 2009 12:24
Reporter:	Phil Bayfield	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	mysql-5.1-telco-7.0	OS:	Linux
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	7.0.6, 7.0.7

Description:
I built 7.0.6 from launchpad code rev 2925 tagged as mysql-5.1.34-telco-7.0.6

Works fine in testing but online upgrade fails, node won't start:

Time: Tuesday 9 June 2009 - 02:47:00
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 17190) 0x0000000a
Program: ndbd
Pid: 13600
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.9
Version: mysql-5.1.34 ndb-7.0.6
***EOM***

Time: Tuesday 9 June 2009 - 02:52:36
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 17190) 0x0000000a
Program: ndbd
Pid: 13655
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.10
Version: mysql-5.1.34 ndb-7.0.6
***EOM***

How to repeat:
Online upgrade from 6.3.24

ndb_mgmd logs:

2009-06-09 02:48:08 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 3: Node 4: API mysql-5.1.34 ndb-7.0.6
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Node 3 Connected
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 11
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Node 3: API mysql-5.1.32 ndb-6.3.24
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Receive arbitrator node 1 [ticket=0b6800342a074cca]
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (node restart)
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 3: DICT: locked by node 4 for NodeRestart
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (node restart)
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (node restart)
2009-06-09 02:52:02 [MgmSrvr] INFO     -- Node 4: Starting to restore schema
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: DICT: remove lock by failed node 4 for NodeRestart
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: DICT: unlocked by node 4 for NodeRestart
2009-06-09 02:52:38 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected

Hi,

Can you please upload
- "full"/zipped cluster-log,
- schema (e.g from mysqldump or similar)

/Jonas

it's definitely index related, in particular index 31
can you on your 6.3 cluster run
> ndb_show_tables | grep "31"

which will show the name of index 31

/Jonas

This is what it shows for number 31:

31    IndexTrigger         Online   -                             NDB$INDEX_60_CUSTOM
31    OrderedIndex         Online   No      sys          def      PRIMARY

Hi,

I've now tested with your schema file...and it worked fine :-(
(comment: you probably shouldn't dump the "mysql"-database)

For the 783 error (which I also got) I filed http://bugs.mysql.com/45525

---

1) There should also be somewhat interesting information in the out-file of the datanode(s).
2) You can try starting the 7.0.6 binary "--initial" (but make an OS-backup first)

---

Not really sure how to proceed.

I'll make some changes to the live db schema to reflect what I got to work in the sandbox setup. I used the same binary I compiled so maybe it will work (or maybe not!) :)

I've noticed on the Cluster forum I'm also not the only one having this issue as well.

It wasnt liking a couple of tables with many indexes (15?) and on a UNIQUE key, which when changed to a normal index was fine. The database in question was something we built a few years ago and probably isn't the best designed schema. I have no idea why there were so many indexes in all honesty, I think possibly it was an attempt to speed up problems with the database running slowly under heavy load. Back then we only had a fairly basic pentium 4 server running everything.

I'll try this and an initial start (if it doesn't work) and see if I get anywhere. Failing this I could always drop the effected database and/or move it over to a replication setup for the time being and see if it will restart with the other databases on there.

I'm reluctant to dump the database, restart it and then restore the database (which worked for someone on the forum) just incase problems to remain, unfortunately I don't think I have enough ram in my local machine to test it either as it only has 6Gb.

I realised after I had posted the schema I dumped mysql database by mistake, sorry about that hope it didn't cause any problems with your testing!

Can you test with 7.0.7
Setting status to need feedback

Will try with 7.0.7 and update once tested.

Looks like the same problem:

2009-09-04 10:36:43 [MgmSrvr] INFO     -- Node 4: Starting to restore schema
2009-09-04 10:37:29 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.

Hi again,

Darn...that it wasn't fixed.
So I looked at the traces "last time" and couldnt find problem.

1) Is it a production system ?
2) If no, can you shut it down, take a filesystem backup of ndb_{3/4}_fs
   and upload it to ftp.mysql.com
3) If filesystem backup is not possible, can you take a ndb-backup
   and upload BACKUP-{N}.{3/4}.ctl, this is only meta-data

/Jonas

Hi Jonas,

It's a live system so shutdown not possible. I've actually dropped a number of the databases as they are now running on a separate MySQL Cluster system (running 7.0.7 without issue).
CTL files requested are attached, let me know if you need anything else.

Hi again,

So now I tried with your backup files, and again no problems encountered :-(

Can you attach a full-ndb-error-reporter tarbar for your latest attempt.
(this should include cluster log, error log(s), trace-file(s) and out-files)

The old tracefiles doesnt seem to be accurate any longer (no table/index with id 31 is found in backup files)

Setting to need feedback again

/Jonas

Hi Jonas,

What is the process for doing this?

Does it need any particular command line params to run?

Thanks
Phil.

http://dev.mysql.com/doc/refman/4.1/en/mysql-cluster-programs-ndb-error-reporter.html

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".