Bug #45385 Online upgrade 6.3.24 -> 7.0.x fails
Submitted: 9 Jun 2009 3:05 Modified: 15 Oct 2009 12:24
Reporter: Phil Bayfield Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S1 (Critical)
Version:mysql-5.1-telco-7.0 OS:Linux
Assigned to: Assigned Account CPU Architecture:Any
Tags: 7.0.6, 7.0.7

[9 Jun 2009 3:05] Phil Bayfield
Description:
I built 7.0.6 from launchpad code rev 2925 tagged as mysql-5.1.34-telco-7.0.6

Works fine in testing but online upgrade fails, node won't start:

Time: Tuesday 9 June 2009 - 02:47:00
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 17190) 0x0000000a
Program: ndbd
Pid: 13600
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.9
Version: mysql-5.1.34 ndb-7.0.6
***EOM***

Time: Tuesday 9 June 2009 - 02:52:36
Status: Temporary error, restart node
Message: Internal program error (failed ndbrequire) (Internal error, programming error or missing error message, please report a bug)
Error: 2341
Error data: dbdict/Dbdict.cpp
Error object: DBDICT (Line: 17190) 0x0000000a
Program: ndbd
Pid: 13655
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.10
Version: mysql-5.1.34 ndb-7.0.6
***EOM***

How to repeat:
Online upgrade from 6.3.24
[9 Jun 2009 3:15] Phil Bayfield
ndb_mgmd logs:

2009-06-09 02:48:08 [MgmSrvr] INFO     -- Node 1: Node 4 Connected
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 3: Node 4 Connected
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 3: Node 4: API mysql-5.1.34 ndb-7.0.6
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Node 3 Connected
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: CM_REGCONF president = 3, own Node = 4, our dynamic id = 11
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Node 3: API mysql-5.1.32 ndb-6.3.24
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 1 completed
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Receive arbitrator node 1 [ticket=0b6800342a074cca]
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 2 completed (node restart)
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 3: DICT: locked by node 4 for NodeRestart
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 3 completed (node restart)
2009-06-09 02:48:11 [MgmSrvr] INFO     -- Node 4: Start phase 4 completed (node restart)
2009-06-09 02:52:02 [MgmSrvr] INFO     -- Node 4: Starting to restore schema
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 3: Node 4 Disconnected
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: Communication to Node 4 closed
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 3: Network partitioning - arbitration required
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: President restarts arbitration thread [state=7]
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
2009-06-09 02:52:37 [MgmSrvr] ALERT    -- Node 3: Arbitration won - positive reply from node 1
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: DICT: remove lock by failed node 4 for NodeRestart
2009-06-09 02:52:37 [MgmSrvr] INFO     -- Node 3: DICT: unlocked by node 4 for NodeRestart
2009-06-09 02:52:38 [MgmSrvr] ALERT    -- Node 1: Node 4 Disconnected
[9 Jun 2009 5:31] Jonas Oreland
Hi,

Can you please upload
- "full"/zipped cluster-log,
- schema (e.g from mysqldump or similar)

/Jonas
[9 Jun 2009 17:37] Jonas Oreland
it's definitely index related, in particular index 31
can you on your 6.3 cluster run
> ndb_show_tables | grep "31"

which will show the name of index 31

/Jonas
[9 Jun 2009 20:37] Phil Bayfield
This is what it shows for number 31:

31    IndexTrigger         Online   -                             NDB$INDEX_60_CUSTOM
31    OrderedIndex         Online   No      sys          def      PRIMARY
[16 Jun 2009 12:00] Jonas Oreland
Hi,

I've now tested with your schema file...and it worked fine :-(
(comment: you probably shouldn't dump the "mysql"-database)

For the 783 error (which I also got) I filed http://bugs.mysql.com/45525

---

1) There should also be somewhat interesting information in the out-file of the datanode(s).
2) You can try starting the 7.0.6 binary "--initial" (but make an OS-backup first)

---

Not really sure how to proceed.
[16 Jun 2009 16:09] Phil Bayfield
I'll make some changes to the live db schema to reflect what I got to work in the sandbox setup. I used the same binary I compiled so maybe it will work (or maybe not!) :)

I've noticed on the Cluster forum I'm also not the only one having this issue as well.

It wasnt liking a couple of tables with many indexes (15?) and on a UNIQUE key, which when changed to a normal index was fine. The database in question was something we built a few years ago and probably isn't the best designed schema. I have no idea why there were so many indexes in all honesty, I think possibly it was an attempt to speed up problems with the database running slowly under heavy load. Back then we only had a fairly basic pentium 4 server running everything.

I'll try this and an initial start (if it doesn't work) and see if I get anywhere. Failing this I could always drop the effected database and/or move it over to a replication setup for the time being and see if it will restart with the other databases on there.

I'm reluctant to dump the database, restart it and then restore the database (which worked for someone on the forum) just incase problems to remain, unfortunately I don't think I have enough ram in my local machine to test it either as it only has 6Gb.

I realised after I had posted the schema I dumped mysql database by mistake, sorry about that hope it didn't cause any problems with your testing!
[3 Sep 2009 9:47] Jonas Oreland
Can you test with 7.0.7
Setting status to need feedback
[4 Sep 2009 10:33] Phil Bayfield
Will try with 7.0.7 and update once tested.
[4 Sep 2009 10:39] Phil Bayfield
Looks like the same problem:

2009-09-04 10:36:43 [MgmSrvr] INFO     -- Node 4: Starting to restore schema
2009-09-04 10:37:29 [MgmSrvr] ALERT    -- Node 4: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
[4 Sep 2009 12:32] Jonas Oreland
Hi again,

Darn...that it wasn't fixed.
So I looked at the traces "last time" and couldnt find problem.

1) Is it a production system ?
2) If no, can you shut it down, take a filesystem backup of ndb_{3/4}_fs
   and upload it to ftp.mysql.com
3) If filesystem backup is not possible, can you take a ndb-backup
   and upload BACKUP-{N}.{3/4}.ctl, this is only meta-data

/Jonas
[4 Sep 2009 14:07] Phil Bayfield
Hi Jonas,

It's a live system so shutdown not possible. I've actually dropped a number of the databases as they are now running on a separate MySQL Cluster system (running 7.0.7 without issue).
CTL files requested are attached, let me know if you need anything else.
[10 Sep 2009 8:49] Jonas Oreland
Hi again,

So now I tried with your backup files, and again no problems encountered :-(

Can you attach a full-ndb-error-reporter tarbar for your latest attempt.
(this should include cluster log, error log(s), trace-file(s) and out-files)

The old tracefiles doesnt seem to be accurate any longer (no table/index with id 31 is found in backup files)

Setting to need feedback again

/Jonas
[10 Sep 2009 23:07] Phil Bayfield
Hi Jonas,

What is the process for doing this?

Does it need any particular command line params to run?

Thanks
Phil.
[11 Sep 2009 7:21] Jonas Oreland
http://dev.mysql.com/doc/refman/4.1/en/mysql-cluster-programs-ndb-error-reporter.html
[15 Oct 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".