Bug #50587 Restart of all nodes after rolling upgrade from 7.0.9 to 7.0.10
Submitted: 25 Jan 2010 11:57 Modified: 25 Jan 2010 13:13
Reporter: Robert Klikics Email Updates:
Status: Duplicate Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0 OS: Linux
Assigned to: CPU Architecture:Any
Tags: 7-telco, cluster, crash, rolling restart

[25 Jan 2010 11:57] Robert Klikics
Description:
We just upgraded our cluster from 7.0.9 to 7.0.10.
The upgrade of the mgm node went fine, but when we did a rolling restart of our first ndb node, the whole cluster crashed and all nodes restarted.

The main problem is that after that (start phase 3 completed), nothing else happened. The cluster was stalled: we waited for 40 minutes without a single log entry or any other activity, so we had to stop and restart the whole thing.

Is there a known incompatibility between 7.0.9 and 7.0.10?

Please see the ndb_error_report here: http://85.25.144.101/files/ndb_error_report_20100125115356.tar.bz2

Log-Output from mgm:

2010-01-25 11:26:53 [MgmtSrvr] INFO     -- Node 2: Node restart starting to copy the fragments to Node 2
2010-01-25 11:26:53 [MgmtSrvr] INFO     -- Node 2: Node: 2 StartLog: [GCI Keep: 13251665 LastCompleted: 13252722 NewestRestorable: 13252881]
2010-01-25 11:28:02 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 22721 started. Keep GCI = 13252365 oldest restorable GCI = 13252809
2010-01-25 11:36:09 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 22722 started. Keep GCI = 13252946 oldest restorable GCI = 13253394
2010-01-25 11:43:34 [MgmtSrvr] INFO     -- Node 2: Node restart completed copying the fragments to Node 2
2010-01-25 11:43:35 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:35 [MgmtSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:35 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 1: Node 5 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected

How to repeat:
It may be reproducible the way we did it:

- Have a cluster with 7.0.9 running
- Update mgm to 7.0.10
- Update first node to 7.0.10
- Make a rolling restart
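
The steps above can be sketched as a shell procedure. This is only an illustration of the rolling-restart sequence, not the reporter's actual commands; the node IDs, hostnames, and config path are assumptions:

```shell
#!/bin/sh
# Rolling-upgrade sketch (illustrative; node IDs and paths are assumptions).

# 1. Upgrade the management node first: stop it, install the 7.0.10
#    binaries on its host, then restart it with the existing config.
ndb_mgm -e "1 stop"                       # mgm node assumed to be node ID 1
# ... install 7.0.10 binaries on the mgm host ...
ndb_mgmd -f /etc/mysql/config.ini --reload

# 2. Restart the data nodes one at a time so the cluster stays up.
for node in 2 3 4 5; do                   # data-node IDs assumed
    ndb_mgm -e "$node restart"            # graceful single-node restart
    # Wait until "ndb_mgm -e show" no longer lists the node as starting
    # before moving on to the next node.
    while ndb_mgm -e "show" | grep "id=$node" | grep -q "starting"; do
        sleep 10
    done
done
```

In the report, the crash hit during this loop: the first upgraded data node (node 2) finished copying fragments and then the remaining 7.0.9 nodes went down with error 2341.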
[25 Jan 2010 13:01] Hartmut Holzgraefe
Duplicate of bug #50433
[25 Jan 2010 13:13] Robert Klikics
Hi,

thanks for the link.

It would be appropriate to announce such things on the cluster mailing list. Or did I miss it?