Bug #50587 Restart of all nodes after rolling upgrade from 7.0.9 to 7.0.10
Submitted: 25 Jan 2010 11:57 Modified: 25 Jan 2010 13:13
Reporter: Robert Klikics Email Updates:
Status: Duplicate Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: mysql-5.1-telco-7.0 OS: Linux
Assigned to: CPU Architecture:Any
Tags: 7-telco, cluster, crash, rolling restart

[25 Jan 2010 11:57] Robert Klikics
Description:
We just upgraded our cluster from 7.0.9 to 7.0.10.
The upgrade of the mgm node went fine, but when we did a rolling restart of our first ndb node, the whole cluster crashed and all nodes restarted.

The main problem is that after that (start phase 3 completed), nothing else happened. The cluster was stalled: we waited for 40 minutes without a single log entry or any other activity, so we had to stop and restart the whole thing.

Is there a known incompatibility between 7.0.9 and 7.0.10?

Please see the ndb_error_report here: http://85.25.144.101/files/ndb_error_report_20100125115356.tar.bz2

Log-Output from mgm:

2010-01-25 11:26:53 [MgmtSrvr] INFO     -- Node 2: Node restart starting to copy the fragments to Node 2
2010-01-25 11:26:53 [MgmtSrvr] INFO     -- Node 2: Node: 2 StartLog: [GCI Keep: 13251665 LastCompleted: 13252722 NewestRestorable: 13252881]
2010-01-25 11:28:02 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 22721 started. Keep GCI = 13252365 oldest restorable GCI = 13252809
2010-01-25 11:36:09 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 22722 started. Keep GCI = 13252946 oldest restorable GCI = 13253394
2010-01-25 11:43:34 [MgmtSrvr] INFO     -- Node 2: Node restart completed copying the fragments to Node 2
2010-01-25 11:43:35 [MgmtSrvr] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:35 [MgmtSrvr] ALERT    -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:35 [MgmtSrvr] ALERT    -- Node 1: Node 4 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 1: Node 5 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2010-01-25 11:43:36 [MgmtSrvr] ALERT    -- Node 1: Node 2 Disconnected

How to repeat:
It may be reproducible the way we did it:

- Have a cluster with 7.0.9 running
- Update mgm to 7.0.10
- Update first node to 7.0.10
- Make a rolling restart
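
The steps above can be sketched as a shell procedure. This is only an illustration of the rolling-restart sequence, not the reporter's actual commands; the node IDs, hostnames, and config path are assumptions:

```shell
#!/bin/sh
# Rolling-upgrade sketch (illustrative; node IDs and paths are assumptions).

# 1. Upgrade the management node first: stop it, install the 7.0.10
#    binaries on its host, then restart it with the existing config.
ndb_mgm -e "1 stop"                       # mgm node assumed to be node ID 1
# ... install 7.0.10 binaries on the mgm host ...
ndb_mgmd -f /etc/mysql/config.ini --reload

# 2. Restart the data nodes one at a time so the cluster stays up.
for node in 2 3 4 5; do                   # data-node IDs assumed
    ndb_mgm -e "$node restart"            # graceful single-node restart
    # Wait until "ndb_mgm -e show" no longer lists the node as starting
    # before moving on to the next node.
    while ndb_mgm -e "show" | grep "id=$node" | grep -q "starting"; do
        sleep 10
    done
done
```

In the report, the crash hit during this loop: the first upgraded data node (node 2) finished copying fragments and then the remaining 7.0.9 nodes went down with error 2341.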
[25 Jan 2010 13:01] Hartmut Holzgraefe
Duplicate of bug #50433
[25 Jan 2010 13:13] Robert Klikics
Hi,

thanks for the link.

It would be appropriate to announce such things on the cluster mailing list. Or did I miss it?