Description:
We just upgraded our cluster from 7.0.9 to 7.0.10.
The upgrade of the mgm node went fine, but when we did a rolling restart of our first ndb node, the whole cluster crashed and all nodes restarted.
The main problem is that after that (start phase 3 completed) nothing else happened. The cluster was stalled: we waited for 40 minutes, but there was no log entry or any other activity during that time, so we had to stop the whole cluster and restart it again.
Is there a known incompatibility between 7.0.9 and 7.0.10?
Please see the ndb_error_report here: http://85.25.144.101/files/ndb_error_report_20100125115356.tar.bz2
Log output from mgm:
2010-01-25 11:26:53 [MgmtSrvr] INFO -- Node 2: Node restart starting to copy the fragments to Node 2
2010-01-25 11:26:53 [MgmtSrvr] INFO -- Node 2: Node: 2 StartLog: [GCI Keep: 13251665 LastCompleted: 13252722 NewestRestorable: 13252881]
2010-01-25 11:28:02 [MgmtSrvr] INFO -- Node 3: Local checkpoint 22721 started. Keep GCI = 13252365 oldest restorable GCI = 13252809
2010-01-25 11:36:09 [MgmtSrvr] INFO -- Node 3: Local checkpoint 22722 started. Keep GCI = 13252946 oldest restorable GCI = 13253394
2010-01-25 11:43:34 [MgmtSrvr] INFO -- Node 2: Node restart completed copying the fragments to Node 2
2010-01-25 11:43:35 [MgmtSrvr] ALERT -- Node 4: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:35 [MgmtSrvr] ALERT -- Node 5: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:35 [MgmtSrvr] ALERT -- Node 1: Node 4 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2010-01-25 11:43:36 [MgmtSrvr] ALERT -- Node 1: Node 5 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT -- Node 1: Node 3 Disconnected
2010-01-25 11:43:36 [MgmtSrvr] ALERT -- Node 2: Forced node shutdown completed. Occured during startphase 5. Caused by error 2308: 'Another node failed during system restart, please investigate error(s) on other node(s)(Restart error). Temporary error, restart node'.
2010-01-25 11:43:36 [MgmtSrvr] ALERT -- Node 1: Node 2 Disconnected
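For anyone triaging a log like the one above, here is a minimal Python sketch that pulls the forced-shutdown events (node, error code, and start phase where reported) out of the mgm output. It is not part of the original report. The regular expression is an assumption based only on the lines quoted here, including the log's own spelling "Occured", and the names `ForcedShutdown` and `parse_forced_shutdowns` are hypothetical.

```python
import re
from typing import Iterable, List, NamedTuple, Optional


class ForcedShutdown(NamedTuple):
    """One 'Forced node shutdown completed' event from the mgm log."""
    timestamp: str
    node_id: int
    error_code: int
    start_phase: Optional[int]  # None when the log line reports no start phase


# Pattern derived from the log lines quoted in this report (an assumption,
# not an official format description). "Occured" is spelled as in the log.
_SHUTDOWN_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\s+\[MgmtSrvr\]\s+ALERT\s+--\s+"
    r"Node (?P<node>\d+): Forced node shutdown completed\."
    r"(?: Occured during startphase (?P<phase>\d+)\.)?"
    r" Caused by error (?P<code>\d+):"
)


def parse_forced_shutdowns(lines: Iterable[str]) -> List[ForcedShutdown]:
    """Return the forced-shutdown events found in the given log lines, in order."""
    events = []
    for line in lines:
        match = _SHUTDOWN_RE.match(line.strip())
        if match is None:
            continue  # not a forced-shutdown line (INFO, Disconnected, ...)
        phase = match.group("phase")
        events.append(
            ForcedShutdown(
                timestamp=match.group("ts"),
                node_id=int(match.group("node")),
                error_code=int(match.group("code")),
                start_phase=int(phase) if phase is not None else None,
            )
        )
    return events
```

Feeding it the lines above (for example via `parse_forced_shutdowns(open(path))`, where `path` is wherever the mgm log was saved) would list nodes 4, 5 and 3 with error 2341 and node 2 with error 2308 during start phase 5, which makes it easier to see that node 2 went down as a consequence of the other three.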
How to repeat:
It may be reproducible by following what we did:
- Have a cluster running 7.0.9
- Upgrade mgm to 7.0.10
- Upgrade the first ndb node to 7.0.10
- Perform a rolling restart