Description:
A simple cluster upgrade from 5.1.22-rc to 5.1.23-rc via rolling restart fails if more than two data nodes are in use.
Specifically, the second data node to be upgraded in a node group fails with error 2341 in start phase 5 if *any* other 5.1.22-rc data node is still running.
The workaround is to first upgrade *one* data node from each node group to 5.1.23-rc, then stop *all* remaining 5.1.22-rc data nodes simultaneously, and finally upgrade and restart those one at a time.
Is a 5.1.22-rc to 5.1.23-rc cluster upgrade via rolling restart expected to work?
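The workaround above can be sketched as the following command sequence. This is a sketch only: the node IDs (2/3 in node group 0, 4/5 in node group 1), the management server on localhost:1186, and the binary paths are assumptions taken from the repeat steps below, and the exact ndb_mgm stop syntax may vary between versions.

```shell
# 1. Upgrade ONE data node per node group to 5.1.23-rc.
ndb_mgm -e "2 stop"                            # first node of node group 0
/path/to/5.1.23/bin/ndbd -c localhost:1186
ndb_mgm -e "4 stop"                            # first node of node group 1
/path/to/5.1.23/bin/ndbd -c "nodeid=4,localhost:1186"

# 2. Stop ALL remaining 5.1.22-rc data nodes (one ndb_mgm call per node).
ndb_mgm -e "3 stop"
ndb_mgm -e "5 stop"

# 3. Upgrade and restart them one at a time with the 5.1.23-rc binary.
/path/to/5.1.23/bin/ndbd -c "nodeid=3,localhost:1186"
/path/to/5.1.23/bin/ndbd -c "nodeid=5,localhost:1186"
```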
How to repeat:
1. Set up a 5.1.22-rc cluster with one (1) management node and four (4) data nodes:
<config.ini>
[NDB_MGMD]
hostname=localhost
datadir=/path/to/somewhere
[NDBD DEFAULT]
NoOfReplicas=2
datadir=/path/to/somewhere
[NDBD]
[NDBD]
[NDBD]
[NDBD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
</config.ini>
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0, Master)
id=3 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.22)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
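One way to bring up the cluster described above (all processes on localhost, as in this report; binary paths are placeholders):

```shell
# Start the management server from the config.ini above, then the
# four data nodes with an initial start. Repeat the ndbd line once
# per data node; the management server is assumed on the default
# port 1186.
/path/to/5.1.22/bin/ndb_mgmd -f config.ini
/path/to/5.1.22/bin/ndbd -c localhost:1186 --initial
```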
2. Stop the ndb_mgmd process, upgrade it, and restart.
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0, Master)
id=3 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
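A possible command sequence for this step, assuming the management node has id=1 and that the client's "<id> STOP" command is used to stop it (syntax may differ by version; binary path is a placeholder):

```shell
ndb_mgm -e "1 stop"                          # stop the management node (id=1)
/path/to/5.1.23/bin/ndb_mgmd -f config.ini   # restart with the 5.1.23-rc binary
ndb_mgm -e show                              # verify ndb_mgmd now reports 5.1.23
```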
3. Stop the first data node of node group 0, upgrade, and restart.
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=3 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0, Master)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
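A possible command sequence for this step, assuming node id=2 and a management server on localhost:1186 (binary path is a placeholder):

```shell
ndb_mgm -e "2 stop"                           # stop first node of node group 0
/path/to/5.1.23/bin/ndbd -c localhost:1186    # restart with the 5.1.23-rc binary
ndb_mgm -e "2 status"                         # poll until the node reports Started
```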
4. Stop the second data node of node group 0, upgrade, and restart. The node crashes with error 2341 in start phase 5:
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=3 (not connected, accepting connect from mythago)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1, Master)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
The workaround can be applied with the cluster in this degraded state as follows:
1. Stop the first node of node group 1, upgrade, and restart, specifying the node ID manually so that ndbd attaches to node group 1:
shell> ndbd -c nodeid=4
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=3 (not connected, accepting connect from mythago)
id=4 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1, Master)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
2. Stop the second node of node group 1 so that all running data nodes are now 5.1.23-rc:
ndb_mgm> 5 stop
Node 5: Node shutdown initiated
Node 5: Node shutdown completed.
Node 5 has shutdown.
ndb_mgm>
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0, Master)
id=3 (not connected, accepting connect from mythago)
id=4 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
id=5 (not connected, accepting connect from mythago)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
3. Upgrade and start the remaining nodes:
ndb_mgm> Node 3: Started (version 5.1.23)
Node 5: Started (version 5.1.23)
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0, Master)
id=3 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=4 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
Suggested fix:
1. Document this problem at http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-upgrade-downgrade-compatibility.html
2. Determine whether the workaround is a safe solution.