Description:
A simple cluster upgrade from 5.1.22-rc to 5.1.23-rc via rolling restart fails if more than two data nodes are in use.
Specifically, the second data node to be upgraded in a node group fails with error 2341 in start phase 5 if *any* other 5.1.22-rc data node is still running.
The workaround is to first upgrade *one* data node from each node group to 5.1.23-rc, then stop *all* remaining 5.1.22-rc data nodes simultaneously, and finally upgrade and restart those one at a time.
Is a 5.1.22-rc to 5.1.23-rc cluster upgrade via rolling restart expected to work?
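The workaround above can be sketched as the following command sequence. This is a sketch only: the node IDs (2/3 in node group 0, 4/5 in node group 1), the management server on localhost:1186, and the binary paths are assumptions taken from the repeat steps below, and the exact ndb_mgm stop syntax may vary between versions.

```shell
# 1. Upgrade ONE data node per node group to 5.1.23-rc.
ndb_mgm -e "2 stop"                            # first node of node group 0
/path/to/5.1.23/bin/ndbd -c localhost:1186
ndb_mgm -e "4 stop"                            # first node of node group 1
/path/to/5.1.23/bin/ndbd -c "nodeid=4,localhost:1186"

# 2. Stop ALL remaining 5.1.22-rc data nodes (one ndb_mgm call per node).
ndb_mgm -e "3 stop"
ndb_mgm -e "5 stop"

# 3. Upgrade and restart them one at a time with the 5.1.23-rc binary.
/path/to/5.1.23/bin/ndbd -c "nodeid=3,localhost:1186"
/path/to/5.1.23/bin/ndbd -c "nodeid=5,localhost:1186"
```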
How to repeat:
1. Set up a 5.1.22-rc cluster with one (1) management node and four (4) data nodes:
<config.ini>
[NDB_MGMD]
hostname=localhost
datadir=/path/to/somewhere
[NDBD DEFAULT]
NoOfReplicas=2
datadir=/path/to/somewhere
[NDBD]
[NDBD]
[NDBD]
[NDBD]
[MYSQLD]
[MYSQLD]
[MYSQLD]
</config.ini>
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0, Master)
id=3 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.22)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
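One way to bring up the cluster described above (all processes on localhost, as in this report; binary paths are placeholders):

```shell
# Start the management server from the config.ini above, then the
# four data nodes with an initial start. Repeat the ndbd line once
# per data node; the management server is assumed on the default
# port 1186.
/path/to/5.1.22/bin/ndb_mgmd -f config.ini
/path/to/5.1.22/bin/ndbd -c localhost:1186 --initial
```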
2. Stop the ndb_mgmd process, upgrade it, and restart.
ndb_mgm> show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0, Master)
id=3 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
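A possible command sequence for this step, assuming the management node has id=1 and that the client's "<id> STOP" command is used to stop it (syntax may differ by version; binary path is a placeholder):

```shell
ndb_mgm -e "1 stop"                          # stop the management node (id=1)
/path/to/5.1.23/bin/ndb_mgmd -f config.ini   # restart with the 5.1.23-rc binary
ndb_mgm -e show                              # verify ndb_mgmd now reports 5.1.23
```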
3. Stop the first data node of node group 0, upgrade, and restart.
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=3 @127.0.0.1 (Version: 5.1.22, Nodegroup: 0, Master)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
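A possible command sequence for this step, assuming node id=2 and a management server on localhost:1186 (binary path is a placeholder):

```shell
ndb_mgm -e "2 stop"                           # stop first node of node group 0
/path/to/5.1.23/bin/ndbd -c localhost:1186    # restart with the 5.1.23-rc binary
ndb_mgm -e "2 status"                         # poll until the node reports Started
```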
4. Stop the second data node of node group 0, upgrade, and restart. The node crashes with error 2341 in start phase 5:
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=3 (not connected, accepting connect from mythago)
id=4 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1, Master)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
Node 3: Forced node shutdown completed. Occured during startphase 5. Caused by error 2341: 'Internal program error (failed ndbrequire)(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
The workaround can be applied with the cluster in this degraded state as follows:
1. Stop the first node of node group 1, upgrade, and restart, specifying the node ID manually so that ndbd attaches to node group 1:
shell> ndbd -c nodeid=4
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=3 (not connected, accepting connect from mythago)
id=4 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.22, Nodegroup: 1, Master)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
2. Stop the second node of node group 1 so that all running data nodes are now 5.1.23-rc:
ndb_mgm> 5 stop
Node 5: Node shutdown initiated
Node 5: Node shutdown completed.
Node 5 has shutdown.
ndb_mgm>
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0, Master)
id=3 (not connected, accepting connect from mythago)
id=4 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
id=5 (not connected, accepting connect from mythago)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
3. Upgrade and start the remaining nodes:
ndb_mgm> Node 3: Started (version 5.1.23)
Node 5: Started (version 5.1.23)
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)] 4 node(s)
id=2 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0, Master)
id=3 @127.0.0.1 (Version: 5.1.23, Nodegroup: 0)
id=4 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
id=5 @127.0.0.1 (Version: 5.1.23, Nodegroup: 1)
[ndb_mgmd(MGM)] 1 node(s)
id=1 @127.0.0.1 (Version: 5.1.23)
[mysqld(API)] 3 node(s)
id=6 (not connected, accepting connect from any host)
id=7 (not connected, accepting connect from any host)
id=8 (not connected, accepting connect from any host)
Suggested fix:
1. Document this problem at http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-upgrade-downgrade-compatibility.html
2. Determine whether the workaround is a safe solution.