Bug #46488 Starting ndb_mgmd with --initial by mistake gives confusing error
Submitted: 31 Jul 2009 9:59 Modified: 7 Oct 2009 12:57
Reporter: Geert Vanderkelen
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S2 (Serious)
Version: mysql-5.1.30-telco-7.0.6 OS: Any
Assigned to: Magnus Blåudd CPU Architecture: Any
Tags: ndb_mgmd
Triage: Triaged: D2 (Serious) / R6 (Needs Assessment) / E6 (Needs Assessment)

[31 Jul 2009 9:59] Geert Vanderkelen
Description:
In a setup with two management nodes running MySQL Cluster 7.0.6 with its cached configurations, killing one ndb_mgmd and then restarting it with --initial by mistake gives a confusing error.

ndb_mgm> show
Connected to Management Server at: ndbsup-priv-1:1406
ERROR Message: The cluster configuration is not yet confirmed by all defined management servers. This management server is still waiting for node  to connect.

Could not get configuration
*  4012: Failed to get configuration
*        The cluster configuration is not yet confirmed by all defined management servers. This management server is still waiting for node  to connect.

Note that the node id is not printed, which looks like a mistake.

How to repeat:
* Start up the ndb_mgmd with a clean --configdir, no logs, etc.
node 1:
  shell> ndb_mgmd -f config_70.ini --configdir=. --initial
  
  2009-07-31 11:41:17 [MgmSrvr] INFO     -- Got initial configuration from 'config_70.ini', will try to set it when all ndb_mgmd(s) started
  2009-07-31 11:41:17 [MgmSrvr] INFO     -- Mgmt server state: nodeid 1 reserved for ip 10.100.9.6, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000002.
  2009-07-31 11:41:17 [MgmSrvr] INFO     -- Node 1: Node 1 Connected
  2009-07-31 11:41:17 [MgmSrvr] INFO     -- Id: 1, Command port: *:1406
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Node 2 connected
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Starting initial configuration change
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Configuration 1 commited
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Config change completed! New generation: 1
  
node 2:
  shell> ndb_mgmd -f config_70.ini --configdir=.
  
  All goes well (see also above log from node 1):
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Got initial configuration from 'config_70.ini', will try to set it when all ndb_mgmd(s) started
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Mgmt server state: nodeid 2 reserved for ip 10.100.9.7, m_reserved_nodes 0000000000000000000000000000000000000000000000000000000000000004.
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Node 2: Node 2 Connected
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Id: 2, Command port: *:1406
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Node 2: Node 1 Connected
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Node 1 connected
  2009-07-31 11:41:32 [MgmSrvr] INFO     -- Configuration 1 commited

* Kill the first ndb_mgmd (node 1)
  shell> killall ndb_mgmd
* Start the first ndb_mgmd (node 1) without --initial or --reload. Neither option should be used here, because the configuration did not change and the cached config should be taken.
  shell> ndb_mgmd -f config_70.ini --configdir=.
  
  2009-07-31 11:45:02 [MgmSrvr] INFO     -- NDB Cluster Management Server. mysql-5.1.34 ndb-7.0.6
  2009-07-31 11:45:02 [MgmSrvr] INFO     -- Loaded config from '/data2/users/geert/cluster/master/ndb_1_config.bin.1'

* All works fine.
* Now kill the first ndb_mgmd (node 1) again, and use --initial by mistake:
  shell> killall ndb_mgmd
  shell> ndb_mgmd -f config_70.ini --configdir=. --initial

  ERROR Message: The cluster configuration is not yet confirmed by all defined management servers. This management server is still waiting for node  to connect.
  
  Note the missing node id in the sentence.

* Starting it again does not recover the situation.

Suggested fix:
If --initial is given while the second ndb_mgmd is up, exit with a warning that one cannot do this?

There seems to be no way out of this situation except cleaning up again on both ndb_mgmd nodes and starting over (this can be done online, I guess).
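
Until something like this is handled by ndb_mgmd itself, a wrapper script could guard against the mistake. The following is only a minimal sketch, not part of MySQL Cluster: the function names has_cached_config and safe_mgmd_start are hypothetical, and it assumes the cached configuration is stored in the --configdir under a name like ndb_1_config.bin.1, as in the "Loaded config from" log line above. It does not check whether the other ndb_mgmd is actually running.

```shell
#!/bin/sh
# Hypothetical guard around ndb_mgmd (a sketch, not part of MySQL Cluster):
# refuse --initial when a cached binary configuration already exists.

# Returns 0 if DIR contains a file named like ndb_<nodeid>_config.bin.<n>.
# The naming pattern is an assumption taken from the log output above.
has_cached_config() {
    for f in "$1"/ndb_*_config.bin.*; do
        # If nothing matches, the pattern stays literal and -e fails.
        [ -e "$f" ] && return 0
    done
    return 1
}

# Usage: safe_mgmd_start CONFIGDIR [ndb_mgmd options...]
safe_mgmd_start() {
    configdir=$1
    shift
    for arg in "$@"; do
        if [ "$arg" = "--initial" ] && has_cached_config "$configdir"; then
            echo "Refusing --initial: cached configuration found in $configdir" >&2
            return 1
        fi
    done
    # No conflict detected: pass all remaining options through unchanged.
    ndb_mgmd --configdir="$configdir" "$@"
}
```

With these functions loaded, the mistaken restart from the steps above would be written as `safe_mgmd_start . -f config_70.ini --initial` and would be refused once a cached configuration exists. A deliberate full re-initialization would then require removing the cached files first.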
[31 Jul 2009 10:00] Geert Vanderkelen
Maybe same as Bug#45495 ?

Verified using MySQL Cluster 7.0.6.
[31 Jul 2009 10:02] Geert Vanderkelen
Not like Bug#45495: here the ndb_mgmd starts but is useless, although it shows as connected.
[7 Aug 2009 8:53] Jonas Oreland
no point in making a new 7.0 release until this has been fixed
[30 Aug 2009 5:34] Geert Vanderkelen
The workaround would be 'do not make the mistake of using --initial'.
[1 Sep 2009 8:58] Magnus Blåudd
The log from the running mgmd shows:
2009-09-01 10:48:19 [MgmSrvr] WARNING  -- Refusing other node, it's in different state: 1, expected: 2
after this message the other node is "waiting" - for what? :)

Question is what action to take?
1. The starting ndb_mgmd should stop and report the error, the other node is already up and has a confirmed config.
or 
2. The starting node should grab the config from the other node and continue. If the config.ini of the starting node is exactly the same, no warning needs to be printed; otherwise print a warning saying that the local config.ini is outdated. This is more in line with how it works when the starting mgmd has no config.ini specified on the command line.
[1 Sep 2009 9:01] Magnus Blåudd
An additional workaround is to not specify which config.ini to use on the command line. The starting ndb_mgmd will then sync up with the other ndb_mgmd and go from there.

node1:
  shell> ndb_mgmd --configdir=. --initial
[7 Oct 2009 12:23] Magnus Blåudd
Fixed by BUG#45495
[7 Oct 2009 12:57] Jon Stephens
Already documented -- see BUG#49495 for changelog entry.

Closed.
[6 Nov 2009 9:36] Gustaf Thorslund
jon: s/BUG#49495/BUG#45495/