Bug #44488 | ndb_mgmd is busy 4 min in startup (24 datanodes) | ||
---|---|---|---|
Submitted: | 27 Apr 2009 8:27 | Modified: | 15 May 2009 13:04 |
Reporter: | Henrik Ingo | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S3 (Non-critical) |
Version: | MySQL Cluster 7.0.4 | OS: | Solaris |
Assigned to: | Magnus Blåudd | CPU Architecture: | Any |
[27 Apr 2009 8:27]
Henrik Ingo
[29 Apr 2009 11:58]
Henrik Ingo
This is clearly relevant so I add it here rather than as a separate bug: The whole system is 24 datanodes, 64 mysqld nodes and using ndb-connection-pool=2 so 128 api slots in config.ini. It seems a cluster of this size has startup issues when all nodes start simultaneously: - ndb_mgmd takes minutes to start with one thread using up 1 cpu (thread) - some ndbmtd:s seem to also have rather busy threads, but not all of them - in the end, some mysqld's are not connected to the cluster and hence, ndb databases/tables do not appear on those mysqlds after being created. The error.log of mysql25 with the missing [DBname] is: root@xxx144> cat xxx144.err 090422 11:25:30 mysqld_safe Starting mysqld daemon with databases from /usr/local/mysql/var/mysql25/ 090422 11:25:31 [Note] NDB: NodeID is 49, management server 'xxx144:1186' 090422 11:25:33 [Note] NDB[0]: NodeID: 49, no storage nodes connected (timed out) 090422 11:25:35 [Note] NDB[1]: NodeID: 75, no storage nodes connected (timed out) 090422 11:25:35 [Warning] NDB: server id set to zero will cause any other mysqld with bin log to log with wrong server id 090422 11:25:35 [Note] Starting Cluster Binlog Thread 090422 11:25:36 [Note] /usr/local/mysql-5.1.32-ndb-7.0.4-innodb-solaris10-sparc/libexec/mysqld: ready for connections. Version: '5.1.32-ndb-7.0.4-innodb' socket: '/tmp/mysql.25.sock' port: 9324 Source distribution 090422 11:25:40 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema 090422 11:25:40 [ERROR] NDB Binlog: Creating NdbEventOperation failed for REPL$mysql/ndb_schema 090422 11:25:40 [ERROR] NDB Binlog:FAILED CREATE (DISCOVER) EVENT OPERATIONS Event: REPL$mysql/ndb_schema 090422 11:25:43 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_apply_status 090422 11:25:43 [ERROR] NDB Binlog: Creating NdbEventOperation failed for REPL$mysql/ndb_apply_status 090422 11:25:43 [ERROR] NDB Binlog:FAILED CREATE (DISCOVER) EVENT OPERATIONS Event: REPL$mysql/ndb_apply_status 2009-04-22 11:25:44 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 145 Time out talking to management server Error line: 464 090422 11:25:45 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema 090422 11:25:45 [ERROR] NDB Binlog: Creating NdbEventOperation failed for REPL$mysql/ndb_schema 090422 11:25:45 [ERROR] NDB Binlog:FAILED CREATE (DISCOVER) EVENT OPERATIONS Event: REPL$mysql/ndb_schema 090422 11:25:45 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_apply_status 090422 11:25:45 [ERROR] NDB Binlog: Creating NdbEventOperation failed for REPL$mysql/ndb_apply_status 090422 11:25:45 [ERROR] NDB Binlog:FAILED CREATE (DISCOVER) EVENT OPERATIONS Event: REPL$mysql/ndb_apply_status 2009-04-22 11:25:49 [NdbApi] INFO -- Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection. 145 Time out talking to management server Error line: 464
[29 Apr 2009 12:06]
Henrik Ingo
(forwarding from customer) When I restarted ndb_mgmd this morning (with --initial), it again took 4 minutes to start: 2009-04-23 10:04:12 [MgmSrvr] INFO -- Got initial configuration from '/etc/mysql/config.ini', will try to set it when all ndb_mgmd(s) started 2009-04-23 10:04:12 [MgmSrvr] INFO -- Mgmt server state: nodeid 1 reserved for ip 10.44.22.241, m_reserved_nodes 00000000000000000000000000000000000 00000000000000000000000000002. 2009-04-23 10:04:12 [MgmSrvr] INFO -- Node 1: Node 1 Connected 2009-04-23 10:04:12 [MgmSrvr] INFO -- Id: 1, Command port: *:1186 2009-04-23 10:04:12 [MgmSrvr] INFO -- Starting initial configuration change 2009-04-23 10:08:19 [MgmSrvr] INFO -- Configuration 1 commited 2009-04-23 10:08:19 [MgmSrvr] INFO -- Config change completed! New generation: 1 This time I collected prstat's and pstack's showing that ndb_mgmd is using one entire CPU is user time in findKey(...): 23607: /usr/local/mysql/libexec/ndb_mgmd --ndb-nodeid=1 --reload --initial -- 00000001001235f8 bool findKey(const unsigned*,unsigned,unsigned,unsigned*) (101206fe0, 6ebb, 117381c6, ffffffff7fffd8ac, ffffffff7f200200, 6b45) + 180 0000000100123ffc bool ConfigValuesFactory::put(const ConfigValues::Entry&) (ffffffff7fffe900, ffffffff7fffd9a0, ffffffff7fffd9a0, 6ebb, 100000, ffffffff7e704500) + f4 Then it's forking itself, and now thread #7 is again using 100% of one CPU in user time, again in findKey(...): ----------------- lwp# 7 / thread# 7 -------------------- 00000001001235b0 bool findKey(const unsigned*,unsigned,unsigned,unsigned*) (10144fff0, 11413, 4e7ffe, ffffffff7d50ef0c, 0, 1fe4) + 138 0000000100122d1c bool ConfigValues::ConstIterator::get(unsigned,ConfigValues::Entry*)const (ffffffff7d50f2e8, 3ffe, ffffffff7d50efe0, 10486e36e, 1, 10486e36e) + 2c Attached are the stack's (that include some prstat data as well).
[29 Apr 2009 12:10]
Henrik Ingo
ndb_mgmd.pstack
Attachment: ndb_mgmd.pstack (application/octet-stream, text), 10.70 KiB.
[29 Apr 2009 12:15]
Henrik Ingo
ndb_mgmd_forked.pstack
Attachment: ndb_mgmd_forked.pstack (application/octet-stream, text), 58.71 KiB.
[29 Apr 2009 12:30]
Magnus Blåudd
Reproduced locally. ndb_mgmd ran with one core at 100% more than a minute to start from the given configuration file.
[29 Apr 2009 21:11]
Magnus Blåudd
The problem occurs when we save the configuration in ConfigValues(the binary representation of a configuration). The container will dynamically expand itself each 32nd item. Unfortunately it's an expensive operation to exapnd the container, allocating new memory and copying from old to new. Especially when we have ~70000 items. Have written a loop that estimates the number of itens and that way we should get away with only one initial expand which is fast.
[30 Apr 2009 16:01]
Magnus Blåudd
There is also another cpu hog in 'copy_dynamic_ports' that run each time the configuration changes. Will have to rewrite that function as well. Either using a lookup table or simply by not keeping the dynamic ports in the configuration. Two options to consider.
[13 May 2009 13:14]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/73941
[13 May 2009 14:11]
Bugs System
Pushed into 5.1.34-ndb-7.0.6 (revid:magnus.blaudd@sun.com-20090513133517-l7vb65braweonf43) (version source revid:magnus.blaudd@sun.com-20090513133517-l7vb65braweonf43) (merge vers: 5.1.34-ndb-7.0.6) (pib:6)
[14 May 2009 8:31]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/74010
[14 May 2009 17:50]
Bugs System
Pushed into 5.1.34-ndb-7.0.6 (revid:magnus.blaudd@sun.com-20090514174907-6cf0shl3utsxtgzr) (version source revid:magnus.blaudd@sun.com-20090514174907-6cf0shl3utsxtgzr) (merge vers: 5.1.34-ndb-7.0.6) (pib:6)
[14 May 2009 19:46]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/74103
[14 May 2009 20:08]
Bugs System
Pushed into 5.1.34-ndb-7.0.6 (revid:magnus.blaudd@sun.com-20090514194749-5zeuy3cy5sy9ccrl) (version source revid:magnus.blaudd@sun.com-20090514194749-5zeuy3cy5sy9ccrl) (merge vers: 5.1.34-ndb-7.0.6) (pib:6)
[15 May 2009 13:04]
Jon Stephens
Documented bugfix in the NDB-7.0.6 changelog as follows: When using large numbers of configuration parameters, the management server took an excessive amount of time (several minutes or more) to load these from the configuration cache when starting. This problem occurred when there were more than 32 configuration parameters specified, and became progressively worse with each additional multiple of 32 configuration parameters.