Bug #42225 ndbcluster crash
Submitted: 20 Jan 2009 18:24
Modified: 13 Apr 2009 9:49
Reporter: peter cooper
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.0.51
OS: Linux (redhat)
Assigned to:
CPU Architecture: Any
Tags: 2305 node lost connection

[20 Jan 2009 18:24] peter cooper
Description:
I have a 3-server cluster setup:

1 mgmd, 2 ndb nodes, 2 mysqld.
The mysqld and ndb nodes run on two 4 GB dual-core servers.

These have been running fine for 6 months.

I recently made changes to the schema (4 existing table changes and 2 new tables).

The change failed and locked the whole database, so I had to restore from a backup (10 hrs of live data lost).

I then changed config.ini from having no values set (all defaults) to:
MaxNoOfTables=4096
MaxNoOfAttributes=24756
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512

This enabled the changes above to be made and all seemed fine.

It then ran fine for 2 days, until this afternoon one node failed and brought down the whole system, forcing me to do a restore and losing some more live data. I have since added 2 more config values and increased some of the previous ones to:
MaxNoOfTables=4096
MaxNoOfAttributes=24756
MaxNoOfOrderedIndexes=8048
MaxNoOfUniqueHashIndexes=1512
DataMemory=1000M
IndexMemory=250M

Can someone help please? I'm pasting the relevant log details and attaching the ndb_error_reporter log.

Is it just a case of increasing default values in the config file? Currently we have a database that is large in schema but small in data size.

NDB3_error_log:
Time: Tuesday 20 January 2009 - 13:20:26
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4659) 0x0000000e
Program: ndbd
Pid: 19495
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.7
Version: Version 5.0.51
***EOM***

There was no error from the management node.

NDB3_out_log:
2009-01-20 13:20:26 [ndbd] INFO     -- Error handler shutting down system
2009-01-20 13:20:27 [ndbd] INFO     -- Error handler shutdown completed - exiting
2009-01-20 13:20:27 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2009-01-20 13:50:01 [ndbd] INFO     -- Angel pid: 17234 ndb pid: 17235
2009-01-20 13:50:01 [ndbd] INFO     -- NDB Cluster -- DB node 3
2009-01-20 13:50:01 [ndbd] INFO     -- Version 5.0.51 --
2009-01-20 13:50:01 [ndbd] INFO     -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 13:50:01 [ndbd] INFO     -- Start initiated (version 5.0.51)
2009-01-20 13:52:12 [ndbd] INFO     -- Angel pid: 17295 ndb pid: 17296
2009-01-20 13:52:12 [ndbd] INFO     -- NDB Cluster -- DB node 3
2009-01-20 13:52:12 [ndbd] INFO     -- Version 5.0.51 --
2009-01-20 13:52:12 [ndbd] INFO     -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 13:52:13 [ndbd] INFO     -- Start initiated (version 5.0.51)
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.
2009-01-20 13:58:04 [ndbd] INFO     -- Angel pid: 17362 ndb pid: 17363
2009-01-20 13:58:04 [ndbd] INFO     -- NDB Cluster -- DB node 3
2009-01-20 13:58:04 [ndbd] INFO     -- Version 5.0.51 --
2009-01-20 13:58:04 [ndbd] INFO     -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 13:58:05 [ndbd] INFO     -- Start initiated (version 5.0.51)
2009-01-20 14:15:09 [ndbd] INFO     -- Angel pid: 18014 ndb pid: 18015
2009-01-20 14:15:09 [ndbd] INFO     -- NDB Cluster -- DB node 3
2009-01-20 14:15:09 [ndbd] INFO     -- Version 5.0.51 --
2009-01-20 14:15:09 [ndbd] INFO     -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 14:15:09 [ndbd] INFO     -- Start initiated (version 5.0.51)

How to repeat:
Not possible to force a repeat; it just happened and I don't know why.
[20 Jan 2009 18:25] peter cooper
ndb_error_reporter file

Attachment: ndb_error_report_20090120181725.tar.bz2 (application/octet-stream, text), 442.09 KiB.

[20 Jan 2009 18:27] peter cooper
updated to S2 serious
[13 Mar 2009 9:49] Jonas Oreland
Hi,

My *guess* is that your increase in config caused ndbd to allocate
enough additional memory to make the machine start swapping.
Swapping is frequently the cause of missed heartbeats...
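
One way to test the swapping guess on a data node host is to sample the kernel's swap counters twice and compare them. The sketch below is only an illustration and not part of the original report: it assumes a Linux kernel that exposes `pswpin` and `pswpout` in `/proc/vmstat`, and the function names are made up for this example.

```python
import time


def parse_vmstat(text):
    """Parse /proc/vmstat-style text ("name value" per line) into a dict."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def swap_delta(before, after):
    """Return (pages swapped in, pages swapped out) between two snapshots."""
    return (
        after.get("pswpin", 0) - before.get("pswpin", 0),
        after.get("pswpout", 0) - before.get("pswpout", 0),
    )


def watch_swap(interval_s=10.0, path="/proc/vmstat"):
    """Sample the swap counters twice, interval_s seconds apart.

    Assumption: path exists and contains pswpin/pswpout (Linux only).
    """
    with open(path) as f:
        before = parse_vmstat(f.read())
    time.sleep(interval_s)
    with open(path) as f:
        after = parse_vmstat(f.read())
    return swap_delta(before, after)
```

Calling `watch_swap()` while ndbd is under load returns two page counts; nonzero values mean the host swapped during the interval, which would be consistent with the guess above but does not by itself prove that ndbd's pages were the ones swapped.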

You can use "LockPagesInMemory" to make sure that ndbd is never forced into swap (ndbd needs to run as root to use it).
Also, ndbd should crash during startup if "LockPagesInMemory" is set and the machine does not have a sufficient amount of RAM.
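
As a rough illustration of where this setting would go, here is a config.ini fragment that combines it with the memory values from the report. The `[ndbd default]` section placement and the value `1` are assumptions (the report does not show which section the reporter used), so check the manual for the exact version before applying it.

```ini
# Sketch only: section placement and the value 1 are assumptions.
[ndbd default]
LockPagesInMemory=1
DataMemory=1000M
IndexMemory=250M
```

The data nodes have to be restarted for the change to take effect.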

Setting status to need feedback,

/Jonas
[13 Apr 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".