Description:
I have a 3-server cluster setup:
1 mgmd, 2 ndb nodes, 2 mysqld.
The mysqld and ndb nodes run on two 4 GB dual-core servers.
These have been running fine for 6 months.
I recently made changes to the schema (changes to 4 existing tables, plus 2 new tables).
The change failed and locked the whole database; I had to restore from a backup (10 hours of live data lost).
I then changed config.ini from having no values set (all defaults) to:
MaxNoOfTables=4096
MaxNoOfAttributes=24756
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
This allowed the schema changes above to be made, and all seemed fine.
The cluster then ran fine for 2 days, until this afternoon, when one node failed and brought down the whole system, forcing me to do another restore and losing some more live data. I have since added 2 more settings and increased the previous ones to:
MaxNoOfTables=4096
MaxNoOfAttributes=24756
MaxNoOfOrderedIndexes=8048
MaxNoOfUniqueHashIndexes=1512
DataMemory=1000M
IndexMemory=250M
Can someone help, please? I'm pasting the relevant log details and attaching the ndb_error_reporter log.
Is it just a case of increasing the default values in the config file? Currently we have a database that is large in schema but small in data size.
NDB3_error_log:
Time: Tuesday 20 January 2009 - 13:20:26
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4659) 0x0000000e
Program: ndbd
Pid: 19495
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.7
Version: Version 5.0.51
***EOM***
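For anyone triaging this, the error-log entry above can be split into fields with a few lines of Python. This is only a minimal sketch, assuming the entry keeps the "Key: value" layout pasted above and ends at the ***EOM*** marker; the function name parse_ndb_error_entry is made up for illustration and is not part of any MySQL tool.

```python
def parse_ndb_error_entry(text):
    """Split one 'Key: value' error-log entry into a dict, stopping at ***EOM***."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if line == "***EOM***":
            break
        if ": " not in line:
            continue  # skip blank or unstructured lines
        # Split on the first ': ' only, so values may contain colons themselves.
        key, value = line.split(": ", 1)
        fields[key] = value
    return fields


# The entry as pasted in this report.
entry = """Time: Tuesday 20 January 2009 - 13:20:26
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4659) 0x0000000e
Program: ndbd
Pid: 19495
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.7
Version: Version 5.0.51
***EOM***"""

info = parse_ndb_error_entry(entry)
print(info["Error"], "-", info["Error data"])
```

Splitting on the first ": " keeps values such as the timestamp and "QMGR (Line: 4659) 0x0000000e" intact.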
There was no error from the management node.
NDB3_out_log:
2009-01-20 13:20:26 [ndbd] INFO -- Error handler shutting down system
2009-01-20 13:20:27 [ndbd] INFO -- Error handler shutdown completed - exiting
2009-01-20 13:20:27 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2009-01-20 13:50:01 [ndbd] INFO -- Angel pid: 17234 ndb pid: 17235
2009-01-20 13:50:01 [ndbd] INFO -- NDB Cluster -- DB node 3
2009-01-20 13:50:01 [ndbd] INFO -- Version 5.0.51 --
2009-01-20 13:50:01 [ndbd] INFO -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 13:50:01 [ndbd] INFO -- Start initiated (version 5.0.51)
2009-01-20 13:52:12 [ndbd] INFO -- Angel pid: 17295 ndb pid: 17296
2009-01-20 13:52:12 [ndbd] INFO -- NDB Cluster -- DB node 3
2009-01-20 13:52:12 [ndbd] INFO -- Version 5.0.51 --
2009-01-20 13:52:12 [ndbd] INFO -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 13:52:13 [ndbd] INFO -- Start initiated (version 5.0.51)
Management server closed connection early. It is probably being shut down (or has problems). We will retry the connection.
2009-01-20 13:58:04 [ndbd] INFO -- Angel pid: 17362 ndb pid: 17363
2009-01-20 13:58:04 [ndbd] INFO -- NDB Cluster -- DB node 3
2009-01-20 13:58:04 [ndbd] INFO -- Version 5.0.51 --
2009-01-20 13:58:04 [ndbd] INFO -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 13:58:05 [ndbd] INFO -- Start initiated (version 5.0.51)
2009-01-20 14:15:09 [ndbd] INFO -- Angel pid: 18014 ndb pid: 18015
2009-01-20 14:15:09 [ndbd] INFO -- NDB Cluster -- DB node 3
2009-01-20 14:15:09 [ndbd] INFO -- Version 5.0.51 --
2009-01-20 14:15:09 [ndbd] INFO -- Configuration fetched at 77.92.68.84 port 1186
2009-01-20 14:15:09 [ndbd] INFO -- Start initiated (version 5.0.51)
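To make the restart pattern in the out log easier to see, the "Angel pid" lines can be pulled out with a small script. This is a rough sketch that assumes the line layout shown above; the names ANGEL_RE and restart_attempts are hypothetical helpers, not part of ndbd or ndb_mgm.

```python
import re

# Matches lines such as:
# 2009-01-20 13:50:01 [ndbd] INFO -- Angel pid: 17234 ndb pid: 17235
ANGEL_RE = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[ndbd\] INFO\s+--\s+"
    r"Angel pid: (\d+) ndb pid: (\d+)"
)


def restart_attempts(log_text):
    """Return (timestamp, angel_pid, ndb_pid) for every ndbd start in the log."""
    attempts = []
    for line in log_text.splitlines():
        match = ANGEL_RE.match(line.strip())
        if match:
            attempts.append(
                (match.group(1), int(match.group(2)), int(match.group(3)))
            )
    return attempts


# A subset of the lines pasted above.
sample = """2009-01-20 13:20:27 [ndbd] INFO -- Error handler shutdown completed - exiting
2009-01-20 13:50:01 [ndbd] INFO -- Angel pid: 17234 ndb pid: 17235
2009-01-20 13:50:01 [ndbd] INFO -- Start initiated (version 5.0.51)
2009-01-20 13:52:12 [ndbd] INFO -- Angel pid: 17295 ndb pid: 17296
2009-01-20 13:58:04 [ndbd] INFO -- Angel pid: 17362 ndb pid: 17363
2009-01-20 14:15:09 [ndbd] INFO -- Angel pid: 18014 ndb pid: 18015"""

attempts = restart_attempts(sample)
for timestamp, angel_pid, ndb_pid in attempts:
    print(timestamp, angel_pid, ndb_pid)
```

Run against the log above, this lists four separate starts of node 3 between 13:50 and 14:15, each with a new pair of pids.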
How to repeat:
It is not possible to force a repeat; it just happened and I don't know why.