Description:
I am running the following configuration:
1 Mgm node
2 API nodes
2 Data nodes
I am using a Xen server with CentOS 5 running on Dom0 and the DomUs. It is a test system. Each DomU has 2 GB of RAM. The database is not big at all (just a couple of rows).
The cluster runs fine for 7-8 days and then all nodes except the MGM node become disconnected.
Here is the config.ini:
# Options affecting ndbd processes on all data nodes:
[ndbd default]
NoOfReplicas=2 # Number of replicas
DataMemory=80M # How much memory to allocate for data storage
IndexMemory=18M # How much memory to allocate for index storage
# For DataMemory and IndexMemory, we have used the
# default values. Since the "world" database takes up
# only about 500KB, this should be more than enough for
# this example Cluster setup.
TransactionDeadlockDetectionTimeout=8000
# TCP/IP options:
#[tcp default]
#portnumber=2202 # This is the default; however, you can use any port that is free
# for all the hosts in the cluster
# Note: It is recommended that you do not specify the port
# number at all and allow the default value to be used instead
# Management process options:
[ndb_mgmd]
hostname=172.17.1.18 # Hostname or IP address of MGM node
#datadir=/var/lib/mysql-cluster # Directory for MGM node log files
# Options for data node "A":
[ndbd]
# (one [ndbd] section per data node)
hostname=172.17.1.8 # Hostname or IP address
datadir=/usr/local/mysql/data # Directory for this data node's data files
# Options for data node "B":
[ndbd]
hostname=172.17.1.9 # Hostname or IP address
datadir=/usr/local/mysql/data # Directory for this data node's data files
# SQL node options:
[mysqld]
hostname=172.17.1.17
[mysqld]
hostname=172.17.1.16
[mysqld default]
[tcp default]
The following is the log file:
2008-07-26 20:20:19 [MgmSrvr] INFO -- Node 2: Local checkpoint 211 started. Keep GCI = 386462 oldest restorable GCI = 297540
2008-07-26 21:22:25 [MgmSrvr] INFO -- Node 2: Local checkpoint 212 started. Keep GCI = 388366 oldest restorable GCI = 297540
2008-07-26 22:24:37 [MgmSrvr] INFO -- Node 2: Local checkpoint 213 started. Keep GCI = 390268 oldest restorable GCI = 297540
2008-07-26 23:26:38 [MgmSrvr] INFO -- Node 2: Local checkpoint 214 started. Keep GCI = 392172 oldest restorable GCI = 297540
2008-07-27 00:28:49 [MgmSrvr] INFO -- Node 2: Local checkpoint 215 started. Keep GCI = 394075 oldest restorable GCI = 297540
2008-07-27 01:30:50 [MgmSrvr] INFO -- Node 2: Local checkpoint 216 started. Keep GCI = 395978 oldest restorable GCI = 297540
2008-07-27 02:32:59 [MgmSrvr] INFO -- Node 2: Local checkpoint 217 started. Keep GCI = 397884 oldest restorable GCI = 297540
2008-07-27 03:34:55 [MgmSrvr] INFO -- Node 2: Local checkpoint 218 started. Keep GCI = 399788 oldest restorable GCI = 297540
2008-07-27 04:22:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:19 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:22:34 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:22:37 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:22:42 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=34630023a6911e7d]
2008-07-27 04:22:51 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:22:51 [MgmSrvr] ALERT -- Node 2: Node 4 Disconnected
2008-07-27 04:22:51 [MgmSrvr] INFO -- Node 2: Communication to Node 4 closed
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:22:51 [MgmSrvr] INFO -- Node 2: Communication to Node 4 opened
2008-07-27 04:22:52 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:23:05 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:23:15 [MgmSrvr] INFO -- Node 3: Communication to Node 5 closed
2008-07-27 04:23:15 [MgmSrvr] ALERT -- Node 3: Node 5 Disconnected
2008-07-27 04:23:17 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=34630026a691bc30]
2008-07-27 04:23:20 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:23:20 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:23:20 [MgmSrvr] INFO -- Node 3: Communication to Node 5 opened
2008-07-27 04:23:20 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:23:25 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:23:53 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:23:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:24:00 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=34630028a691fc47]
2008-07-27 04:24:04 [MgmSrvr] INFO -- Node 3: Communication to Node 5 opened
2008-07-27 04:24:09 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:24:09 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:24:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:24:18 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:24:20 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:24:27 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:24:37 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:24:54 [MgmSrvr] INFO -- Node 2: Communication to Node 4 opened
2008-07-27 04:24:54 [MgmSrvr] INFO -- Node 2: Communication to Node 5 opened
2008-07-27 04:24:54 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:24:54 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:24:54 [MgmSrvr] INFO -- Node 2: Lost arbitrator node 1 - timeout [state=5]
2008-07-27 04:24:54 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:24:57 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:25:07 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:25:07 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:25:19 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:25:21 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
2008-07-27 04:25:28 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:25:31 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:25:33 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-07-27 04:25:33 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:25:38 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
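To make a log like the one above easier to read, here is a rough Python sketch that counts entries per severity and reporting node and pulls out the forced shutdowns. It is only an illustration: the names ENTRY and summarize are made up for this example, and the pattern assumes the "timestamp [MgmSrvr] SEVERITY -- Node N: message" layout seen in this log. It also copes with two entries written onto one line without a newline between them.

```python
import re
from collections import Counter

# One cluster-log entry: timestamp, severity, reporting node, message.
# The lookahead ends a message at the next timestamp (or end of input),
# so entries that were written without a newline between them still split.
ENTRY = re.compile(
    r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] "
    r"(INFO|WARNING|ALERT) -- Node (\d+): (.*?)"
    r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[MgmSrvr\]|$)",
    re.S,
)


def summarize(log_text):
    """Count (severity, reporting node) pairs and collect forced shutdowns."""
    counts = Counter()
    shutdowns = []
    for ts, severity, node, message in ENTRY.findall(log_text):
        message = message.strip()
        counts[(severity, int(node))] += 1
        if message.startswith("Forced node shutdown"):
            shutdowns.append((ts, int(node), message))
    return counts, shutdowns


# A few entries taken from the log above (the last message is abbreviated).
sample = (
    "2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 2"
    "2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2\n"
    "2008-07-27 04:25:31 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected\n"
    "2008-07-27 04:25:33 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. "
    "Initiated by signal 11.\n"
)

counts, shutdowns = summarize(sample)
for (severity, node), n in sorted(counts.items()):
    print(f"{severity:8s} reported by node {node}: {n}")
for ts, node, message in shutdowns:
    print(f"{ts} node {node}: {message}")
```

Running it over the full cluster log instead of the sample should show which node reports the first missed heartbeat and in what order the data nodes shut down.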
How to repeat:
Use the same configuration. The failure recurs every time, after about 7-8 days of running.