Description:
I am running the following configuration:
1 MGM node
2 API nodes
2 Data nodes
I am using a Xen server with CentOS 5 running on Dom0 and the DomUs. It is a test system. Each DomU has 2 GB of RAM. The database is very small (just a couple of rows).
The cluster runs fine for 7-8 days, and then all nodes except the MGM node become disconnected.
Here is the config.ini:
# Options affecting ndbd processes on all data nodes:
[ndbd default]
NoOfReplicas=2 # Number of replicas
DataMemory=80M # How much memory to allocate for data storage
IndexMemory=18M # How much memory to allocate for index storage
# For DataMemory and IndexMemory, we have used the
# default values. Since the "world" database takes up
# only about 500KB, this should be more than enough for
# this example Cluster setup.
TransactionDeadlockDetectionTimeout=8000
# TCP/IP options:
#[tcp default]
#portnumber=2202 # This is the default; however, you can use any port that is free
# for all the hosts in the cluster
# Note: It is recommended that you do not specify the port
# number at all and allow the default value to be used instead
# Management process options:
[ndb_mgmd]
hostname=172.17.1.18 # Hostname or IP address of MGM node
#datadir=/var/lib/mysql-cluster # Directory for MGM node log files
# Options for data node "A":
[ndbd]
# (one [ndbd] section per data node)
hostname=172.17.1.8 # Hostname or IP address
datadir=/usr/local/mysql/data # Directory for this data node's data files
# Options for data node "B":
[ndbd]
hostname=172.17.1.9 # Hostname or IP address
datadir=/usr/local/mysql/data # Directory for this data node's data files
# SQL node options:
[mysqld]
hostname=172.17.1.17
[mysqld]
hostname=172.17.1.16
[mysqld default]
[tcp default]
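Since no NodeId is set in this config.ini, the node numbers in the log below have to be matched to hosts by hand. The following is a minimal sketch of that mapping, not part of the original setup. It assumes that node IDs are handed out in the order the sections appear in the file, starting at 1, which is consistent with the log (node 1 is the arbitrator, nodes 2 and 3 are data nodes, nodes 4 and 5 are API nodes). The function name assign_node_ids and the trimmed CONFIG string are illustrative only.

```python
import re

# Trimmed copy of the config.ini above (comments and memory settings omitted).
CONFIG = """\
[ndbd default]
NoOfReplicas=2
[ndb_mgmd]
hostname=172.17.1.18
[ndbd]
hostname=172.17.1.8
[ndbd]
hostname=172.17.1.9
[mysqld]
hostname=172.17.1.17
[mysqld]
hostname=172.17.1.16
[mysqld default]
[tcp default]
"""

NODE_TYPES = {"ndb_mgmd", "ndbd", "mysqld"}
SECTION = re.compile(r"^\[(\w+)(\s+default)?\]", re.IGNORECASE)


def assign_node_ids(config_text):
    """Return one dict per node section, numbered in file order.

    ASSUMPTION: IDs are assigned sequentially from 1 in the order the
    sections appear, because no NodeId is set explicitly in the file.
    """
    nodes = []
    current = None
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        match = SECTION.match(line)
        if match:
            current = None
            name = match.group(1).lower()
            # "[... default]" sections hold defaults, not a node.
            if match.group(2) is None and name in NODE_TYPES:
                current = {"id": len(nodes) + 1, "type": name, "hostname": None}
                nodes.append(current)
        elif current is not None and "=" in line:
            key, value = line.split("=", 1)
            if key.strip().lower() == "hostname":
                current["hostname"] = value.strip()
    return nodes


if __name__ == "__main__":
    for node in assign_node_ids(CONFIG):
        print(node["id"], node["type"], node["hostname"])
```

Under that assumption, node 2 is 172.17.1.8, node 3 is 172.17.1.9, and nodes 4 and 5 are the two SQL nodes.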
The following is the log file:
2008-07-26 20:20:19 [MgmSrvr] INFO -- Node 2: Local checkpoint 211 started. Keep GCI = 386462 oldest restorable GCI = 297540
2008-07-26 21:22:25 [MgmSrvr] INFO -- Node 2: Local checkpoint 212 started. Keep GCI = 388366 oldest restorable GCI = 297540
2008-07-26 22:24:37 [MgmSrvr] INFO -- Node 2: Local checkpoint 213 started. Keep GCI = 390268 oldest restorable GCI = 297540
2008-07-26 23:26:38 [MgmSrvr] INFO -- Node 2: Local checkpoint 214 started. Keep GCI = 392172 oldest restorable GCI = 297540
2008-07-27 00:28:49 [MgmSrvr] INFO -- Node 2: Local checkpoint 215 started. Keep GCI = 394075 oldest restorable GCI = 297540
2008-07-27 01:30:50 [MgmSrvr] INFO -- Node 2: Local checkpoint 216 started. Keep GCI = 395978 oldest restorable GCI = 297540
2008-07-27 02:32:59 [MgmSrvr] INFO -- Node 2: Local checkpoint 217 started. Keep GCI = 397884 oldest restorable GCI = 297540
2008-07-27 03:34:55 [MgmSrvr] INFO -- Node 2: Local checkpoint 218 started. Keep GCI = 399788 oldest restorable GCI = 297540
2008-07-27 04:22:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:19 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:22:34 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:22:37 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:22:42 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=34630023a6911e7d]
2008-07-27 04:22:51 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:22:51 [MgmSrvr] ALERT -- Node 2: Node 4 Disconnected
2008-07-27 04:22:51 [MgmSrvr] INFO -- Node 2: Communication to Node 4 closed
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:22:51 [MgmSrvr] INFO -- Node 2: Communication to Node 4 opened
2008-07-27 04:22:52 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:23:05 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:23:15 [MgmSrvr] INFO -- Node 3: Communication to Node 5 closed
2008-07-27 04:23:15 [MgmSrvr] ALERT -- Node 3: Node 5 Disconnected
2008-07-27 04:23:17 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=34630026a691bc30]
2008-07-27 04:23:20 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:23:20 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:23:20 [MgmSrvr] INFO -- Node 3: Communication to Node 5 opened
2008-07-27 04:23:20 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:23:25 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:23:53 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:23:57 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:24:00 [MgmSrvr] INFO -- Node 3: Prepare arbitrator node 1 [ticket=34630028a691fc47]
2008-07-27 04:24:04 [MgmSrvr] INFO -- Node 3: Communication to Node 5 opened
2008-07-27 04:24:09 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:24:09 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:24:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:24:18 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:24:20 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:24:27 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:24:37 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:24:54 [MgmSrvr] INFO -- Node 2: Communication to Node 4 opened
2008-07-27 04:24:54 [MgmSrvr] INFO -- Node 2: Communication to Node 5 opened
2008-07-27 04:24:54 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:24:54 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:24:54 [MgmSrvr] INFO -- Node 2: Lost arbitrator node 1 - timeout [state=5]
2008-07-27 04:24:54 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:24:57 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:25:07 [MgmSrvr] INFO -- Node 1: Node 3 Connected
2008-07-27 04:25:07 [MgmSrvr] INFO -- Node 1: Node 2 Connected
2008-07-27 04:25:19 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:25:21 [MgmSrvr] WARNING -- Node 3: Node 2 missed heartbeat 3
2008-07-27 04:25:28 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:25:31 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:25:33 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-07-27 04:25:33 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:25:38 [MgmSrvr] ALERT -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
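To make the disconnect pattern in a log like the one above easier to see, the events can be tallied per node. This is a rough sketch only: the regular expressions are written against the line format shown in this report, the names split_entries and summarize are hypothetical, and SAMPLE is a short excerpt of the log rather than the full file. It also splits entries that were written onto one line without a newline between them.

```python
import re
from collections import Counter

# Short excerpt from the cluster log above, including one line where two
# entries were written without a newline between them.
SAMPLE = """\
2008-07-27 04:22:15 [MgmSrvr] ALERT -- Node 1: Node 2 Disconnected
2008-07-27 04:22:19 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 5 missed heartbeat 22008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING -- Node 2: Node 1 missed heartbeat 3
"""

# Zero-width split point in front of every "<timestamp> [MgmSrvr]" prefix.
ENTRY_START = re.compile(r"(?=\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} \[MgmSrvr\])")
ENTRY = re.compile(
    r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) \[MgmSrvr\] (\w+) -- Node (\d+): (.*)$"
)
DISCONNECTED = re.compile(r"Node (\d+) Disconnected")
MISSED_HEARTBEAT = re.compile(r"Node (\d+) missed heartbeat (\d+)")


def split_entries(text):
    """Split raw log text into one string per entry, even if entries are fused."""
    return [part.strip() for part in ENTRY_START.split(text) if part.strip()]


def summarize(text):
    """Count disconnect and missed-heartbeat events per affected node."""
    counts = Counter()
    for entry in split_entries(text):
        match = ENTRY.match(entry)
        if not match:
            continue  # skip anything that is not in the expected format
        message = match.group(4)
        disconnected = DISCONNECTED.match(message)
        missed = MISSED_HEARTBEAT.match(message)
        if disconnected:
            counts[("disconnected", int(disconnected.group(1)))] += 1
        elif missed:
            counts[("missed_heartbeat", int(missed.group(1)))] += 1
    return counts


if __name__ == "__main__":
    for (event, node), count in sorted(summarize(SAMPLE).items()):
        print(f"node {node}: {event} x{count}")
```

Run against the full cluster log, a tally like this shows which nodes are reported as disconnected most often during the 04:22 to 04:25 window.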
How to repeat:
Use the same configuration. The failure occurs every time, after the cluster has been running for several days.