Bug #38530 All nodes crash in the cluster after a couple of days of running
Submitted: 2 Aug 2008 12:39 Modified: 11 Sep 2008 6:24
Reporter: Farid Fadaie
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 6.3.16    OS: Linux
Assigned to:    CPU Architecture: Any
Tags: cluster, crash
Triage: D1 (Critical)

[2 Aug 2008 12:39] Farid Fadaie
Description:
I am running the following configuration:

1 Mgm node
2 API nodes
2 Data nodes

I am using a Xen server with CentOS 5 running on Dom0 and the DomUs. It is a test system. Each DomU runs with 2 GB of RAM. The database is not big at all (just a couple of rows).
The cluster runs fine for 7-8 days and then all nodes, except the MGM node, become disconnected.
Here is the config.ini:

# Options affecting ndbd processes on all data nodes:
[ndbd default]
NoOfReplicas=2    # Number of replicas
DataMemory=80M    # How much memory to allocate for data storage
IndexMemory=18M   # How much memory to allocate for index storage
                  # For DataMemory and IndexMemory, we have used the
                  # default values. Since the "world" database takes up
                  # only about 500KB, this should be more than enough for
                  # this example Cluster setup.
TransactionDeadlockDetectionTimeout=8000
# TCP/IP options:
#[tcp default]
#portnumber=2202   # This is the default; however, you can use any port that is free
                  # for all the hosts in the cluster
                  # Note: It is recommended that you do not specify the port
                  # number at all and allow the default value to be used instead

# Management process options:
[ndb_mgmd]
hostname=172.17.1.18           # Hostname or IP address of MGM node
#datadir=/var/lib/mysql-cluster  # Directory for MGM node log files

# Options for data node "A":
[ndbd]
                                # (one [ndbd] section per data node)
hostname=172.17.1.8           # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

# Options for data node "B":
[ndbd]
hostname=172.17.1.9           # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

# SQL node options:
[mysqld]
hostname=172.17.1.17
[mysqld]
hostname=172.17.1.16
[mysqld default]
[tcp default]

The following is the log file:

2008-07-26 20:20:19 [MgmSrvr] INFO     -- Node 2: Local checkpoint 211 started. Keep GCI = 386462 oldest restorable GCI = 297540
2008-07-26 21:22:25 [MgmSrvr] INFO     -- Node 2: Local checkpoint 212 started. Keep GCI = 388366 oldest restorable GCI = 297540
2008-07-26 22:24:37 [MgmSrvr] INFO     -- Node 2: Local checkpoint 213 started. Keep GCI = 390268 oldest restorable GCI = 297540
2008-07-26 23:26:38 [MgmSrvr] INFO     -- Node 2: Local checkpoint 214 started. Keep GCI = 392172 oldest restorable GCI = 297540
2008-07-27 00:28:49 [MgmSrvr] INFO     -- Node 2: Local checkpoint 215 started. Keep GCI = 394075 oldest restorable GCI = 297540
2008-07-27 01:30:50 [MgmSrvr] INFO     -- Node 2: Local checkpoint 216 started. Keep GCI = 395978 oldest restorable GCI = 297540
2008-07-27 02:32:59 [MgmSrvr] INFO     -- Node 2: Local checkpoint 217 started. Keep GCI = 397884 oldest restorable GCI = 297540
2008-07-27 03:34:55 [MgmSrvr] INFO     -- Node 2: Local checkpoint 218 started. Keep GCI = 399788 oldest restorable GCI = 297540
2008-07-27 04:22:15 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:22:15 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:22:19 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:22:34 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:22:37 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:22:42 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=34630023a6911e7d]
2008-07-27 04:22:51 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:22:51 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2008-07-27 04:22:51 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2008-07-27 04:22:51 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:22:51 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2008-07-27 04:22:52 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:22:57 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:23:05 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:23:15 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2008-07-27 04:23:15 [MgmSrvr] ALERT    -- Node 3: Node 5 Disconnected
2008-07-27 04:23:17 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=34630026a691bc30]
2008-07-27 04:23:20 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:23:20 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:23:20 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 opened
2008-07-27 04:23:20 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:23:25 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:23:53 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:23:57 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:24:00 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=34630028a691fc47]
2008-07-27 04:24:04 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 opened
2008-07-27 04:24:09 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:24:09 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:24:15 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:24:18 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:24:20 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:24:27 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:24:37 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:24:54 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2008-07-27 04:24:54 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 opened
2008-07-27 04:24:54 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:24:54 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:24:54 [MgmSrvr] INFO     -- Node 2: Lost arbitrator node 1 - timeout [state=5]
2008-07-27 04:24:54 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:24:57 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:25:07 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:25:07 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:25:19 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:25:21 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 3
2008-07-27 04:25:28 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:25:31 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:25:33 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-07-27 04:25:33 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:25:38 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

How to repeat:
Use the same configuration. It happens consistently after a couple of days of running.
[2 Aug 2008 15:02] Hartmut Holzgraefe
The main cluster log is not sufficient to solve this case; we also need to see the data nodes' output, error, and trace log files.

You can collect a full set of log files using the ndb_error_reporter tool:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-utilities-ndb-error-reporter.html
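[For reference, a typical invocation looks like the sketch below. It assumes the tool is run on the management host, that passwordless ssh/scp to the data nodes is set up, and that config.ini lives in /var/lib/mysql-cluster (adjust the path and user to your setup):

```shell
# Run on the management node; the tool reads config.ini to find the
# data nodes and collects their log files via scp.
# "root" here is an assumed ssh user on the data nodes.
ndb_error_reporter /var/lib/mysql-cluster/config.ini root

# It produces an archive named ndb_error_report_<timestamp>.tar.bz2
# in the current directory.
```
]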
[2 Aug 2008 15:30] Farid Fadaie
I have the file from ndb_error_reporter but it is big! The file size here is limited to 500K. Mine is 972K. I can email it to you if you want.
[2 Aug 2008 15:44] Miguel Solorzano
You can upload the file at: ftp://ftp.mysql.com/pub/mysql/upload with a name identifying this bug report, e.g. bug38530.zip, and comment here when done. Thanks in advance.
[2 Aug 2008 15:52] Farid Fadaie
Done. (bug38530.tar.bz2)
[2 Aug 2008 17:13] Farid Fadaie
I don't know if this is related or not (as I don't know how NDB works), but the CPU clock on Xen domains is not accurate at all. There are some issues with the motherboard that I am using, and Xen cannot handle it well for now.
I enforce accurate time every 5 minutes from a local NTP server, though, so the time/date is accurate enough.
As far as I know it shouldn't matter, but it was worth mentioning.
[10 Aug 2008 21:29] Farid Fadaie
I restarted it, and today something went wrong again! This time it looks like the management node crashed and the other nodes couldn't reconnect.
What I don't understand, however, is why the other nodes would disconnect so often in the first place (just look at the log).
I uploaded the report (ndb_error_report_20080810170629.tar.bz2)
[11 Aug 2008 6:22] Bernd Ocklin
Hi Farid,

please do not run Cluster in a virtual machine setting. It is very likely that all of the errors I see come from Xen runtime behavior (e.g. watchdog termination due to system overload, meaning the scheduler left no CPU time for ndbd). Please re-test on real hardware and let us know if you still have issues.
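[Editor's note, not part of Bernd's reply: for anyone who must stay on virtualized hardware in the meantime, the [ndbd default] parameters below exist for tolerating longer scheduling stalls before a node is declared dead. The values shown are illustrative assumptions only, not recommendations from this report:

```ini
[ndbd default]
# Heartbeat intervals (believed default ~1500 ms in this version);
# a node missing several consecutive heartbeats is declared dead.
HeartbeatIntervalDbDb=5000
HeartbeatIntervalDbApi=5000
# Watchdog check interval (believed default 6000 ms); raising it gives
# ndbd more slack when the hypervisor starves it of CPU time.
TimeBetweenWatchDogCheck=30000
```
]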
[11 Sep 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".