Bug #38530 All nodes crash in the cluster after a couple of days of running
Submitted: 2 Aug 2008 12:39 Modified: 11 Sep 2008 6:24
Reporter: Farid Fadaie
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S1 (Critical)
Version: 6.3.16    OS: Linux
Assigned to:    CPU Architecture: Any
Tags: cluster, crash
Triage: D1 (Critical)

[2 Aug 2008 12:39] Farid Fadaie
Description:
I am running the following configuration:

1 Mgm node
2 API nodes
2 Data nodes

I am using a Xen server with CentOS 5 running on Dom0 and the DomUs. It is a test system. Each DomU runs with 2 GB of RAM. The database is not big at all (just a couple of rows).
The cluster runs fine for 7-8 days and then all nodes, except the MGM node, become disconnected.
Here is the config.ini:

# Options affecting ndbd processes on all data nodes:
[ndbd default]
NoOfReplicas=2    # Number of replicas
DataMemory=80M    # How much memory to allocate for data storage
IndexMemory=18M   # How much memory to allocate for index storage
                  # For DataMemory and IndexMemory, we have used the
                  # default values. Since the "world" database takes up
                  # only about 500KB, this should be more than enough for
                  # this example Cluster setup.
TransactionDeadlockDetectionTimeout=8000
# TCP/IP options:
#[tcp default]
#portnumber=2202   # This is the default; however, you can use any port that is free
                  # for all the hosts in the cluster
                  # Note: It is recommended that you do not specify the port
                  # number at all and allow the default value to be used instead

# Management process options:
[ndb_mgmd]
hostname=172.17.1.18           # Hostname or IP address of MGM node
#datadir=/var/lib/mysql-cluster  # Directory for MGM node log files

# Options for data node "A":
[ndbd]
                                # (one [ndbd] section per data node)
hostname=172.17.1.8           # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

# Options for data node "B":
[ndbd]
hostname=172.17.1.9           # Hostname or IP address
datadir=/usr/local/mysql/data   # Directory for this data node's data files

# SQL node options:
[mysqld]
hostname=172.17.1.17
[mysqld]
hostname=172.17.1.16
[mysqld default]
[tcp default]

The following is the log file:

2008-07-26 20:20:19 [MgmSrvr] INFO     -- Node 2: Local checkpoint 211 started. Keep GCI = 386462 oldest restorable GCI = 297540
2008-07-26 21:22:25 [MgmSrvr] INFO     -- Node 2: Local checkpoint 212 started. Keep GCI = 388366 oldest restorable GCI = 297540
2008-07-26 22:24:37 [MgmSrvr] INFO     -- Node 2: Local checkpoint 213 started. Keep GCI = 390268 oldest restorable GCI = 297540
2008-07-26 23:26:38 [MgmSrvr] INFO     -- Node 2: Local checkpoint 214 started. Keep GCI = 392172 oldest restorable GCI = 297540
2008-07-27 00:28:49 [MgmSrvr] INFO     -- Node 2: Local checkpoint 215 started. Keep GCI = 394075 oldest restorable GCI = 297540
2008-07-27 01:30:50 [MgmSrvr] INFO     -- Node 2: Local checkpoint 216 started. Keep GCI = 395978 oldest restorable GCI = 297540
2008-07-27 02:32:59 [MgmSrvr] INFO     -- Node 2: Local checkpoint 217 started. Keep GCI = 397884 oldest restorable GCI = 297540
2008-07-27 03:34:55 [MgmSrvr] INFO     -- Node 2: Local checkpoint 218 started. Keep GCI = 399788 oldest restorable GCI = 297540
2008-07-27 04:22:15 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:22:15 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:22:19 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:22:34 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:22:37 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:22:42 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=34630023a6911e7d]
2008-07-27 04:22:51 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:22:51 [MgmSrvr] ALERT    -- Node 2: Node 4 Disconnected
2008-07-27 04:22:51 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 closed
2008-07-27 04:22:51 [MgmSrvr] WARNING  -- Node 2: Node 5 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:22:51 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:22:51 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2008-07-27 04:22:52 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:22:57 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:23:05 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:23:15 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 closed
2008-07-27 04:23:15 [MgmSrvr] ALERT    -- Node 3: Node 5 Disconnected
2008-07-27 04:23:17 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=34630026a691bc30]
2008-07-27 04:23:20 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:23:20 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:23:20 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 opened
2008-07-27 04:23:20 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:23:25 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:23:53 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:23:57 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:24:00 [MgmSrvr] INFO     -- Node 3: Prepare arbitrator node 1 [ticket=34630028a691fc47]
2008-07-27 04:24:04 [MgmSrvr] INFO     -- Node 3: Communication to Node 5 opened
2008-07-27 04:24:09 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:24:09 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:24:15 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:24:18 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:24:20 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:24:27 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:24:37 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:24:54 [MgmSrvr] INFO     -- Node 2: Communication to Node 4 opened
2008-07-27 04:24:54 [MgmSrvr] INFO     -- Node 2: Communication to Node 5 opened
2008-07-27 04:24:54 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 2
2008-07-27 04:24:54 [MgmSrvr] WARNING  -- Node 2: Node 1 missed heartbeat 3
2008-07-27 04:24:54 [MgmSrvr] INFO     -- Node 2: Lost arbitrator node 1 - timeout [state=5]
2008-07-27 04:24:54 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:24:57 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:25:07 [MgmSrvr] INFO     -- Node 1: Node 3 Connected
2008-07-27 04:25:07 [MgmSrvr] INFO     -- Node 1: Node 2 Connected
2008-07-27 04:25:19 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 2
2008-07-27 04:25:21 [MgmSrvr] WARNING  -- Node 3: Node 2 missed heartbeat 3
2008-07-27 04:25:28 [MgmSrvr] ALERT    -- Node 1: Node 3 Disconnected
2008-07-27 04:25:31 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:25:33 [MgmSrvr] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 11. Caused by error 6000: 'Error OS signal received(Internal error, programming error or missing error message, please report a bug). Temporary error, restart node'.
2008-07-27 04:25:33 [MgmSrvr] ALERT    -- Node 1: Node 2 Disconnected
2008-07-27 04:25:38 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

How to repeat:
Use the same configuration. It happens consistently after a couple of days of running.
[2 Aug 2008 15:02] Hartmut Holzgraefe
The main cluster log is not sufficient to solve this case; we also need to see the data nodes' output, error, and trace log files.

You can collect a full set of log files using the ndb_error_reporter tool:

http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-utilities-ndb-error-reporter.html
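[For reference, a typical invocation looks like the sketch below. It assumes the tool is run on the management host, that passwordless ssh/scp to the data nodes is set up, and that config.ini lives in /var/lib/mysql-cluster (adjust the path and user to your setup):

```shell
# Run on the management node; the tool reads config.ini to find the
# data nodes and collects their log files via scp.
# "root" here is an assumed ssh user on the data nodes.
ndb_error_reporter /var/lib/mysql-cluster/config.ini root

# It produces an archive named ndb_error_report_<timestamp>.tar.bz2
# in the current directory.
```
]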
[2 Aug 2008 15:30] Farid Fadaie
I have the file from ndb_error_reporter but it is big! The file size here is limited to 500K. Mine is 972K. I can email it to you if you want.
[2 Aug 2008 15:44] Miguel Solorzano
You can upload the file at: ftp://ftp.mysql.com/pub/mysql/upload with a name identifying this bug report, e.g. bug38530.zip, and comment here when done. Thanks in advance.
[2 Aug 2008 15:52] Farid Fadaie
Done. (bug38530.tar.bz2)
[2 Aug 2008 17:13] Farid Fadaie
I don't know if this is related or not (as I don't know how NDB works), but the CPU clock on Xen domains is not accurate at all. There are some issues with the motherboard that I am using, and Xen cannot handle it well for now.
I enforce accurate time every 5 minutes from a local NTP server, though, so the time/date is accurate enough.
As far as I know it shouldn't matter, but it was worth mentioning.
[10 Aug 2008 21:29] Farid Fadaie
I restarted it, and today something went wrong again! This time it looks like the management node crashed and the other nodes couldn't reconnect.
What I don't understand, however, is why the other nodes would disconnect so often in the first place (just look at the log).
I uploaded the report (ndb_error_report_20080810170629.tar.bz2)
[11 Aug 2008 6:22] Bernd Ocklin
Hi Farid,

please do not run Cluster in a virtual machine setting. It is very likely that all of the errors I see come from Xen runtime behavior (e.g. watchdog termination due to system overload, meaning the scheduler left no CPU time for ndbd). Please re-test on real hardware and let us know if you still have issues.
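[Editor's note, not part of Bernd's reply: for anyone who must stay on virtualized hardware in the meantime, the [ndbd default] parameters below exist for tolerating longer scheduling stalls before a node is declared dead. The values shown are illustrative assumptions only, not recommendations from this report:

```ini
[ndbd default]
# Heartbeat intervals (believed default ~1500 ms in this version);
# a node missing several consecutive heartbeats is declared dead.
HeartbeatIntervalDbDb=5000
HeartbeatIntervalDbApi=5000
# Watchdog check interval (believed default 6000 ms); raising it gives
# ndbd more slack when the hypervisor starves it of CPU time.
TimeBetweenWatchDogCheck=30000
```
]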
[11 Sep 2008 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".