Bug #25190 Data nodes unstable
Submitted: 19 Dec 2006 22:12   Modified: 16 Mar 2008 14:46
Reporter: jim shnider
Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine   Severity: S2 (Serious)
Version: 5.1.14, 5.0.32, 5.0.26, 5.1.16   OS: Linux (linux)
Assigned to:   CPU Architecture: Any
Tags: 2 node, 2 replica, 3 node, 3 replica, 5.0.26, 5.0.32, 5.1.14, centos 4.3, cluster, crash, data node, linux, ndbcluster, ndbd, unstable, x86_64

[19 Dec 2006 22:12] jim shnider
Description:
I recently upgraded to 5.0.26 from a mish-mash of 5.0.19 and 5.0.20 installations of mysql running across 5 systems: 3x(ndbd + mysqld) + 2x(ndb_mgmd).

(I upgraded because, once a week or so, the mysqld nodes would lose their connections to the cluster and need to be restarted to become functional again.  The databases were still there and filled with tables but queries would complain that they could not obtain a lock.)

Once the upgrade was complete, the cluster ran for a little more than a day before all ndbd nodes went down.

I didn't know what to make of this at the time, so I got all of the ndbd nodes running again, and walked away.

The next day, the data nodes were down again.  Again I got them going.  This pattern repeated for a while, along with long periods of down-time in which I did not bother to restart the data nodes and get the cluster working again.

Here, I have to apologize because my cluster has been doing this for a while, and I have only recently been gathering rigorous documentation.

I recall that the various crashes seemed to have different errors logged as the cause of failure in each instance.  At least once, after a crash, I had to rebuild each of the mysql-cluster data directories ('ndbfs'es) from scratch in order to get the ndbd nodes to start up again (with --initial, of course).  Most times, I was able to simply execute ndbd on each data node machine.  In the last incident that occurred in the 3x2 configuration, the crash brought down the first two data nodes but left the third one operational (the only time any of the data nodes survived the loss of another node).

Before the last two or three crash events, I found a bug in this database (I think this one: http://bugs.mysql.com/bug.php?id=23297 ) that came up in a search for a particular bit of error text.  One suggested fix for that bug was increasing the memory available for data, so I added DataMemory and IndexMemory values to [ndbd default] in config.ini.

(Our database uses less than 400kB in a complete dump.  I imagine that it uses less than 4MB in memory with all indexes.  That figure is significantly less than the 80MB and 18MB allocated by default.  I believed that if adding large memory specifications was going to help, it would be for some indirect or non-obvious reason.)
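
For reference, the additions were along these lines.  The figures here are only illustrative, not the exact values I used (those are in the attached config.ini):
====>
[ndbd default]
# Illustrative values only; the defaults are DataMemory=80M, IndexMemory=18M.
DataMemory=256M
IndexMemory=64M
<====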

As one might expect, the memory settings did not fix the problem.  After the above-mentioned 'last incident', I tried rearranging the nodes.  The new arrangement I tried was 3x(data + api + mgmt).

(
I did this for three reasons.

Firstly, the three machines with data and api nodes were 64-bit Xeon machines, but the two mgmt nodes were 32-bit Athlon machines.  I wondered whether the architecture difference might be contributing to the instability.

Secondly, three physical machines should be enough to ensure reliability in all but the most catastrophic cases.  I know that MySQL recommends that data and mgmt nodes not run on the same hardware, but I set this consideration aside, believing that the potential stability improvements from a node rearrangement would outweigh the risk of losing a data node and a mgmt node simultaneously.

Thirdly, three machines (that are configured nearly identically) are easier to administer than five machines with two distinct configuration schemes.
)

This rearrangement lasted for less than 12 hours before it crashed the first time.  The crash profile looked about the same as before.  It appears that this changeover also had no effect.  This is effectively where I stand now.  At this point, I'm leaving the cluster off, and I've hot-rigged the system to use one api instance with myisam tables.  The system that is running mysqld is set up to claim all three external IPs, so the hot-rig is transparent to clients for now.

I might also mention that the three data/api machines were also running Linux-HA v2 to manage external IP takeover in the original semi-stable configuration (v5.0.19+v5.0.20, where mysqld instances would infrequently disconnect from the cluster).  HA was a little twitchy (the HAv2 release I had installed was an early beta), so I disabled it around the same time that (maybe a week before) I upgraded to 5.0.26.

Other than HA, the machines run a lighttpd/ruby web stack and almost nothing else.  They run CentOS (v4.3?), a rebuild of RHEL.  The hardware on each is a minimally configured Dell PowerEdge 1850 (none of the fancy bells and whistles).  The transport is switched 1Gb Ethernet.

Node numbers when the system failed in the new configuration:

  node_type 10.0.0.1 10.0.0.2 10.0.0.3
       mgmt        1        2        3
       data        4        5        6
     mysqld        7        8        9
 api_unused       10       11       12
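
In config.ini terms, that layout corresponds to sections roughly like these (a sketch only; the file I actually ran is attached to this report):
====>
# Per host: one mgmt node, one data node, and two API slots (10.0.0.1 shown).
[ndb_mgmd]
Id=1
HostName=10.0.0.1

[ndbd]
Id=4
HostName=10.0.0.1

[mysqld]
Id=7
HostName=10.0.0.1

[mysqld]
Id=10
HostName=10.0.0.1

# ...plus the matching sections for 10.0.0.2 (ids 2, 5, 8, 11)
# and 10.0.0.3 (ids 3, 6, 9, 12).
<====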

How to repeat:
The description gives first-order coverage of how I produced the bug.

There is lots more that could be said.  What do you need?
[19 Dec 2006 22:20] Jonas Oreland
Hi,

Sorry to hear about all your problems.
Did you have any problems before the upgrade?

To investigate the problem we need all of the following:
* cluster logs (ndb_*_cluster.log*)
* error logs (ndb_*_error.log)
* trace files (ndb_*_trace.log.*)
* config.ini

(these files can be gathered up by using the ndb_error_reporter script)
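
For example (the script reads the host list from config.ini, copies the log and trace files from each node with scp, and bundles them into a single archive):
====>
shell> ndb_error_reporter /path/to/config.ini
<====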

/Jonas
[19 Dec 2006 22:25] jim shnider
root config

Attachment: my.cnf (application/octet-stream, text), 348 bytes.

[19 Dec 2006 22:25] jim shnider
ndb config

Attachment: config.ini (application/octet-stream, text), 685 bytes.

[20 Dec 2006 10:24] Jonas Oreland
Hi, 

I think the "unstable" part comes from http://bugs.mysql.com/bug.php?id=21535
which was fixed in 5.0.29 :-(

I think the restart problem comes from http://bugs.mysql.com/bug.php?id=22696
which is still not fixed.

/Jonas
[21 Dec 2006 15:42] jim shnider
Based on mysql's current release policies, when should the community expect access to 5.0.29 binary packages?
[29 Dec 2006 22:50] jim shnider
I've built 5.0.32 from the 15 December 2006 nightly available here: http://downloads.mysql.com/snapshots/mysql-5.0/mysql-5.0.32-nightly-20061215.tar.gz

I'm running with the same configuration, and I'm still experiencing instability in the data nodes.  The failure profile is very similar if not the same.

I got the new build up and running on 22 Dec in the early afternoon.  The data nodes went down at about noon on the 23rd.

(I'll upload the logs in a few minutes.)

Is there some feature I need to explicitly enable to get the benefit of the fix in 5.0.29?  Should I try moving up to bleeding-edge 5.0.34?
[4 Jan 2007 22:06] jim shnider
I tried upgrading to the intel-compiled 64-bit version of 5.1.14.  After getting it set up, everything seemed to be working until about 18 hrs later.  I logged in to investigate why my web app was reporting errors.  One interpretation of what I observed is that I may have logged in during a cascade failure of the kind observed with the other versions of mysql.

After logging in, the first thing I tried was selecting from an ndb table.  I got:
====>
ERROR 1296 (HY000): Got error 4009 'Cluster Failure' from NDBCLUSTER
<====

Then I looked at ndb_mgm, ran show, and saw:
====>
root@serv-1 ~> ndb_mgm
-- NDB Cluster -- Management Client --
ndb_mgm> show
Connected to Management Server at: 10.0.0.1:1186
Cluster Configuration
---------------------
[ndbd(NDB)]     3 node(s)
id=4    @10.0.0.1  (Version: 5.1.14, Nodegroup: 0, Master)
id=5    @10.0.0.2  (Version: 5.1.14, Nodegroup: 0)
id=6 (not connected, accepting connect from 10.0.0.3)

[ndb_mgmd(MGM)] 3 node(s)
id=1    @10.0.0.1  (Version: 5.1.14)
id=2    @10.0.0.2  (Version: 5.1.14)
id=3    @10.0.0.3  (Version: 5.1.14)

[mysqld(API)]   6 node(s)
id=7 (not connected, accepting connect from 10.0.0.1)
id=8 (not connected, accepting connect from 10.0.0.2)
id=9 (not connected, accepting connect from 10.0.0.3)
id=10 (not connected, accepting connect from 10.0.0.1)
id=11 (not connected, accepting connect from 10.0.0.2)
id=12 (not connected, accepting connect from 10.0.0.3)

ndb_mgm> exit
<====

It was at this point that I gathered the first set of cluster logs.

After gathering the logs, I looked back at ndb_mgm and noticed that node 5 had also died.  I immediately started both 5 and 6 to prevent all data nodes from going down.  After a minute or two in phase 4, they both came up, seemingly OK.

Then I tried to restart the api nodes so they would reconnect to the cluster.  I don't have a solid record of events at this point, but I think node 8 reconnected itself without prompting.  I had to restart node 7 twice to make it show up in ndb_mgm.

Before dealing with node 7, I tried to deal with node 9, but 10.0.0.3 seemed to freeze up.  It would respond to pings, but not to web or ssh connections.  After ~5-10mins, it came back, so I checked '/var/log/messages'.  I found the following excerpt there:
====>
[redacted for length - see 'readme' in tarball]
<====

Note that the kernel eventually killed mysqld.  While 10.0.0.3 was inaccessible, I also discovered that 10.0.0.2 was inaccessible.  Later, I went to look at its system messages.  A similar pattern appears there, except that more mysqld processes and a sendmail process are killed:
====>
[redacted for length - see 'readme' in tarball]
<====

After getting everything up and visible on the ndb_mgm console, I imported data from a previous run of mysql cluster.  When the web app (ruby on rails) connected to mysql (the api nodes 8 and 9) on .2 or .3, accesses to some of the tables reported:
====>
Mysql::Error: #HY000Got error 241 'Invalid schema object version' from NDBCLUSTER: SHOW FIELDS FROM [table_name]
<====

I restarted nodes 8 and 9 one last time, and everything was working happily again.  This looks like a different, unrelated error.  Hence the second set of log files, captured after the cluster was back up and running.
[5 Jan 2007 22:41] jim shnider
Since I last posted and uploaded logs, the system crashed again.  I started the data nodes back up just to get my web app working for a short time.  After that, I brought the cluster down and captured and wrapped up the logs again.  In case it helps, I'll upload that data.

Because of the continued instability, I've switched back to the hot-rigged setup mentioned above.  (One server, 10.0.0.1, running myisam tables, answering all three IP addresses.)

Someone I've talked to about my mysql-cluster problems gave me an idea.  This person mentioned that some bug reports indicate that systems with odd numbers of replicas are unstable.  To test this idea, I've set up a two-replica cluster with the other two machines (.2 and .3), and I've loaded up a backup of my databases.

Soon, I will see if the two-replica version also suffers from the same instability.
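
Structurally, the only change from the three-replica setup is along these lines (a sketch; the node ids here are placeholders rather than the exact ones in my test config):
====>
[ndbd default]
NoOfReplicas=2

[ndbd]
Id=2
HostName=10.0.0.2

[ndbd]
Id=3
HostName=10.0.0.3
<====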
[7 Jan 2007 14:33] jim shnider
A similar kind of data-node instability has occurred on the test setup with 2 nodes.  This time, the setup lasted about 15 hrs before the data nodes shut down due to some kind of data partitioning failure.

From the data-node's error log:
====>
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
<====

Should I be looking into automating restart on my data nodes?  If that is the solution, why doesn't the management node perform a restart instead of a shutdown?  To maintain the high reliability expected of mysql-cluster, would it make sense to handle an error that requires a restart automatically in some way?  

(One possibility might be a companion or angel process that would perform this function under certain circumstances.  For example, if ndbd were to shut down when the user or mgmt node has not specifically requested that ndbd stay down permanently, then the angel would restart ndbd.  Maybe, in some cases, it would even be valuable if ndbd could pass a restart-wait interval to the angel before shutting down.)
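
(To illustrate, even a trivial wrapper along these lines would cover my case.  This is purely a sketch; the flag file and the interval are made up:)
====>
#!/bin/sh
# Hypothetical "angel" wrapper around ndbd (sketch only; the flag file path and
# the restart-wait interval are invented for illustration).
while true; do
    ndbd --nodaemon                                    # run in the foreground so the exit is visible
    [ -f /var/lib/mysql-cluster/keep-down ] && break   # operator asked for a permanent stop
    sleep 30                                           # crude restart-wait interval
done
<====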

I'll upload the logs from the 2-node setup in a few moments.
[24 Mar 2007 3:07] Valeriy Kravchuk
Please, try to repeat with a newer version, 5.1.6 and/or 5.0.37, and inform about the results.
[11 Apr 2007 18:57] jim shnider
I'm sorry it took me so long to run this trial for you guys.

I assume you meant v5.1.16.  Once I found time to get that going, it was down again in ~14hrs.

I did not rigorously compare the new logs to old ones to prove to myself that the errors are related.  Circumstances suggest that this is the case, though.  (Time-to-crash.  Certain elements of the cluster log files.)

Just let me know if there is anything more I can do.  Modify configs... more trials... install monitoring packages... filesystem manipulations....
[8 May 2007 9:27] Jonas Oreland
Hi Jim

two things:
1) in 5.1, 3-replica setups work poorly due to bug#18621
2) an oom/memory shortage will result in "bad" behavior of the cluster.
   Using LockPagesInMainMemory usually removes these problems.
   (http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-config-params-ndbd.html)
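   e.g. in config.ini (just a sketch; everything else omitted):
====>
[ndbd default]
LockPagesInMainMemory=1   # lock ndbd's memory so it is not swapped out under memory pressure
<====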

/jonas
[11 May 2007 18:21] jim shnider
I've been running these trials with only 2 nodes.  Does the 3-node instability (or some other instability) also apply to 2-node/2-replica configurations?

This test seemed to fail similarly to previous tests while using LockPagesInMainMemory=2.  I will restart the test with '=1' and let you know whether that makes any difference.

This time, I've posted the configs in the form they are being used now and the tail of the mysqld error logs, just in case any of this information helps.  (I noticed that mysqld was 'aware' of some of the trouble in the cluster this time around.)
[15 May 2007 19:05] jim shnider
It looks like a different kind of system failure this time.

When I logged in to check things out, both mysql nodes were down, and the mgm nodes were running but not responding to mgm client requests.  In fact, on '.3', 'show' and 'all status' caused the client to die with

{{{
Warning, event thread startup failed, degraded printouts as result, errno=115
Connected to Management Server at: 10.0.0.3:1186
Warning, event thread startup failed, degraded printouts as result, errno=115
Connected to Management Server at: 10.0.0.3:1186
Unable get status from management server
Segmentation fault
}}}

On '.2', requests did not crash the client outright, but caused it to hang after printing

{{{
Warning, event thread startup failed, degraded printouts as result, errno=115
Connected to Management Server at: 10.0.0.2:1186
}}}

ndbd on .2 had produced trace and error files, but had apparently restarted itself.

(At this point, I gathered 'somethingDifferent.tar.bz2'.)

In an attempt to determine which parts had failed and which parts were still running, I restarted both mgm nodes.  'show' reported that ndbd on .2 was not actually a member of the cluster, though its processes were still in the process list and claiming ~500 megs.

ndbd on .3 reconnected to the management nodes without any prodding.  Next, I brought the api nodes back up, and they reconnected to the partial cluster OK.

ndbd on .2 would not die on SIGTERM (but SIGKILL worked).  After the kill, ndbd on .2 started up without producing any further errors.

Should I create a new bug report with this information?
[15 May 2007 19:10] jim shnider
To be clear, I should note that I've changed LockPagesInMainMemory to '1' from the '2' in the configs I uploaded last.

The latest crash I reported occurred with this change to config.ini.
[16 Feb 2008 14:46] Valeriy Kravchuk
Please, check if you can get any similar crash with a newer version, 5.1.23-rc, and inform about the results.
[17 Mar 2008 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".