Bug #21151: Forced node shutdown caused by error 2305
Submitted: 19 Jul 2006 14:00    Modified: 28 Aug 2006 8:24
Reporter: Eugene Gorelik
Status: No Feedback             Impact on me: None
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: 5.0.22                 OS: Linux (Linux RHEL ES 4)
Assigned to:                    CPU Architecture: Any

[19 Jul 2006 14:00] Eugene Gorelik
Description:
We are running MySQL Cluster 5.0.22 on a 64-bit RHEL OS with 2 CPUs.

Our cluster consists of 2 data nodes and 1 management node.
Periodically our data nodes crash without any obvious reason, with the following errors:

Data node error log:

Time: Tuesday 11 July 2006 - 21:17:33
Status: Temporary error, restart node
Message: Arbitrator shutdown, please investigate error(s) on other node(s) (Arbitration error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 4556) 0x0000000a
Program: /opt/mysql/bin/ndbd
Pid: 2888
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.2
Version: Version 5.0.22
***EOM***

Data node output log:

2006-07-11 21:17:33 [ndbd] INFO     -- Error handler shutting down system
2006-07-11 21:17:34 [ndbd] INFO     -- Error handler shutdown completed - exiting
2006-07-11 21:17:34 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

This issue occurs on both data nodes at exactly the same time.
This is a brand-new cluster that is not yet being used in production, so this issue can't be caused by a performance hit.
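
For reference, a minimal config.ini for this topology might look roughly like the sketch below (the hostnames are illustrative assumptions, not values from the actual cluster):

  [ndbd default]
  NoOfReplicas=2          # the two data nodes form one node group holding two replicas

  [ndb_mgmd]
  HostName=mgmt-host      # management node; typically also acts as the arbitrator

  [ndbd]
  HostName=data-host-1    # data node (node 3 in the logs above)

  [ndbd]
  HostName=data-host-2    # second data node

  [mysqld]                # slot for a SQL node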
  

How to repeat:
Unknown
[19 Jul 2006 14:04] MySQL Verification Team
Changing to Cluster Category.
[19 Jul 2006 14:07] Eugene Gorelik
Trace log

Attachment: ndb_3_trace.log.2.gz (application/x-gzip, text), 38.83 KiB.

[22 Jul 2006 9:44] see wai seok
Hi,
I'm using RHEL4 with mysql-max-5.0.22-linux-i686-icc-glibc23.tar.gz.

I'd set up 1 mgmt node (node A), 2 NDB nodes (nodes B and C), and 1 mysql server (node D).

They are running, seemingly without much problem.

I run "/usr/local/mysql/bin/ndbd -d" on both of my NDB nodes. I can see two(2) copies of "ndbd -d" processes running on each of them.

When I do a stress test by unplugging the network cable of node B, after a few seconds one of the "ndbd -d" processes gets killed by itself (the other copy keeps running). When I plug the network cable back in, I see the message below on my mgmt node's ndb_mgm console:

2006-07-23 01:18:37 [MgmSrvr] ALERT    -- Node 2: Forced node shutdown completed. Initiated by signal 0. Caused by error 2305: 'Arbitrator shutdown, please investigate error(s) on other node(s)(Arbitration error). Temporary error, restart node'.

When this error appears, the other "ndbd -d" process on node B gets killed automatically too! This leaves node B unable to rejoin the cluster until I manually issue "ndbd -d" again.

I read through a lot of bug fixes; this does not seem to be fixed as of version 5.0.22. I would highly appreciate it if someone were able to rectify the problem. Thanks!!

regards,
rachel
[28 Jul 2006 7:51] Hartmut Holzgraefe
> When this error appears, the other "ndbd -d" process on
> node B gets killed automatically too! This leaves
> node B unable to rejoin the cluster
> until I manually issue "ndbd -d" again.

See http://dev.mysql.com/doc/refman/5.0/en/mysql-cluster-ndbd-definition.html#id3146375

  * StopOnError

  This parameter specifies whether an ndbd process should exit 
  or perform an automatic restart when an error condition is encountered.

  This feature is enabled by default.
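
  Disabling it in the [ndbd default] section of config.ini would look
  roughly like this (a sketch, assuming you want automatic restarts
  rather than exits):

    [ndbd default]
    # Default is 1 (true): ndbd exits on error and must be restarted by hand.
    # Setting 0 makes ndbd perform an automatic restart instead.
    StopOnError=0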
[28 Jul 2006 8:13] see wai seok
Yes, by disabling that parameter, it works!
But does this affect performance?
[28 Jul 2006 8:24] Hartmut Holzgraefe
As both nodes shut down at exactly the same time, we'll need the full set of logs from both data nodes and the management node to analyze this.
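
Something along these lines should gather the relevant files (the data-node DataDir matches the trace path in this report; the management-node path is an assumption, so adjust to your configuration):

  # On each data node (DataDir assumed to be /var/lib/mysql-cluster):
  tar czf ndb_node_logs.tar.gz \
      /var/lib/mysql-cluster/ndb_*_error.log \
      /var/lib/mysql-cluster/ndb_*_out.log \
      /var/lib/mysql-cluster/ndb_*_trace.log.*

  # On the management node (cluster log location is an assumption):
  tar czf ndb_mgm_logs.tar.gz /var/lib/mysql-cluster/ndb_*_cluster.log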
[28 Aug 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".