Bug #41203: Data node crashes and its restart causes the other node to crash (error NR: setLcpActiveStatusEnd)
Submitted: 3 Dec 2008 14:19    Modified: 12 Nov 2009 10:47
Reporter: Jonathan Carter    Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: mysql-5.0    OS: Linux
Assigned to:    CPU Architecture: Any
Tags: cluster, crash, mysql-5.0.45, ndb

[3 Dec 2008 14:19] Jonathan Carter
Description:
I have had this setup running for about six months, but recently I have been seeing a strange problem, as follows:

Topology:
1 - Management 10.0.0.30
2 - datanode1 10.0.0.40
3 - datanode2 10.0.0.41

Symptom:
First, 10.0.0.40 drops out of communication, and the management client shows this:

id=2 (not connected, accepting connect from 10.0.0.40)
id=3 @10.0.0.41 (Version: 5.0.45, Nodegroup: 0, Master)

I then kill the ndbd process on 10.0.0.40 and get the following on the management node:

ndb_mgm> Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'. - Unknown error code: Unknown result: Unknown error code

Then both nodes show as not connected:

id=2 (not connected, accepting connect from 10.0.0.40)
id=3 (not connected, accepting connect from 10.0.0.41)

I start ndbd on 10.0.0.40 and 10.0.0.41 again (roughly as sketched below), and after about 20 minutes they both rejoin the cluster and all is well again.

Does anybody know what I can do to prevent this from happening?

Version: 5.0.45 on Red Hat 4
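
As an illustration only, here is a minimal sketch of the restart and monitoring commands described above, assuming a standard MySQL Cluster 5.0 installation in which the management server listens on 10.0.0.30:1186 (the connect string is inferred from the topology above, not copied from the report):

# On each data node (10.0.0.40 and 10.0.0.41), restart the data node process,
# pointing it at the management server.
ndbd -c 10.0.0.30:1186

# On the management node, watch the data nodes rejoin the cluster.
ndb_mgm -e "SHOW"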

Start log on 10.0.0.40:
2008-05-26 00:47:49 [ndbd] INFO -- Angel pid: 9117 ndb pid: 9118
2008-05-26 00:47:49 [ndbd] INFO -- NDB Cluster -- DB node 2
2008-05-26 00:47:49 [ndbd] INFO -- Version 5.0.45 --
2008-05-26 00:47:49 [ndbd] INFO -- Configuration fetched at 10.0.0.30 port 1186
2008-05-26 00:47:49 [ndbd] INFO -- Start initiated (version 5.0.45)

Start log on 10.0.0.41:
2008-05-26 00:52:17 [ndbd] INFO -- Error handler shutting down system
2008-05-26 00:52:18 [ndbd] INFO -- Error handler shutdown completed - exiting
2008-05-26 00:52:18 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2008-05-26 00:54:35 [ndbd] INFO -- Angel pid: 10901 ndb pid: 10902
2008-05-26 00:54:35 [ndbd] INFO -- NDB Cluster -- DB node 3
2008-05-26 00:54:35 [ndbd] INFO -- Version 5.0.45 --
2008-05-26 00:54:35 [ndbd] INFO -- Configuration fetched at 10.0.0.30 port 1186
2008-05-26 00:54:35 [ndbd] INFO -- Start initiated (version 5.0.45)
2008-05-26 01:15:15 [ndbd] INFO -- NR: setLcpActiveStatusEnd - !m_participatingLQH
2008-05-26 01:21:38 [ndbd] INFO -- NR: setLcpActiveStatusEnd - m_participatingLQH 

How to repeat:
Make a large number of insert statements, one after the other; in my case, 15,000.
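
For illustration only, a minimal sketch of such a workload, assuming a hypothetical NDB table test.t1 and a local SQL node; the table, schema, and credentials are assumptions and are not taken from the report:

# Hypothetical reproduction workload: 15,000 single-row inserts in a row.
mysql -h 127.0.0.1 -u root -e "CREATE TABLE IF NOT EXISTS test.t1 (id INT PRIMARY KEY, val VARCHAR(32)) ENGINE=NDBCLUSTER"
for i in $(seq 1 15000); do
  mysql -h 127.0.0.1 -u root -e "INSERT INTO test.t1 VALUES ($i, 'row$i')"
done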
[11 Dec 2008 13:06] Martin Skold
Has this bug been verified on newer versions?
Try ndb-6.2 for example.
[11 Dec 2008 13:16] Jonathan Carter
No, this is a production environment and I do not have enough hardware to set up a parallel environment to test on.

Also, even if I could set up such a test rig, I still cannot back up the cluster, so there is no way for me to get the data over to a 6.x test rig at present.

jc
[11 Dec 2008 13:33] Jonathan Carter
Sorry, ignore the last comment; it was meant for another bug report.

The answer is still no, but only because I do not have a duplicate production architecture to reproduce this on.

I have to rely on your test rigs.

jc
[25 Mar 2009 11:13] Jonathan Miller
What is the number of replicas set to? Can you include your configuration?
[25 Apr 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[29 Sep 2009 15:00] Bertrand Rault
We have experienced a similar incident; I was wondering if any fix/workaround had been identified.
[30 Sep 2009 8:17] Jonathan Carter
No, I have not received any fix for it.

However, I stopped using the online backup facility, and since then the cluster has been up and running continuously.

Jonathan
[30 Sep 2009 8:23] Jonathan Carter
Sorry,

I should add that, because I cannot back up via the online backup facility and I do not have a duplicate set of servers with similar RAM, I cannot really try out the upgrade-based solutions, so I am somewhat stuck here.

jc
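
For context, the "online backup facility" mentioned above presumably refers to the native NDB backup started from the management client; the exact invocation is not shown in the report, so the following is only a sketch of the usual command:

# Native online backup, started from the management client (assumed invocation).
ndb_mgm -e "START BACKUP"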
[5 Oct 2009 13:40] Jørgen Austvik
Hi!

What is the number of replicas set to?
Can you please include your configuration?
[5 Oct 2009 14:01] Jonathan Carter
Here are the relevant parts of my config:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=2350M
IndexMemory=512M
MaxNoOfTables=1024
MaxNoOfAttributes=25000
MaxNoOfOpenFiles=100

[NDBD]
Id=2
HostName=10.0.0.40              # the IP of the FIRST SERVER
DataDir=/opt/mysql-cluster
MaxNoOfAttributes=20000
MaxNoOfOrderedIndexes=256
MaxNoOfUniqueHashIndexes=128
MaxNoOfConcurrentOperations=64000
MaxNoOfOpenFiles=100
DataMemory=2250M
IndexMemory=128M
MaxNoOfTables=1024
TimeBetweenLocalCheckpoints=6
NoOfFragmentLogFiles=32

[NDBD]
Id=3
HostName=10.0.0.41              # the IP of the SECOND SERVER
DataDir=/var/lib/mysql-cluster
MaxNoOfAttributes=20000
MaxNoOfOrderedIndexes=256
MaxNoOfUniqueHashIndexes=128
MaxNoOfConcurrentOperations=64000
DataMemory=2250M
IndexMemory=128M
MaxNoOfTables=1024
MaxNoOfOpenFiles=100
TimeBetweenLocalCheckpoints=6
NoOfFragmentLogFiles=32
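
With NoOfReplicas=2 and only two data nodes, both nodes form a single node group, and the management node acts as arbitrator when the data nodes lose contact with each other, which matches the error 2305 arbitration failure above. The excerpt does not include the [NDB_MGMD] section, so the following arbitration-related sketch is an assumption about how such a section would typically look, not part of the reporter's config:

[NDB_MGMD]
Id=1
HostName=10.0.0.30              # the IP of the MANAGEMENT SERVER (assumed)
ArbitrationRank=1               # default; makes this node eligible as arbitrator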
[12 Oct 2009 10:47] Jonas Oreland
Comments:
1) Try a newer version (e.g. 6.3.27a or 7.0.8a).
2) Without error/trace logs we cannot make any progress.

Setting status to Need Feedback.
[13 Nov 2009 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".