Bug #41203: Data node crashes and its restart causes the other node to crash (error NR: setLcpActiveStatusEnd)
Submitted: 3 Dec 2008 14:19    Modified: 12 Nov 2009 10:47
Reporter: Jonathan Carter    Status: No Feedback
Category: MySQL Cluster: Cluster (NDB) storage engine    Severity: S2 (Serious)
Version: mysql-5.0    OS: Linux
Assigned to:    CPU Architecture: Any
Tags: cluster, crash, mysql-5.0.45, ndb

[3 Dec 2008 14:19] Jonathan Carter
Description:
I have had this setup running for about six months, but recently I have been seeing a strange problem, as follows:

Topology:
1 - Management 10.0.0.30
2 - datanode1 10.0.0.40
3 - datanode2 10.0.0.41

Symptom:
First, 10.0.0.40 drops out of communication, and the management client shows this:

id=2 (not connected, accepting connect from 10.0.0.40)
id=3 @10.0.0.41 (Version: 5.0.45, Nodegroup: 0, Master)

I then kill the ndbd process on 10.0.0.40 and get the following on the management node:

ndb_mgm> Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'. - Unknown error code: Unknown result: Unknown error code

Then both nodes show as not connected:

id=2 (not connected, accepting connect from 10.0.0.40)
id=3 (not connected, accepting connect from 10.0.0.41)

I start ndbd on 10.0.0.40 and 10.0.0.41 again (roughly as sketched below), and after about 20 minutes they both rejoin the cluster and all is well again.

Does anybody know what I can do to prevent this from happening?

Version: 5.0.45 on Red Hat 4
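
As an illustration only, here is a minimal sketch of the restart and monitoring commands described above, assuming a standard MySQL Cluster 5.0 installation in which the management server listens on 10.0.0.30:1186 (the connect string is inferred from the topology above, not copied from the report):

# On each data node (10.0.0.40 and 10.0.0.41), restart the data node process,
# pointing it at the management server.
ndbd -c 10.0.0.30:1186

# On the management node, watch the data nodes rejoin the cluster.
ndb_mgm -e "SHOW"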

Start log on 10.0.0.40:
2008-05-26 00:47:49 [ndbd] INFO -- Angel pid: 9117 ndb pid: 9118
2008-05-26 00:47:49 [ndbd] INFO -- NDB Cluster -- DB node 2
2008-05-26 00:47:49 [ndbd] INFO -- Version 5.0.45 --
2008-05-26 00:47:49 [ndbd] INFO -- Configuration fetched at 10.0.0.30 port 1186
2008-05-26 00:47:49 [ndbd] INFO -- Start initiated (version 5.0.45)

Start log on 10.0.0.41:
2008-05-26 00:52:17 [ndbd] INFO -- Error handler shutting down system
2008-05-26 00:52:18 [ndbd] INFO -- Error handler shutdown completed - exiting
2008-05-26 00:52:18 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2008-05-26 00:54:35 [ndbd] INFO -- Angel pid: 10901 ndb pid: 10902
2008-05-26 00:54:35 [ndbd] INFO -- NDB Cluster -- DB node 3
2008-05-26 00:54:35 [ndbd] INFO -- Version 5.0.45 --
2008-05-26 00:54:35 [ndbd] INFO -- Configuration fetched at 10.0.0.30 port 1186
2008-05-26 00:54:35 [ndbd] INFO -- Start initiated (version 5.0.45)
2008-05-26 01:15:15 [ndbd] INFO -- NR: setLcpActiveStatusEnd - !m_participatingLQH
2008-05-26 01:21:38 [ndbd] INFO -- NR: setLcpActiveStatusEnd - m_participatingLQH 

How to repeat:
Make a large number of insert statements, one after the other; in my case, 15,000.
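
For illustration only, a minimal sketch of such a workload, assuming a hypothetical NDB table test.t1 and a local SQL node; the table, schema, and credentials are assumptions and are not taken from the report:

# Hypothetical reproduction workload: 15,000 single-row inserts in a row.
mysql -h 127.0.0.1 -u root -e "CREATE TABLE IF NOT EXISTS test.t1 (id INT PRIMARY KEY, val VARCHAR(32)) ENGINE=NDBCLUSTER"
for i in $(seq 1 15000); do
  mysql -h 127.0.0.1 -u root -e "INSERT INTO test.t1 VALUES ($i, 'row$i')"
done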
[11 Dec 2008 13:06] Martin Skold
Has this bug been verified on newer versions?
Try ndb-6.2 for example.
[11 Dec 2008 13:16] Jonathan Carter
No, this is a production environment and I do not have enough hardware to set up a parallel environment to test on.

Also, even if I could set up such a test rig, I still cannot back up the cluster, so there is no way for me to get the data over to a 6.x test rig at present.

jc
[11 Dec 2008 13:33] Jonathan Carter
Sorry, ignore the last comment; it was meant for another bug report.

The answer is still no, but only because I do not have a duplicate production architecture to reproduce this on.

I have to rely on your test rigs.

jc
[25 Mar 2009 11:13] Jonathan Miller
What is the number of replicas set to? Can you include your configuration?
[25 Apr 2009 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[29 Sep 2009 15:00] Bertrand Rault
We have experienced a similar incident; I was wondering if any fix/workaround had been identified.
[30 Sep 2009 8:17] Jonathan Carter
No, I have not received any fix for it.

However, I stopped using the online backup facility, and since then the cluster has been up and running continuously.

Jonathan
[30 Sep 2009 8:23] Jonathan Carter
Sorry,

I should add that, because I cannot back up via the online backup facility and I do not have a duplicate set of servers with similar RAM, I cannot really try out the upgrade-based solutions, so I am somewhat stuck here.

jc
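
For context, the "online backup facility" mentioned above presumably refers to the native NDB backup started from the management client; the exact invocation is not shown in the report, so the following is only a sketch of the usual command:

# Native online backup, started from the management client (assumed invocation).
ndb_mgm -e "START BACKUP"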
[5 Oct 2009 13:40] Jørgen Austvik
Hi!

What is the number of replicas set to?
Can you please include your configuration?
[5 Oct 2009 14:01] Jonathan Carter
Here are the relevant parts of my config:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=2350M
IndexMemory=512M
MaxNoOfTables=1024
MaxNoOfAttributes=25000
MaxNoOfOpenFiles=100

[NDBD]
Id=2
HostName=10.0.0.40              # the IP of the FIRST SERVER
DataDir=/opt/mysql-cluster
MaxNoOfAttributes=20000
MaxNoOfOrderedIndexes=256
MaxNoOfUniqueHashIndexes=128
MaxNoOfConcurrentOperations=64000
MaxNoOfOpenFiles=100
DataMemory=2250M
IndexMemory=128M
MaxNoOfTables=1024
TimeBetweenLocalCheckpoints=6
NoOfFragmentLogFiles=32

[NDBD]
Id=3
HostName=10.0.0.41              # the IP of the SECOND SERVER
DataDir=/var/lib/mysql-cluster
MaxNoOfAttributes=20000
MaxNoOfOrderedIndexes=256
MaxNoOfUniqueHashIndexes=128
MaxNoOfConcurrentOperations=64000
DataMemory=2250M
IndexMemory=128M
MaxNoOfTables=1024
MaxNoOfOpenFiles=100
TimeBetweenLocalCheckpoints=6
NoOfFragmentLogFiles=32
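
With NoOfReplicas=2 and only two data nodes, both nodes form a single node group, and the management node acts as arbitrator when the data nodes lose contact with each other, which matches the error 2305 arbitration failure above. The excerpt does not include the [NDB_MGMD] section, so the following arbitration-related sketch is an assumption about how such a section would typically look, not part of the reporter's config:

[NDB_MGMD]
Id=1
HostName=10.0.0.30              # the IP of the MANAGEMENT SERVER (assumed)
ArbitrationRank=1               # default; makes this node eligible as arbitrator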
[12 Oct 2009 10:47] Jonas Oreland
Comments:
1) Try a newer version (e.g. 6.3.27a or 7.0.8a).
2) Without error/trace logs we cannot make any progress.

Setting status to Need Feedback.
[13 Nov 2009 0:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".