Bug #39741 ndbd process restarts while recovering from loss of network connectivity
Submitted: 29 Sep 2008 23:29  Modified: 30 Sep 2008 10:26
Reporter: Praveen Sooryanarayana
Status: Not a Bug
Category: MySQL Cluster: Cluster (NDB) storage engine  Severity: S2 (Serious)
Version: mysql-5.1.23 ndb-6.2.15  OS: Other (CentOS)
Assigned to: (none)  CPU Architecture: Any

[29 Sep 2008 23:29] Praveen Sooryanarayana
Description:
http://dev.mysql.com/tech-resources/articles/mysql-cluster-for-two-servers.html

I have a 2 server setup with the same config.ini on both the servers.

[ndbd default]
NoOfReplicas=2
DataMemory=80M
IndexMemory=18M
StopOnError=0

[tcp default]
portnumber=2202

[ndb_mgmd]
hostname=x.x.x.28
datadir=/var/lib/mysql-cluster
ArbitrationRank=0
[ndb_mgmd]
hostname=x.x.x.44
datadir=/var/lib/mysql-cluster
ArbitrationRank=0

[ndbd]
hostname=x.x.x.28
datadir=/var/lib/mysql/data
[ndbd]
hostname=x.x.x.44
datadir=/var/lib/mysql/data

[mysqld]
hostname=x.x.x.28
ArbitrationRank=0
[mysqld]
hostname=x.x.x.44
ArbitrationRank=0
[api]
hostname=x.x.x.28
ArbitrationRank=0
[api]
hostname=x.x.x.44
ArbitrationRank=0

my.cnf on the servers:

[mysqld]
ndbcluster
ndb-connectstring=x.x.x.28

[mysql_cluster]
ndb-connectstring=x.x.x.28

AND

[mysqld]
ndbcluster
ndb-connectstring=x.x.x.44

[mysql_cluster]
ndb-connectstring=x.x.x.44

So I have 2 management nodes, 2 data nodes, and 2 SQL nodes. When connectivity between the two servers is lost, they continue to run independently, but when connectivity is restored, one of the data nodes is restarted.

The error log shows:

Current byte-offset of file-pointer is: 568

Time: Monday 29 September 2008 - 15:57:18
Status: Temporary error, restart node
Message: Partitioned cluster detected. Please check if cluster is already running (Arbitration error)
Error: 2307
Error data: We(4) have been declared dead by 3 reason: Our cluster: 0000000000000010(8)
Error object: QMGR (Line: 3237) 0x0000000c
Program: ndbd
Pid: 22471
Trace: /var/lib/mysql/data/ndb_4_trace.log.1
Version: mysql-5.1.23 ndb-6.2.15
***EOM***

The out log shows:

2008-09-29 15:53:29 [ndbd] INFO     -- Angel pid: 22470 ndb pid: 22471
2008-09-29 15:53:29 [ndbd] INFO     -- NDB Cluster -- DB node 4
2008-09-29 15:53:29 [ndbd] INFO     -- mysql-5.1.23 ndb-6.2.15 --
2008-09-29 15:53:29 [ndbd] INFO     -- Configuration fetched at 172.19.30.44 port 1186
2008-09-29 15:53:29 [ndbd] INFO     -- Start initiated (mysql-5.1.23 ndb-6.2.15)
2008-09-29 15:53:29 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 80Mb initial: 100Mb
Adding 100Mb to ZONE_LO (1,3199)
WOPool::init(61, 9)
RWPool::init(22, 13)
RWPool::init(42, 18)
RWPool::init(62, 13)
RWPool::init(c2, 18)
RWPool::init(e2, 15)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 51)
WOPool::init(21, 6)
m_active_buckets.set(1)
alloc_chunk(8 16) -
start_resend(0, empty bucket -> active
check StartPartialTimeout, node 3 thinks 3 is president, I think president is: 4
2008-09-29 15:57:18 [ndbd] INFO     -- We(4) have been declared dead by 3 reason: Our cluster: 0000000000000010(8)
2008-09-29 15:57:18 [ndbd] INFO     -- QMGR (Line: 3237) 0x0000000c
2008-09-29 15:57:18 [ndbd] INFO     -- Error handler restarting system
2008-09-29 15:57:19 [ndbd] INFO     -- Error handler shutdown completed - exiting
2008-09-29 15:57:19 [ndbd] ALERT    -- Node 4: Forced node shutdown completed. Caused by error 2307: 'Partitioned cluster detected. Please check if cluster is already running(Arbitration error). Temporary error, restart node'.
2008-09-29 15:57:19 [ndbd] INFO     -- Ndb has terminated (pid 22471) restarting
2008-09-29 15:57:20 [ndbd] INFO     -- Angel pid: 22470 ndb pid: 22667
2008-09-29 15:57:20 [ndbd] INFO     -- NDB Cluster -- DB node 4
2008-09-29 15:57:20 [ndbd] INFO     -- mysql-5.1.23 ndb-6.2.15 --
2008-09-29 15:57:20 [ndbd] INFO     -- Configuration fetched at 172.19.30.44 port 1186
2008-09-29 15:57:20 [ndbd] INFO     -- Start initiated (mysql-5.1.23 ndb-6.2.15)
2008-09-29 15:57:20 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 80Mb initial: 100Mb
Adding 100Mb to ZONE_LO (1,3199)
WOPool::init(61, 9)
RWPool::init(22, 13)
RWPool::init(42, 18)
RWPool::init(62, 13)
RWPool::init(c2, 18)
RWPool::init(e2, 15)
WOPool::init(41, 8)
RWPool::init(82, 12)
RWPool::init(a2, 51)
WOPool::init(21, 6)
RESTORE table: 0 1037 rows applied
RESTORE table: 0 1011 rows applied
RESTORE table: 1 0 rows applied
RESTORE table: 1 0 rows applied
2008-09-29 15:57:28 [ndbd] INFO     -- NR: setLcpActiveStatusEnd - m_participatingLQH
alloc_chunk(27 16) -
114/0 (113/4294967295) switchover complete bucket 1 state: 1starting

How to repeat:
With the setup in the description, break connectivity by unplugging the network cable from one of the servers; after a few minutes, restore connectivity.

Suggested fix:
The data node should not be restarted.
[29 Sep 2008 23:30] Praveen Sooryanarayana
Please ignore the first line (the URL) of the description.
[30 Sep 2008 10:26] Hartmut Holzgraefe
Is there a special reason for setting ArbitrationRank=0 on all management and mysqld nodes?

Without any node that can act as arbitrator, the split halves of the cluster can't decide which one has a majority, so both keep running. They only detect that a split-brain situation occurred when they rejoin and find that their states are inconsistent.

So this is not a bug, but expected behavior with the configuration you are using.
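
For reference, a minimal sketch of the change this comment implies (using the host layout from the reporter's config.ini; not an official recommendation from this report): give at least one management node a non-zero ArbitrationRank so it can act as arbitrator. Simply removing the ArbitrationRank=0 lines should have the same effect, since the default rank for management nodes is 1.

[ndb_mgmd]
hostname=x.x.x.28
datadir=/var/lib/mysql-cluster
# Non-zero rank lets this node act as arbitrator (1 = high priority)
ArbitrationRank=1
[ndb_mgmd]
hostname=x.x.x.44
datadir=/var/lib/mysql-cluster
ArbitrationRank=1

Note that with only two hosts, an arbitrator running on the same machine as a data node cannot protect against that machine being cut off; running the arbitrator (management node) on a third host is the usual way to make arbitration effective.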