MySQL Bugs: #57347: Mysql cluster missed heartbeat

Bug #57347	Mysql cluster missed heartbeat
Submitted:	9 Oct 2010 5:39	Modified:	9 Mar 2011 13:43
Reporter:	Sebastian Stach	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S2 (Serious)
Version:	mysql-5.1-telco-7.1	OS:	Linux (Debian Lenny)
Assigned to:		CPU Architecture:	Any
Tags:	missed heartbeat, MySQL Cluster, mysql-5.1.44 ndb-7.1.9a, ndb

Description:
We have configured mysql cluster for 4 servers, 2 sql nodes and 2 data nodes. During last 2 weeks we get info in logs:

[MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 2
[MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 3
[MgmtSrvr] WARNING  -- Node 2: Node 3 missed heartbeat 4

Then node is disconnected:

[MgmtSrvr] ALERT    -- Node 2: Node 3 declared dead due to missed heartbeat
[MgmtSrvr] INFO     -- Node 2: Communication to Node 3 closed
[MgmtSrvr] ALERT    -- Node 2: Network partitioning - arbitration required
[MgmtSrvr] INFO     -- Node 2: President restarts arbitration thread [state=7]
[MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
[MgmtSrvr] ALERT    -- Node 2: Arbitration won - positive reply from node 1
[MgmtSrvr] ALERT    -- Node 2: Node 3 Disconnected
[MgmtSrvr] ALERT    -- Node 1: Node 3 Disconnected
[MgmtSrvr] INFO     -- Node 2: Started arbitrator node 1 [ticket=2cf40002225971e6]
[MgmtSrvr] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not f
orm a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
[MgmtSrvr] INFO     -- Node 2: Communication to Node 3 opened

Error on Node 3:
Current byte-offset of file-pointer is: 568

Time: Saturday 9 October 2010 - 06:58:36
Status: Temporary error, restart node
Message: Node lost connection to other nodes and can not form a unpartitioned cluster, please investigate if there are error(s) on other node(s) (Arbitrati
on error)
Error: 2305
Error data: Arbitrator decided to shutdown this node
Error object: QMGR (Line: 5532) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 10014
Version: mysql-5.1.44 ndb-7.1.3
Trace: /var/lib/mysql-cluster//ndb_3_trace.log.1
***EOM***

Logs from Node 3:

WARNING: timerHandlingLab now: 16605114140 sent: 16605114076 diff: 64
WARNING: timerHandlingLab now: 16724612002 sent: 16724611952 diff: 50
WARNING: timerHandlingLab now: 16829200769 sent: 16829200717 diff: 52
WARNING: timerHandlingLab now: 16860325144 sent: 16860325092 diff: 52
WARNING: timerHandlingLab now: 16905397127 sent: 16905397076 diff: 51
WARNING: timerHandlingLab now: 16918067732 sent: 16918067680 diff: 52
WARNING: timerHandlingLab now: 16930181333 sent: 16930181247 diff: 86
WARNING: timerHandlingLab now: 16952552962 sent: 16952552912 diff: 50
WARNING: timerHandlingLab now: 16988033281 sent: 16988033228 diff: 53
WARNING: timerHandlingLab now: 17002453359 sent: 17002453307 diff: 52
2010-10-09 06:58:36 [ndbd] INFO     -- findNeighbours from: 4419 old (left: 2 right: 2) new (65535 65535)
2010-10-09 06:58:36 [ndbd] INFO     -- Arbitrator decided to shutdown this node
2010-10-09 06:58:36 [ndbd] INFO     -- QMGR (Line: 5532) 0x0000000a
2010-10-09 06:58:36 [ndbd] INFO     -- Error handler shutting down system
2010-10-09 06:58:36 [ndbd] INFO     -- Error handler shutdown completed - exiting
2010-10-09 06:58:37 [ndbd] ALERT    -- Node 3: Forced node shutdown completed. Caused by error 2305: 'Node lost connection to other nodes and can not form
a unpartitioned cluster, please investigate if there are error(s) on other node(s)(Arbitration error). Temporary error, restart node'.
2010-10-09 07:13:00 [ndbd] INFO     -- Angel pid: 7020 ndb pid: 7021
NDBMT: non-mt
2010-10-09 07:13:00 [ndbd] WARNING  -- MaxNoOfExecutionThreads (4) > LockExecuteThreadToCPU count (1), this could cause contention
2010-10-09 07:13:00 [ndbd] INFO     -- NDB Cluster -- DB node 3
2010-10-09 07:13:00 [ndbd] INFO     -- mysql-5.1.44 ndb-7.1.3 --
2010-10-09 07:13:00 [ndbd] INFO     -- WatchDog timer is set to 6000 ms
2010-10-09 07:13:00 [ndbd] INFO     -- Ndbd_mem_manager::init(1) min: 6104Mb initial: 6124Mb

It happedend to times for sql and for datanode.

NDB Config:

[TCP DEFAULT]
SendBufferMemory=4M
ReceiveBufferMemory=4M

[NDB_MGMD DEFAULT]
PortNumber=1186
Datadir=/var/lib/mysql-cluster/

[NDB_MGMD]
Id=1
Hostname=192.168.21.41
LogDestination=FILE:filename=ndb_1_cluster.log,maxsize=10000000,maxfiles=6
ArbitrationRank=1

[NDBD DEFAULT]
NoOfReplicas=2
Datadir=/var/lib/mysql-cluster/
FileSystemPathDD=/var/lib/mysql-cluster/
DataMemory=5379M
IndexMemory=673M
LockPagesInMainMemory=1

MaxNoOfConcurrentOperations=100000

StringMemory=25
MaxNoOfTables=4096
MaxNoOfOrderedIndexes=2048
MaxNoOfUniqueHashIndexes=512
MaxNoOfAttributes=24576
MaxNoOfTriggers=14336
DiskCheckpointSpeedInRestart=100M
FragmentLogFileSize=256M
InitFragmentLogFiles=SPARSE
NoOfFragmentLogFiles=32
RedoBuffer=48M

TimeBetweenLocalCheckpoints=20
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=100

MemReportFrequency=30
BackupReportFrequency=10

LogLevelStartup=15
LogLevelShutdown=15
LogLevelCheckpoint=8
LogLevelNodeRestart=15

BackupMaxWriteSize=1M
BackupDataBufferSize=16M
BackupLogBufferSize=4M
BackupMemory=20M

TimeBetweenWatchdogCheckInitial=60000

TransactionInactiveTimeout=60000

TransactionDeadlockDetectionTimeout=5000

SharedGlobalMemory=20M
DiskPageBufferMemory=64M

MaxNoOfExecutionThreads=4

LongMessageBuffer=16M

BatchSizePerLocalScan=512

[NDBD]
Id=2
Hostname=192.168.21.43

LockExecuteThreadToCPU=3
LockMaintThreadsToCPU=2

[NDBD]
Id=3
Hostname=192.168.21.44

LockExecuteThreadToCPU=3
LockMaintThreadsToCPU=2

[MYSQLD DEFAULT]
BatchSize=512

[MYSQLD]
Id=6
Hostname=192.168.21.41

[MYSQLD]
Id=7
Hostname=192.168.21.42

How to repeat:
Dont know :(

ndb_3_trace

Attachment: ndb_3_trace.log.rar (application/octet-stream, text), 27.50 KiB.

I have this problem almost every week.
Right now i have disconnect and connect on sql nodes.

No missed heartbeat, this are entries from ndb_1_cluster.log: 

2011-02-24 10:33:21 [MgmtSrvr] ALERT -- Node 3: Node 7 Disconnected 
2011-02-24 10:33:21 [MgmtSrvr] INFO -- Node 3: Communication to Node 7 closed 
2011-02-24 10:33:21 [MgmtSrvr] INFO -- Node 2: Communication to Node 7 closed 
2011-02-24 10:33:21 [MgmtSrvr] ALERT -- Node 2: Node 7 Disconnected 
2011-02-24 10:33:23 [MgmtSrvr] INFO -- Node 2: Communication to Node 7 opened 
2011-02-24 10:33:23 [MgmtSrvr] INFO -- Node 2: Node 7 Connected 
2011-02-24 10:33:23 [MgmtSrvr] INFO -- Node 2: Node 7: API mysql-5.1.51 ndb-7.1.9 
2011-02-24 10:33:25 [MgmtSrvr] INFO -- Node 3: Communication to Node 7 opened 
2011-02-24 10:33:25 [MgmtSrvr] INFO -- Node 3: Node 7 Connected 
2011-02-24 10:33:25 [MgmtSrvr] INFO -- Node 3: Node 7: API mysql-5.1.51 ndb-7.1.9 

This are logs from node 7, this reconnection always took about 4s. 

110224 10:33:21 [ERROR] Got error 4028 when reading table './ps/aa_nodes' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/brds' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/brds' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/cats' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/cats' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/actions' 
110224 10:33:21 [Note] NDB Binlog: Node: 2, down, Subscriber bitmask 00 
110224 10:33:21 [Note] NDB Binlog: Node: 3, down, Subscriber bitmask 00 
110224 10:33:21 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 28285206/7. 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/cats' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/cats' 
110224 10:33:21 [ERROR] Got error 4010 when reading table './ps/cats' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/users' 
110224 10:33:21 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/categories_premium.frm' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions_status' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/cats' 
110224 10:33:21 [Note] NDB Binlog: cluster failure for ./mysql/ndb_apply_status at epoch 28285206/7. 
110224 10:33:21 [Note] Restarting Cluster Binlog 
110224 10:33:21 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/cats.frm' 
110224 10:33:21 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/users_places.frm' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/users' 
110224 10:33:21 [ERROR] Got error 157 when reading table './ps/actions' 
110224 10:33:21 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/cats.frm' 
110224 10:33:21 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in 
....... 

110224 10:33:22 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/cats_equipments.frm' 
110224 10:33:22 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/aa_nodes.frm' 

....... 

110224 10:33:23 [ERROR] /usr/local/mysql/bin/mysqld: Incorrect information in file: './ps/users.frm' 

....... 

110224 10:33:24 [Note] table './ps/actions' opened read only 
110224 10:33:24 [Note] table './ps/aa_nodes' opened read only 
110224 10:33:24 [Note] table './ps/actions' opened read only 
110224 10:33:24 [Note] table './ps/banners' opened read only 
110224 10:33:24 [Note] table './ps/actions_status' opened read only 
110224 10:33:24 [Note] table './ps/actions_status' opened read only 
110224 10:33:24 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema 
110224 10:33:25 [Note] NDB Binlog: logging ./mysql/ndb_schema (UPDATED,USE_WRITE) 
110224 10:33:25 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_apply_status 
110224 10:33:25 [Note] NDB Binlog: logging ./mysql/ndb_apply_status (UPDATED,USE_WRITE) 
110224 10:33:25 [Note] table './ps/actions_popup' opened read only 
110224 10:33:25 [Note] table './ps/users' opened read only 
110224 10:33:25 [Note] table './ps/cats' opened read only 
110224 10:33:26 [Note] table './ps/company_type_desc' opened read only 
110224 10:33:26 [Note] table './ps/company_type_desc' opened read only 
110224 10:33:26 [Note] table './ps/actions_status' opened read only 
110224 10:33:26 [Note] table './ps/actions_popup' opened read only 
2011-02-24 10:33:26 [NdbApi] INFO -- Flushing incomplete GCI:s < 28285211/3 
2011-02-24 10:33:26 [NdbApi] INFO -- Flushing incomplete GCI:s < 28285211/3 
110224 10:33:26 [Note] NDB Binlog: starting log at epoch 28285211/3 
110224 10:33:26 [Note] NDB Binlog: ndb tables writable 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 2, subscribe from node 6, Subscriber bitmask 040 
110224 10:33:26 [Note] NDB Binlog: Node: 3, subscribe from node 6, Subscriber bitmask 040

Hi, Sebastian.

I have just same problem as you, and, unfortunately, have not yet solved...

I would like to share this problem with you and am sure that it helps to resolve it.

Well, would you tell me whether you change default setting, and if any, please tell me each of that.
I come across the problem by default setting.

I assume that it causes from query that has many JOIN, because mysql cluster is not good at joining tables.

In addition, have you ever seen "Out of SendBufferMemory" error message before?
If you have, I appreciate how you came up with the solution.

Regards, 

Atsushi