Description:
I'm running 2 mysql nodes, 2 data nodes, 1 mgmt node all on CentOS 5.3.
About 100MB of data, most all in memory, only 1 on-disk table.
NDBD is really ndbmtd.
A NDBD node reported the following messages and shut down.
No unusual query activity was happening in the cluster at the time. Simple select() monitoring queries every couple of minutes, that's it.
Full logs around time of error attached.
NDBD node error log (ndb_3_error.log) reported:
Time: Monday 13 July 2009 - 04:05:02
Status: Temporary error, restart node
Message: System error, node killed during node restart by other node (Internal error, programming error or missing error message
, please report a bug)
Error: 2303
Error data: Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_out.log files, Thanks. Shutt
ing down node due to failed handling of GCP_SAVEREQ
Error object: DBLQH (Line: 20332) 0x0000000a
Program: /usr/sbin/ndbd
Pid: 22167
NDBD node log (ndb_3_out.lg) reported:
2009-07-13 04:02:11 [ndbd] INFO -- Watchdog: User time: 10135784 System time: 10546073
2009-07-13 04:02:11 [ndbd] WARNING -- Watchdog: Warning overslept 203 ms, expected 100 ms.
GCP 0 file: 12 state: 1 sync: 0 page: 161 word: 7785
GCP 1 file: 16 state: 1 sync: 0 page: 335 word: 2315
GCP 2 file: 32 state: 1 sync: 0 page: 304 word: 3815
GCP 3 file: 52 state: 1 sync: 0 page: 250 word: 213
2009-07-13 04:04:51 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=100
2009-07-13 04:05:02 [ndbd] INFO -- Watchdog: User time: 10145892 System time: 10553566
2009-07-13 04:05:02 [ndbd] INFO -- Watchdog: User time: 10145898 System time: 10553570
2009-07-13 04:05:02 [ndbd] WARNING -- Watchdog: Warning overslept 11176 ms, expected 100 ms.
2009-07-13 04:05:02 [ndbd] WARNING -- Ndb kernel thread 2 is stuck in: Job Handling elapsed=11277
2009-07-13 04:05:02 [ndbd] INFO -- Watchdog: User time: 10145898 System time: 10553570
2009-07-13 04:05:02 [ndbd] INFO -- Please report this as a bug. Provide as much info as possible, expecially all the ndb_*_o
ut.log files, Thanks. Shutting down node due to failed handling of GCP_SAVEREQ
2009-07-13 04:05:02 [ndbd] INFO -- DBLQH (Line: 20332) 0x0000000a
2009-07-13 04:05:02 [ndbd] INFO -- Error handler shutting down system
2009-07-13 04:05:02 [ndbd] INFO -- Error handler shutdown completed - exiting
2009-07-13 04:05:03 [ndbd] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node killed
during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary er
ror, restart node'.
MGM cluster log (ndb_1_cluster.log) reported:
2009-07-13 04:04:50 [MgmSrvr] INFO -- Node 2: Detected GCP stop(1)...sending kill to [SignalCounter: m_count=1 0000000000000
008]
2009-07-13 04:05:03 [MgmSrvr] ALERT -- Node 1: Node 3 Disconnected
2009-07-13 04:05:03 [MgmSrvr] ALERT -- Node 2: Node 3 Disconnected
2009-07-13 04:05:03 [MgmSrvr] ALERT -- Node 2: Network partitioning - arbitration required
2009-07-13 04:05:03 [MgmSrvr] INFO -- Node 2: President restarts arbitration thread [state=7]
2009-07-13 04:05:03 [MgmSrvr] INFO -- Node 2: Communication to Node 3 closed
2009-07-13 04:05:03 [MgmSrvr] ALERT -- Node 3: Forced node shutdown completed. Caused by error 2303: 'System error, node kill
ed during node restart by other node(Internal error, programming error or missing error message, please report a bug). Temporary
error, restart node'.
2009-07-13 04:05:03 [MgmSrvr] ALERT -- Node 2: Arbitration won - positive reply from node 1
2009-07-13 04:05:04 [MgmSrvr] INFO -- Node 2: Started arbitrator node 1 [ticket=62620004361ad876]
2009-07-13 04:05:18 [MgmSrvr] INFO -- Node 2: Communication to Node 3 opened
config.ini is:
[TCP DEFAULT]
SendBufferMemory=1M
ReceiveBufferMemory=1M
[NDBD DEFAULT]
# redundancy:
NoOfReplicas=2
# avoid swapping:
LockPagesInMainMemory=1
# Bypass FS cache:
ODirect=1
#
DataMemory=100MB
IndexMemory=32MB
#
MaxNoOfConcurrentTransactions=8092
MaxNoOfConcurrentOperations=65535
# 1.1 x MaxNoOfConcurrentOperations
MaxNoOfLocalOperations=72088
#
TransactionInactiveTimeout=5000
TransactionDeadlockDetectionTimeout=5000
#
HeartbeatIntervalDbApi=5000
HeartbeatIntervalDbDb=5000
#
TimeBetweenGlobalCheckpoints=1000
TimeBetweenEpochs=200
DiskCheckpointSpeed=10M
DiskCheckpointSpeedInRestart=100M
#
SchedulerSpinTimer=400
SchedulerExecutionTimer=100
RealTimeScheduler=1
#
DiskPageBufferMemory=32M
#
LogLevelStartup=5
LogLevelShutdown=5
LogLevelNodeRestart=5
LogLevelError=10
LogLevelCongestion=5
LogLevelInfo=5
MemReportFrequency=7200
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
DataDir=/var/lib/mysql-cluster # Directory for MGM node log files
#
# Managment Server
[NDB_MGMD]
Id=1
HostName=172.16.55.15 # the IP of THIS SERVER
DataDir=/var/lib/mysql-cluster # Directory for MGM node log files
# Storage Engines
[NDBD]
Id=2
HostName=172.16.55.13 # the IP of the FIRST SERVER
DataDir=/var/lib/mysql-cluster
# if more than 1 CPU
LockExecuteThreadToCPU=1
LockMaintThreadsToCPU=0
MaxNoOfExecutionThreads=2
[NDBD]
Id=3
HostName=172.16.55.14 # the IP of the SECOND SERVER
DataDir=/var/lib/mysql-cluster
# if more than 1 CPU
LockExecuteThreadToCPU=1
LockMaintThreadsToCPU=0
MaxNoOfExecutionThreads=2
# 2 MySQL Clients
# I personally leave this blank to allow rapid changes of the mysql clients;
# you can enter the hostnames of the above two servers here. I suggest you dont.
[MYSQLD]
Id=4
Hostname=172.16.55.16
[MYSQLD]
Id=5
Hostname=172.16.55.17
[API]
Id=99
[TCP]
NodeId1=2
NodeId2=3
HostName1=172.16.55.13
HostName2=172.16.55.14
How to repeat:
Haven't been able to reproduce yet.