Description:
We have the following configuration:
2 servers each running 1 mysql and 1 ndb node
2 servers each running an ndb_mgmd node
The ndb servers are dual core xeon processors with 8GB of ram each.
one of the ndb nodes is consistently going down around 2 or 3 times a day with the following in the logs:
Time: Wednesday 19 December 2007 - 00:40:59
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 10608
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.2
Version: Version 5.0.37
***EOM***
Time: Wednesday 19 December 2007 - 03:34:37
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 3973
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.3
Version: Version 5.0.37
***EOM***
Time: Wednesday 19 December 2007 - 05:16:14
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 13892
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.4
Version: Version 5.0.37
***EOM***
Time: Wednesday 19 December 2007 - 19:16:50
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 20201
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.5
Version: Version 5.0.37
***EOM***
Time: Wednesday 19 December 2007 - 20:29:21
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 2699
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.6
Version: Version 5.0.37
***EOM***
Time: Wednesday 19 December 2007 - 20:30:53
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: ArrayPool<T>::getPtr
Error object: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 360 (block: BACKUP)
Program: /usr/sbin/ndbd
Pid: 6618
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.7
Version: Version 5.0.37
***EOM***
We are using the following config.ini file:
[NDBD DEFAULT]
NoOfReplicas=2
datadir=/var/lib/mysql-cluster
DataMemory=2400M
IndexMemory=150M
MaxNoOfAttributes=2000
MaxNoOfTables=128
MaxNoOfOrderedIndexes=400
MaxNoOfUniqueHashIndexes=400
MaxNoOfTriggers=1500
NoOfDiskPagesToDiskAfterRestartTUP=94
NoOfDiskPagesToDiskAfterRestartACC=13
NoOfDiskPagesToDiskDuringRestartTUP=94
NoOfDiskPagesToDiskDuringRestartACC=13
NoOfFragmentLogFiles=9
MaxNoOfConcurrentTransactions=10000
MaxNoOfConcurrentOperations=100000
StopOnError=0
LogLevelStartup=15
LogLevelNodeRestart=15
LogLevelError=15
LogLevelShutdown=15
StartFailureTimeout=600000
TransactionInactiveTimeout=5000
TimeBetweenWatchDogCheck=30000
[MYSQLD DEFAULT]
[NDB_MGMD DEFAULT]
datadir=/var/lib/mysql-cluster/nsg
PortNumber=1186
[TCP DEFAULT]
# Primary Mgmt Server
[NDB_MGMD]
id=1
hostname=10.100.40.4
ArbitrationRank=1
[NDB_MGMD]
id=2
hostname=10.100.40.24
ArbitrationRank=2
[NDBD]
id=3
hostname=10.100.40.1
[NDBD]
id=4
hostname=10.100.40.21
[MYSQLD]
[MYSQLD]
[MYSQLD]
- I don't think swapping is an issue as we have lots of memory. The current top output looks like this :
top - 22:32:45 up 67 days, 13:33, 1 user, load average: 0.24, 0.28, 0.22
Tasks: 57 total, 1 running, 56 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.9% us, 4.4% sy, 0.0% ni, 82.8% id, 0.0% wa, 0.0% hi, 0.0% si
Mem: 8310552k total, 8292240k used, 18312k free, 139712k buffers
Swap: 4192956k total, 168k used, 4192788k free, 4814320k cached
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6909 mysql 15 0 3027m 1.4g 1916 S 18 17.7 8:41.37 ndbd
6396 mysql 15 0 1730m 1.7g 3960 S 16 20.8 691:00.73 mysqld
6675 mysql 16 0 17680 1972 508 S 0 0.0 0:00.03 ndbd
- Current free space is as follows:
free -m
total used free shared buffers cached
Mem: 8115 8098 17 0 136 4701
-/+ buffers/cache: 3259 4855
Swap: 4094 0 4094
So i don't think we're hitting the swap at all
- There are backups running every 15 minutes and mysqldumps that go once a night. There are other things affecting bandwidth but there doesn't seem to be a nice correlation between these activites and the outages.
- We do see that during every outage there is a spike in data traffic between the 2 node servers. We are using a gigabyte lan and the amount of traffic is pretty small. We average 10Mb/s between the servers but are having spikes of 40 to 50 M whenever the outage occurs (i'm thinking that these spikes are probably due to the node restarting rather than being the cause of the outage).
- Looking at our load average stats, we don't go over 1.0
- I have the trace files available to upload. I hope they can be of help.
- Some other information from the mysqlbug:
>C compiler: gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
>C++ compiler: gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
>Environment:
<machine, os, target, libraries (multiple lines)>
System: Linux vanprdb0.i.eqo.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux
Architecture: i686
Some paths: /usr/bin/perl /usr/bin/make /usr/bin/gmake
Compilation info: CC='gcc' CFLAGS='-O2 -g -pipe -m32 -march=i386 -mtune=pentium4' CXX='gcc' CXXFLAGS='-O2 -g -pipe -m32 -march=i386 -mtune=pentium4' LDFLAGS='' ASFLAGS=''
LIBC:
lrwxrwxrwx 1 root root 13 Feb 14 2007 /lib/libc.so.6 -> libc-2.3.4.so
-rwxr-xr-x 1 root root 1442931 Jul 4 2006 /lib/libc-2.3.4.so
-rw-r--r-- 1 root root 2418632 Jul 4 2006 /usr/lib/libc.a
-rw-r--r-- 1 root root 204 Jul 4 2006 /usr/lib/libc.so
-rwxr-xr-x 1 root root 767416 Nov 28 2005 /usr/lib/libc-client.so.0
Configure command: ./configure '--disable-shared' '--with-server-suffix=-community' '--without-embedded-server' '--with-ndbcluster' '--with-innodb' '--with-csv-storage-engine' '--with-archive-storage-engine' '--with-blackhole-storage-engine' '--with-example-storage-engine' '--with-federated-storage-engine' '--without-bench' '--with-zlib-dir=bundled' '--with-big-tables' '--enable-assembler' '--enable-local-infile' '--with-mysqld-user=mysql' '--with-unix-socket-path=/var/lib/mysql/mysql.sock' '--with-pic' '--prefix=/' '--with-extra-charsets=all' '--with-yassl' '--exec-prefix=/usr' '--libexecdir=/usr/sbin' '--libdir=/usr/lib' '--sysconfdir=/etc' '--datadir=/usr/share' '--localstatedir=/var/lib/mysql' '--infodir=/usr/share/info' '--includedir=/usr/include' '--mandir=/usr/share/man' '--enable-thread-safe-client' '--with-comment=MySQL Community Edition (GPL)' '--with-readline' 'CC=gcc' 'CFLAGS=-O2 -g -pipe -m32 -march=i386 -mtune=pentium4' 'CXXFLAGS=-O2 -g -pipe -m32 -march=i386 -mtune=pentium4' 'CXX=gcc' 'LDFLAGS='
- Also note that i've been updating the watchdog timeout values to no avail.
How to repeat:
- I've setup an independent system (dual cpu with 4GB ram) and loaded it up with massive inserts updates and deletes which all seem to work okay.
- I hope someone can look at the trace files and suggest some things i could to do try and pinpoint the root cause of this issue. Many thanks in advance.