Bug #33392 constant massive overload statements in cluster
Submitted: 20 Dec 2007 6:43 Modified: 10 Jun 2008 21:53
Reporter: Yong Lee Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S2 (Serious)
Version:5.0.37 OS:Linux (redhat es4)
Assigned to: CPU Architecture:Any
Tags: overload

[20 Dec 2007 6:43] Yong Lee
Description:
We have the following configuration:
2 servers each running 1 mysql and 1 ndb node
2 servers each running an ndb_mgmd node
The ndb servers are dual core xeon processors with 8GB of ram each.

one of the ndb nodes is consistently going down around 2 or 3 times a day with the following in the logs:

Time: Wednesday 19 December 2007 - 00:40:59
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 10608
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.2
Version: Version 5.0.37
***EOM***

Time: Wednesday 19 December 2007 - 03:34:37
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 3973
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.3
Version: Version 5.0.37
***EOM***

Time: Wednesday 19 December 2007 - 05:16:14
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 13892
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.4
Version: Version 5.0.37
***EOM***

Time: Wednesday 19 December 2007 - 19:16:50
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 20201
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.5
Version: Version 5.0.37
***EOM***

Time: Wednesday 19 December 2007 - 20:29:21
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missi
ng error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 2699
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.6
Version: Version 5.0.37
***EOM***

Time: Wednesday 19 December 2007 - 20:30:53
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: ArrayPool<T>::getPtr
Error object: ../../../../../ndb/src/kernel/vm/ArrayPool.hpp line: 360 (block: BACKUP)
Program: /usr/sbin/ndbd
Pid: 6618
Trace: /var/lib/mysql-cluster/ndb_3_trace.log.7
Version: Version 5.0.37
***EOM***

We are using the following config.ini file:
[NDBD DEFAULT]
NoOfReplicas=2
datadir=/var/lib/mysql-cluster
DataMemory=2400M
IndexMemory=150M
MaxNoOfAttributes=2000
MaxNoOfTables=128
MaxNoOfOrderedIndexes=400
MaxNoOfUniqueHashIndexes=400
MaxNoOfTriggers=1500
NoOfDiskPagesToDiskAfterRestartTUP=94
NoOfDiskPagesToDiskAfterRestartACC=13
NoOfDiskPagesToDiskDuringRestartTUP=94
NoOfDiskPagesToDiskDuringRestartACC=13
NoOfFragmentLogFiles=9
MaxNoOfConcurrentTransactions=10000
MaxNoOfConcurrentOperations=100000
StopOnError=0
LogLevelStartup=15
LogLevelNodeRestart=15
LogLevelError=15
LogLevelShutdown=15

StartFailureTimeout=600000
TransactionInactiveTimeout=5000
TimeBetweenWatchDogCheck=30000

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]
datadir=/var/lib/mysql-cluster/nsg
PortNumber=1186

[TCP DEFAULT]

# Primary Mgmt Server
[NDB_MGMD]
id=1
hostname=10.100.40.4
ArbitrationRank=1

[NDB_MGMD]
id=2
hostname=10.100.40.24
ArbitrationRank=2

[NDBD]
id=3
hostname=10.100.40.1

[NDBD]
id=4
hostname=10.100.40.21

[MYSQLD]
[MYSQLD]
[MYSQLD]

- I don't think swapping is an issue as we have lots of memory.  The current top output looks like this :

top - 22:32:45 up 67 days, 13:33,  1 user,  load average: 0.24, 0.28, 0.22
Tasks:  57 total,   1 running,  56 sleeping,   0 stopped,   0 zombie
Cpu(s): 12.9% us,  4.4% sy,  0.0% ni, 82.8% id,  0.0% wa,  0.0% hi,  0.0% si
Mem:   8310552k total,  8292240k used,    18312k free,   139712k buffers
Swap:  4192956k total,      168k used,  4192788k free,  4814320k cached

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 6909 mysql     15   0 3027m 1.4g 1916 S   18 17.7   8:41.37 ndbd
 6396 mysql     15   0 1730m 1.7g 3960 S   16 20.8 691:00.73 mysqld
 6675 mysql     16   0 17680 1972  508 S    0  0.0   0:00.03 ndbd

- Current free space is as follows:

free -m
             total       used       free     shared    buffers     cached
Mem:          8115       8098         17          0        136       4701
-/+ buffers/cache:       3259       4855
Swap:         4094          0       4094

So i don't think we're hitting the swap at all

- There are backups running every 15 minutes and mysqldumps that go once a night.  There are other things affecting bandwidth but there doesn't seem to be a nice correlation between these activites and the outages.

- We do see that during every outage there is a spike in data traffic between the 2 node servers.  We are using a gigabyte lan and the amount of traffic is pretty small.  We average 10Mb/s between the servers but are having spikes of 40 to 50 M whenever the outage occurs (i'm thinking that these spikes are probably due to the node restarting rather than being the cause of the outage).

- Looking at our load average stats, we don't go over 1.0

- I have the trace files available to upload.  I hope they can be of help.

- Some other information from the mysqlbug:

>C compiler:    gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
>C++ compiler:  gcc (GCC) 3.4.4 20050721 (Red Hat 3.4.4-2)
>Environment:
        <machine, os, target, libraries (multiple lines)>
System: Linux vanprdb0.i.eqo.com 2.6.9-42.0.3.ELsmp #1 SMP Mon Sep 25 17:28:02 EDT 2006 i686 i686 i386 GNU/Linux
Architecture: i686

Some paths:  /usr/bin/perl /usr/bin/make /usr/bin/gmake

Compilation info: CC='gcc'  CFLAGS='-O2 -g -pipe -m32 -march=i386 -mtune=pentium4'  CXX='gcc'  CXXFLAGS='-O2 -g -pipe -m32 -march=i386 -mtune=pentium4'  LDFLAGS=''  ASFLAGS=''
LIBC:
lrwxrwxrwx  1 root root 13 Feb 14  2007 /lib/libc.so.6 -> libc-2.3.4.so
-rwxr-xr-x  1 root root 1442931 Jul  4  2006 /lib/libc-2.3.4.so
-rw-r--r--  1 root root 2418632 Jul  4  2006 /usr/lib/libc.a
-rw-r--r--  1 root root 204 Jul  4  2006 /usr/lib/libc.so
-rwxr-xr-x  1 root root 767416 Nov 28  2005 /usr/lib/libc-client.so.0
Configure command: ./configure '--disable-shared' '--with-server-suffix=-community' '--without-embedded-server' '--with-ndbcluster' '--with-innodb' '--with-csv-storage-engine' '--with-archive-storage-engine' '--with-blackhole-storage-engine' '--with-example-storage-engine' '--with-federated-storage-engine' '--without-bench' '--with-zlib-dir=bundled' '--with-big-tables' '--enable-assembler' '--enable-local-infile' '--with-mysqld-user=mysql' '--with-unix-socket-path=/var/lib/mysql/mysql.sock' '--with-pic' '--prefix=/' '--with-extra-charsets=all' '--with-yassl' '--exec-prefix=/usr' '--libexecdir=/usr/sbin' '--libdir=/usr/lib' '--sysconfdir=/etc' '--datadir=/usr/share' '--localstatedir=/var/lib/mysql' '--infodir=/usr/share/info' '--includedir=/usr/include' '--mandir=/usr/share/man' '--enable-thread-safe-client' '--with-comment=MySQL Community Edition (GPL)' '--with-readline' 'CC=gcc' 'CFLAGS=-O2 -g -pipe -m32 -march=i386 -mtune=pentium4' 'CXXFLAGS=-O2 -g -pipe -m32 -march=i386 -mtune=pentium4' 'CXX=gcc' 'LDFLAGS='

- Also note that i've been updating the watchdog timeout values to no avail.

How to repeat:
- I've setup an independent system (dual cpu with 4GB ram) and loaded it up with massive inserts updates and deletes which all seem to work okay.

- I hope someone can look at the trace files and suggest some things i could to do try and pinpoint the root cause of this issue.  Many thanks in advance.
[20 Dec 2007 6:47] Yong Lee
Also note that although this problem is occurring primarily on node 3, i do get occassional massive overload statements on ndb node 4:

Time: Wednesday 19 December 2007 - 22:41:08
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing
error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: /usr/sbin/ndbd
Pid: 668
Trace: /var/lib/mysql-cluster/ndb_4_trace.log.9
Version: Version 5.0.37
***EOM***
[30 Jan 2008 23:26] Yong Lee
first secion of log trace file

Attachment: ndb_4_trace.log.1 (application/octet-stream, text), 13.86 KiB.

[15 May 2008 2:54] Yong Lee
also note that the exact same query run against myisam tables do not have any impact on the virutal or resident memory sizes (ie: i created a copy of the tables using the same create table syntax but with myisam storage engines and then ran the same query, monitoring memory through top).
[4 Jun 2008 20:40] Sveta Smirnova
Thank you for the report.

We noticed 3 problems in data provided:

1. Version 5.0.37 is old. Please try current version 5.0.51b
2. You have mysqld and ndbd on the same host which is not recommended.
3. Last message in the trace file " FileFlags: H'00000302 Open read and write, Create new file, Truncate existing file" Which can indicate hardware problems.

Please check your hardware, try with current version 5.0.51b and inform us if problem still exists.
[10 Jun 2008 21:53] Yong Lee
yes, the problem seems to have gone away now that we have upgraded to 5.0.51a.