Bug #22292 Cluster node dies while loading large dump file
Submitted: 13 Sep 2006 2:18 Modified: 23 Sep 2006 11:40
Reporter: Steve Wolf
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 5.0.24a
OS: Linux (CentOS 4.4 x86_64)
Assigned to:
CPU Architecture: Any
Tags: declared dead, signal 0

[13 Sep 2006 2:18] Steve Wolf
Description:
When loading a database into a newly created NDB storage engine on a four-node cluster, a node dies.  The management node reports:

Node 1: Forced node shutdown completed. Initiated by signal 0.

The node reports:

Status: Unknown
Message: No message slogan found (please report a bug if you get this error code) (Unknown)
Error: 0
Error data: We(1) have been declared dead by 2 reason: Heartbeat failure(4)
Error object: QMGR (Line: 2840) 0x0000000a
Program: /usr/local/mysql/bin/ndbd
Pid: 24025
Trace: /usr/local/mysql/data/ndb_1_trace.log.1
Version: Version 5.0.24

When I first created the cluster, I made it too small.  Partway through the data load, I got this error on Node 1 (Nodegroup 0).  When the NDB storage engine filled up, I got it again, this time on Node 3 (Nodegroup 1).  Now I'm loading the same data into a much larger four-node cluster, and got the error much later in the process on Node 1 (Nodegroup 0).  This leads me to speculate on the cause...

I believe this happens when the first node group fills up, and again when the second node group fills up.  Something happens at the point where the node group runs out of room.

I have classified this as S2 rather than S1 because the cluster does what it is supposed to do and continues to work.  The data load is interrupted with an error:

ERROR 1296 (HY000) at line 463: Got error 1 'Unknown error code' from ndbcluster

So I have to modify the input file, stripping out everything from the beginning of the file up to that line, before resuming the load.

How to repeat:
Create a small four-node cluster and populate it with a large dump file.
[13 Sep 2006 2:25] Steve Wolf
Trace file

Attachment: ndb_1_trace.log.1.gz (application/x-gzip, text), 65.28 KiB.

[13 Sep 2006 2:34] Steve Wolf
More on sizes.  My first attempt, on four machines with 2GB RAM each, had the global settings:

DataMemory=1280M  # How much memory to allocate for data storage
IndexMemory=128M  # How much memory to allocate for index storage

My larger engine, on the same four machines with 4GB RAM each, has the global settings:

DataMemory=2816M  # How much memory to allocate for data storage
IndexMemory=256M  # How much memory to allocate for index storage
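
Rough arithmetic, for scale (DataMemory and IndexMemory are allocated per data node; this assumes nothing else large runs on these boxes):

2GB machines: 1280M + 128M = 1408M configured, of 2048M physical
4GB machines: 2816M + 256M = 3072M configured, of 4096M physical

In both cases the allocation leaves limited headroom for the rest of the ndbd process and the operating system.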
[13 Sep 2006 9:22] Jonas Oreland
Hi,

A common cause of missed heartbeats is swapping.
This can be observed using top and/or vmstat.

ndbd can also be locked in memory, so that _no_ part of it is ever swapped out.
This is enabled by the "LockPagesInMainMemory" option.
However, you must be root when starting ndbd to use it
  (or have a fairly recent Linux kernel, and raise ulimit -l).
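
For example (a sketch; the exact boolean syntax may vary between versions), in the [ndbd default] section of config.ini:

[ndbd default]
LockPagesInMainMemory=1

If you are not starting ndbd as root, the locked-memory limit for that user must be raised first, e.g.:

ulimit -l unlimited

(raising the hard limit itself typically requires root).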

Can you try/check this?

/Jonas
[13 Sep 2006 21:32] Steve Wolf
Locking memory indeed resolved the problem.  I had assumed memory was locked by default.

I request that LockPagesInMainMemory=Y be made the default.  Why wouldn't you want to lock memory?  If you don't, the cluster doesn't work.

Thanks for the diagnosis!

Regards,
Steve
[13 Sep 2006 21:35] Jonas Oreland
The reason this is not the default is that you need to be root to use it
  (or have a "kind" root, who lets you do it)

So I think it would cause more problems than it would solve...
  (even though running into this problem is not uncommon)

/Jonas
[14 Sep 2006 10:47] Steve Wolf
Make the default context-dependent: if ndbd is started as root, the default value is True; otherwise it's False.  Pseudo-code:

    /* lock by default only when the real or effective uid is root */
    lock_memory_default = (getuid() == 0 || geteuid() == 0);
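
A minimal runnable sketch of the idea (hypothetical code; it assumes mlockall() as the underlying locking mechanism, which is not necessarily how ndbd implements it):

    #include <stdio.h>      /* perror */
    #include <sys/mman.h>   /* mlockall, MCL_CURRENT, MCL_FUTURE */
    #include <unistd.h>     /* getuid, geteuid */

    int main(void)
    {
        /* Lock by default only when started as root; an unprivileged
           process is limited by RLIMIT_MEMLOCK and would likely fail. */
        int lock_memory = (getuid() == 0 || geteuid() == 0);

        if (lock_memory && mlockall(MCL_CURRENT | MCL_FUTURE) != 0)
            perror("mlockall");   /* report, then run unlocked */

        /* ... data node startup would continue here ... */
        return 0;
    }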

Just a thought.  Other than that, it's okay to close this bug.

Regards,
Steve