Bug #32322 | Cluster Backup ends in Segmentation fault and lost node | ||
---|---|---|---|
Submitted: | 13 Nov 2007 15:39 | Modified: | 12 Oct 2009 8:28 |
Reporter: | Michael Neubert | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-6.2 | OS: | Linux |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
Tags: | mysql-5.1.22 ndb-6.2.7, segmentation fault |
[13 Nov 2007 15:39]
Michael Neubert
[13 Nov 2007 15:54]
Hartmut Holzgraefe
Please provide the ndb_4_error.log and ndb_4_trace.log.17 for analysis. If a core file was written, add it too. It is unlikely that a core file was generated, though: to enable core files for ndbd processes you need to start them with the --core-file option, and you need to make sure that core files may be created without size restrictions by running the "ulimit -c unlimited" shell command right before starting the ndbd process ...
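The advice above can be sketched as a short shell sequence. The ndbd invocation and the connect string are illustrative placeholders, not details from this report:

```shell
# Remove the core-file size limit for this shell and its children,
# then verify the new limit before launching the data node.
ulimit -c unlimited
ulimit -c    # should report "unlimited"

# Start the data node with core-file generation enabled
# (connect string is a hypothetical example):
# ndbd --core-file --ndb-connectstring=mgmhost:1186
```

The limit applies per shell session, so it must be raised in the same shell that launches ndbd.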
[13 Nov 2007 16:28]
Michael Neubert
Trace and error log are now uploaded. Unfortunately a core file is not available, for the reasons mentioned.
[13 Nov 2007 17:24]
Jonas Oreland
1) Is it repeatable? 2) Can we then get schema+data? /Jonas
[15 Nov 2007 14:08]
Michael Neubert
At the moment the segmentation fault is not repeatable. After restarting the node, backups seem to work as expected. If the error occurs again, I will report it in this thread. I'm sorry, but the data (over 16 GB) and schema (many databases) cannot be transferred.
[19 Nov 2007 15:52]
Michael Neubert
Hello, the segmentation fault occurred again. After restarting the broken node (2 times) and starting a new backup (2 times), there were the same problems. This time there was also a new kind of error code (see below).

Time: Saturday 17 November 2007 - 03:40:24
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ndbd
Pid: 18484
Trace: /var/log/mysql/ndb_2_trace.log.11
Version: mysql-5.1.22 ndb-6.2.7-beta
***EOM***

Time: Saturday 17 November 2007 - 04:57:45
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: ArrayPool<T>::getPtr
Error object: ../../../../../storage/ndb/src/kernel/vm/ArrayPool.hpp line: 349 (block: DBTUP)
Program: ndbd
Pid: 416
Trace: /var/log/mysql/ndb_2_trace.log.12
Version: mysql-5.1.22 ndb-6.2.7-beta

See also the newly attached files. I think I have found the reason for the problems: there were tables with BLOB columns in them. After deleting those tables, the backup works fine again. Best wishes, Michael
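The reporter's hypothesis (backups failing only when BLOB columns are present) suggests the shape of a minimal repro. Everything below is illustrative: the table name, host, and row size are assumptions, not details from this report:

```shell
# Write a minimal schema with a BLOB column; loading it and then running
# a native NDB backup would exercise the reported code path.
cat > /tmp/blob_repro.sql <<'SQL'
CREATE TABLE t1 (id INT PRIMARY KEY, payload BLOB) ENGINE=NDBCLUSTER;
INSERT INTO t1 VALUES (1, REPEAT('x', 60000));
SQL

# Against a running cluster one would then do (placeholders):
# mysql -h sql_node test < /tmp/blob_repro.sql
# ndb_mgm -e "START BACKUP WAIT COMPLETED"
echo "wrote repro script to /tmp/blob_repro.sql"
```

A 60000-byte value is chosen only so the BLOB spills out of the main row into the hidden blob-parts table, which is where backup handling differs from ordinary columns.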
[22 Nov 2007 18:53]
Michael Neubert
What feedback do you need?
[21 Jul 2008 14:26]
Michael Neubert
Hello, the problem still exists after upgrading to NDB 6.2.14. Best wishes Michael
[28 Oct 2008 10:44]
li zhou
Is this repeatable only with your huge data set? Can you provide a simple way to repeat the bug, along with schema and data? Is the cluster environment the same as in bug#38264?
[25 Nov 2008 19:35]
Ionut Dumitru
I am encountering similar behaviour using cluster 6.3.17 with approximately 10 GB of data. I get the issue when trying to import a new database into the system. It doesn't really matter which database; it seems to be related to reaching a certain size. At some point during the import, one of the nodes dies (I have a 10-ndbd setup). I'm running the cluster on CentOS 5.2 32-bit, kernel 2.6 with PAE. Can the 32-bit architecture be an issue in handling the large amount of RAM (I allocated 6 GB to ndbd)? Do you think a 64-bit OS would be better?
[25 Nov 2008 19:39]
Ionut Dumitru
Also, another strange behaviour I encountered while setting up the cluster: I initially had 4 GB of RAM per machine and allocated 2 GB to each ndbd. Then we added an extra 4 GB to each box, so we had 8 GB in total. When doing a complete restart to apply the new configuration, we encountered startup failures with the following values (we had about 2.7 GB of data spread across the nodes in 2 replicas at that time):
- setting memory to 5 GB per ndbd ... always fails to start
- setting memory to 3 GB per ndbd ... always fails to start
- setting memory to 6 GB per ndbd ... always starts correctly
So, again: can there be any issue caused by using a 32-bit architecture?
[26 Nov 2008 3:14]
li zhou
Please try to reproduce this on a 64-bit box. If you get the same error, please attach the log/trace files and the config file.
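Collecting the requested files can be scripted. The log paths below are the defaults seen earlier in this report (/var/log/mysql/...); the config.ini location is an assumption and may differ per installation:

```shell
# Copy whatever error/trace/config files exist into one directory
# and archive it for attachment to the bug report.
mkdir -p /tmp/bug32322
for f in /var/log/mysql/ndb_*_error.log \
         /var/log/mysql/ndb_*_trace.log.* \
         /etc/mysql/config.ini; do
  [ -e "$f" ] && cp "$f" /tmp/bug32322/
done
tar czf /tmp/bug32322.tar.gz -C /tmp bug32322
ls -l /tmp/bug32322.tar.gz
```

The `[ -e "$f" ]` guard makes the script safe to run even when some of the glob patterns match nothing.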
[29 Nov 2008 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[2 Dec 2008 22:47]
Ionut Dumitru
Hey, so after some tests I think I can put the blame on the 32-bit version. I have installed the 64-bit RPMs and everything runs smoothly. Other than some overloaded redo log files while importing the databases, I didn't encounter any error. I'm not really sure I'm right, but from what I understand, although a PAE kernel can handle more than 3 GB of RAM, a single process is limited to a maximum of 3 GB. Judging from the tests I ran, that might be the issue. Let me detail the previous steps that may support this argument:

a) The problem: random segfault errors with a configuration allowing 6 GB of RAM for each ndbd.

b) What I did before reaching the 64-bit version:
- Made a lot of attempts with the 6 GB configuration. Restarted with --initial a lot of times, and the error came out in a random pattern ... sometimes it allowed me to do a lot of imports, sometimes it would kick in really soon.
- The hint that reminded me of the 3 GB limit was that when I managed to get the most data into the cluster, the data usage on the nodes was approximately 2.1 GB plus about 100 MB of index.
- I reverted to MySQL Cluster 5.0.67. In this case it never allowed me to allocate more than 2.2 GB of RAM per node (I used 2 GB for data, 200 MB for index). It kept dying during startup with "memory allocation failure". This is when I connected the two pieces and thought that I might be dealing with the 3 GB limit.
- I reinstalled the boxes with a 64-bit OS and then put cluster 6.3.17 on again. It worked like a charm. I tried several configurations without any problems. It swallowed all my data (approximately 5 GB per node) and is very stable across restarts.

Here are the config.ini variables that I used across all the tests (they only worked on the 64-bit version):
NoOfReplicas=2
DataMemory=6000M
IndexMemory=500M
MaxNoOfConcurrentOperations=300000
MaxNoOfAttributes=9000
MaxNoOfTables=2000
MaxNoOfOrderedIndexes=1024

I don't know how to do the exact maths regarding memory consumption, but I figure that with 2 GB of DataMemory and 200 MB of IndexMemory, I would be pretty close to 3 GB. I've spent approximately 3 weeks on this issue and I think my tests are pretty relevant regarding this memory allocation issue. This may be worth pursuing on your side. If this is true, perhaps the 6.3.17 32-bit version should also give memory allocation failures for data above 3 GB. Let me know if I can help you with anything else.
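The 3 GB hypothesis is consistent with simple arithmetic: on stock 32-bit Linux a single process gets roughly 3 GB of user address space regardless of PAE, and ndbd must fit DataMemory, IndexMemory, and its other buffers inside it. A rough back-of-the-envelope check, where the overhead figure is a guess rather than a measured value:

```shell
# Rough check: does the configured ndbd memory fit in a 32-bit
# process address space (~3 GB of user space on stock 32-bit Linux)?
DATA_MEMORY_MB=6000      # DataMemory from the config above
INDEX_MEMORY_MB=500      # IndexMemory from the config above
OVERHEAD_MB=300          # assumed code/heap/buffer overhead (a guess)
LIMIT_MB=3072            # approximate per-process user address space

TOTAL_MB=$((DATA_MEMORY_MB + INDEX_MEMORY_MB + OVERHEAD_MB))
echo "configured total: ${TOTAL_MB} MB, 32-bit limit: ${LIMIT_MB} MB"
if [ "$TOTAL_MB" -gt "$LIMIT_MB" ]; then
  echo "will not fit in a 32-bit process"
fi
```

With these numbers the configured total (6800 MB) exceeds the limit by more than a factor of two, which matches the observation that the same configuration only ran on the 64-bit build.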
[9 Jan 2009 18:14]
Michael Neubert
Hello, I am sorry, but it is only repeatable with our huge data set, so I cannot give you a simple test case. The cluster environment is the same as mentioned in bug#38264. Best wishes, Michael
[19 May 2009 12:29]
Jonathan Miller
Does this still happen with latest version?
[20 May 2009 10:54]
Michael Neubert
Hello, I'm sorry, but we no longer use the Cluster storage engine for the mentioned project, so no further information or tests are possible. Best wishes, Michael