Bug #32322 | Cluster Backup ends in Segmentation fault and lost node | ||
---|---|---|---|
Submitted: | 13 Nov 2007 15:39 | Modified: | 12 Oct 2009 8:28 |
Reporter: | Michael Neubert | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | mysql-5.1-telco-6.2 | OS: | Linux |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
Tags: | mysql-5.1.22 ndb-6.2.7, segmentation fault |
[13 Nov 2007 15:39]
Michael Neubert
[13 Nov 2007 15:54]
Hartmut Holzgraefe
Please provide the ndb_4_error.log and ndb_4_trace.log.17 for analysis. If a core file was written, add it too. It is unlikely that a core file was generated, though: to enable core files for ndbd processes you need to start them with the --core-file option, and you need to make sure that core files may be created without size restrictions by running the "ulimit -c unlimited" shell command right before starting the ndbd process ...
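The advice above can be sketched as a short shell sequence. The ndbd invocation and the connect string are illustrative placeholders, not details from this report:

```shell
# Remove the core-file size limit for this shell and its children,
# then verify the new limit before launching the data node.
ulimit -c unlimited
ulimit -c    # should report "unlimited"

# Start the data node with core-file generation enabled
# (connect string is a hypothetical example):
# ndbd --core-file --ndb-connectstring=mgmhost:1186
```

The limit applies per shell session, so it must be raised in the same shell that launches ndbd.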
[13 Nov 2007 16:28]
Michael Neubert
Trace and error log are now uploaded. Unfortunately a core file is not available, for the reasons mentioned.
[13 Nov 2007 17:24]
Jonas Oreland
1) Is it repeatable? 2) Can we then get schema+data? /Jonas
[15 Nov 2007 14:08]
Michael Neubert
At the moment the segmentation fault is not repeatable. After restarting the node, backups seem to work as expected. If the error occurs again, I will report it in this thread. I'm sorry, but the data (over 16 GB) and schema (many databases) cannot be transferred.
[19 Nov 2007 15:52]
Michael Neubert
Hello, the segmentation fault occurred again. After restarting the broken node (2 times) and starting a new backup (2 times), there were the same problems. This time there was also a new kind of error code (see below).

Time: Saturday 17 November 2007 - 03:40:24
Status: Temporary error, restart node
Message: Error OS signal received (Internal error, programming error or missing error message, please report a bug)
Error: 6000
Error data: Signal 11 received; Segmentation fault
Error object: main.cpp
Program: ndbd
Pid: 18484
Trace: /var/log/mysql/ndb_2_trace.log.11
Version: mysql-5.1.22 ndb-6.2.7-beta
***EOM***

Time: Saturday 17 November 2007 - 04:57:45
Status: Temporary error, restart node
Message: Assertion (Internal error, programming error or missing error message, please report a bug)
Error: 2301
Error data: ArrayPool<T>::getPtr
Error object: ../../../../../storage/ndb/src/kernel/vm/ArrayPool.hpp line: 349 (block: DBTUP)
Program: ndbd
Pid: 416
Trace: /var/log/mysql/ndb_2_trace.log.12
Version: mysql-5.1.22 ndb-6.2.7-beta

See also the newly attached files. I think I have found the reason for the problems: there were tables with BLOB columns in them. After deleting those tables, the backup works fine again. Best wishes, Michael
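The reporter's hypothesis (backups failing only when BLOB columns are present) suggests the shape of a minimal repro. Everything below is illustrative: the table name, host, and row size are assumptions, not details from this report:

```shell
# Write a minimal schema with a BLOB column; loading it and then running
# a native NDB backup would exercise the reported code path.
cat > /tmp/blob_repro.sql <<'SQL'
CREATE TABLE t1 (id INT PRIMARY KEY, payload BLOB) ENGINE=NDBCLUSTER;
INSERT INTO t1 VALUES (1, REPEAT('x', 60000));
SQL

# Against a running cluster one would then do (placeholders):
# mysql -h sql_node test < /tmp/blob_repro.sql
# ndb_mgm -e "START BACKUP WAIT COMPLETED"
echo "wrote repro script to /tmp/blob_repro.sql"
```

A 60000-byte value is chosen only so the BLOB spills out of the main row into the hidden blob-parts table, which is where backup handling differs from ordinary columns.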
[22 Nov 2007 18:53]
Michael Neubert
What feedback do you need?
[21 Jul 2008 14:26]
Michael Neubert
Hello, the problem still exists after upgrading to NDB 6.2.14. Best wishes Michael
[28 Oct 2008 10:44]
li zhou
Is this repeatable only with your huge data set? Can you provide a simple way to repeat the bug, along with schema and data? Is the cluster environment the same as in bug#38264?
[25 Nov 2008 19:35]
Ionut Dumitru
I am encountering similar behaviour using cluster 6.3.17 with approximately 10 GB of data. I get the issue when trying to import a new database into the system. It doesn't really matter which database; it seems to be related to reaching a certain size. At some point during the import, one of the nodes dies (I have a 10-ndbd setup). I'm running the cluster on CentOS 5.2 32-bit, kernel 2.6 with PAE. Can the 32-bit architecture be an issue in handling the large amount of RAM (I allocated 6 GB to ndbd)? Do you think a 64-bit OS would be better?
[25 Nov 2008 19:39]
Ionut Dumitru
Also, another strange behaviour I encountered while setting up the cluster: I initially had 4 GB of RAM per machine and allocated 2 GB to each ndbd. Then we added an extra 4 GB to each box, so we had 8 GB in total. When doing a complete restart to apply the new configuration, we encountered startup failures with the following values (we had about 2.7 GB of data spread across the nodes in 2 replicas at that time):
- setting memory to 5 GB per ndbd ... always fails to start
- setting memory to 3 GB per ndbd ... always fails to start
- setting memory to 6 GB per ndbd ... always starts correctly
So, again: can there be any issue caused by using a 32-bit architecture?
[26 Nov 2008 3:14]
li zhou
Please try to reproduce this on a 64-bit box. If you get the same error, please attach the log/trace files and the config file.
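Collecting the requested files can be scripted. The log paths below are the defaults seen earlier in this report (/var/log/mysql/...); the config.ini location is an assumption and may differ per installation:

```shell
# Copy whatever error/trace/config files exist into one directory
# and archive it for attachment to the bug report.
mkdir -p /tmp/bug32322
for f in /var/log/mysql/ndb_*_error.log \
         /var/log/mysql/ndb_*_trace.log.* \
         /etc/mysql/config.ini; do
  [ -e "$f" ] && cp "$f" /tmp/bug32322/
done
tar czf /tmp/bug32322.tar.gz -C /tmp bug32322
ls -l /tmp/bug32322.tar.gz
```

The `[ -e "$f" ]` guard makes the script safe to run even when some of the glob patterns match nothing.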
[29 Nov 2008 0:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[2 Dec 2008 22:47]
Ionut Dumitru
Hey, so after some tests I think I can put the blame on the 32-bit version. I have installed the 64-bit RPMs and everything runs smoothly. Other than some overloaded redo log files while importing the databases, I didn't encounter any error. I'm not really sure I'm right, but from what I understand, although a PAE kernel can handle more than 3 GB of RAM, a single process is limited to a maximum of 3 GB. Judging from the tests I ran, that might be the issue. Let me detail the previous steps that may support this argument:

a) The problem: random segfault errors with a configuration allowing 6 GB of RAM for each ndbd.

b) What I did before reaching the 64-bit version:
- Made a lot of attempts with the 6 GB configuration. Restarted with --initial a lot of times, and the error came out in a random pattern ... sometimes it allowed me to do a lot of imports, sometimes it would kick in really soon.
- The hint that reminded me of the 3 GB limit was that when I managed to get the most data into the cluster, the data usage on the nodes was approximately 2.1 GB plus about 100 MB of index.
- I reverted to MySQL Cluster 5.0.67. In this case it never allowed me to allocate more than 2.2 GB of RAM per node (I used 2 GB for data, 200 MB for index). It kept dying during startup with "memory allocation failure". This is when I connected the two pieces and thought that I might be dealing with the 3 GB limit.
- I reinstalled the boxes with a 64-bit OS and then put cluster 6.3.17 on again. It worked like a charm. I tried several configurations without any problems. It swallowed all my data (approximately 5 GB per node) and is very stable across restarts.

Here are the config.ini variables that I used across all the tests (they only worked on the 64-bit version):
NoOfReplicas=2
DataMemory=6000M
IndexMemory=500M
MaxNoOfConcurrentOperations=300000
MaxNoOfAttributes=9000
MaxNoOfTables=2000
MaxNoOfOrderedIndexes=1024

I don't know how to do the exact maths regarding memory consumption, but I figure that with 2 GB of DataMemory and 200 MB of IndexMemory, I would be pretty close to 3 GB. I've spent approximately 3 weeks on this issue and I think my tests are pretty relevant regarding this memory allocation issue. This may be worth pursuing on your side. If this is true, perhaps the 6.3.17 32-bit version should also give memory allocation failures for data above 3 GB. Let me know if I can help you with anything else.
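The 3 GB hypothesis is consistent with simple arithmetic: on stock 32-bit Linux a single process gets roughly 3 GB of user address space regardless of PAE, and ndbd must fit DataMemory, IndexMemory, and its other buffers inside it. A rough back-of-the-envelope check, where the overhead figure is a guess rather than a measured value:

```shell
# Rough check: does the configured ndbd memory fit in a 32-bit
# process address space (~3 GB of user space on stock 32-bit Linux)?
DATA_MEMORY_MB=6000      # DataMemory from the config above
INDEX_MEMORY_MB=500      # IndexMemory from the config above
OVERHEAD_MB=300          # assumed code/heap/buffer overhead (a guess)
LIMIT_MB=3072            # approximate per-process user address space

TOTAL_MB=$((DATA_MEMORY_MB + INDEX_MEMORY_MB + OVERHEAD_MB))
echo "configured total: ${TOTAL_MB} MB, 32-bit limit: ${LIMIT_MB} MB"
if [ "$TOTAL_MB" -gt "$LIMIT_MB" ]; then
  echo "will not fit in a 32-bit process"
fi
```

With these numbers the configured total (6800 MB) exceeds the limit by more than a factor of two, which matches the observation that the same configuration only ran on the 64-bit build.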
[9 Jan 2009 18:14]
Michael Neubert
Hello, I am sorry, but it is only repeatable with our huge data set, so I cannot give you a simple test case. The cluster environment is the same as mentioned in bug#38264. Best wishes, Michael
[19 May 2009 12:29]
Jonathan Miller
Does this still happen with latest version?
[20 May 2009 10:54]
Michael Neubert
Hello, I'm sorry, but we no longer use the Cluster storage engine for the mentioned project, so no further information or tests are possible. Best wishes, Michael