Bug #23297 | Node startup fails when using NPTL | ||
---|---|---|---|
Submitted: | 14 Oct 2006 19:38 | Modified: | 19 Nov 2007 18:09 |
Reporter: | Hamid Badiozamani | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 5.0.45 | OS: | Linux (Gentoo) |
Assigned to: | | CPU Architecture: | Any |
Tags: | NPTL |
[14 Oct 2006 19:38]
Hamid Badiozamani
[14 Oct 2006 19:39]
Hamid Badiozamani
Trace log for NPTL node
Attachment: ndb_12_trace.log.20 (text/plain), 25.79 KiB.
[14 Oct 2006 19:42]
Hamid Badiozamani
A brief discussion of the problem was posted on the MySQL Cluster forums:
http://forums.mysql.com/read.php?25,119963,119963#msg-119963

Of note would be the system calls taking place at the time the node was hanging. From the forum thread:

# strace -p 8154
Process 8154 attached - interrupt to quit
waitpid(8155, <unfinished ...>
Process 8154 detached

So, I did an strace on the child thread and this was the result:

# strace -p 8155
Process 8155 attached - interrupt to quit
futex(0x8654a94, FUTEX_WAIT, 1, NULL) = -1 EINTR (Interrupted system call)
PANIC: attached pid 8155 exited
Process 8155 detached

It hung there for a long time before the "= -1 EINTR" came up, so I'm assuming it's hanging on that system call.
[16 Oct 2006 11:46]
Valeriy Kravchuk
Thank you for the problem report. Please try to repeat with a newer version, 5.0.26, and let us know the results.
[17 Oct 2006 4:23]
Hamid Badiozamani
The exact same outcome is evident after upgrading to 5.0.26. The trace/error/output logs follow.
[17 Oct 2006 4:24]
Hamid Badiozamani
Error log
Attachment: ndb_13_error.log (application/octet-stream, text), 568 bytes.
[17 Oct 2006 4:24]
Hamid Badiozamani
Output log
Attachment: ndb_13_out.log (application/octet-stream, text), 911 bytes.
[17 Oct 2006 4:24]
Hamid Badiozamani
Trace log
Attachment: ndb_13_trace.log.1 (application/x-stuffit, text), 25.79 KiB.
[19 Oct 2006 12:47]
Jonas Oreland
Hi,
1) Are you running short on RAM on the machine(s)?
2) Are you using LockPagesInMemory?
The only explanation I could find/guess would apply if you answered yes to 1 (and maybe 2).
[22 Oct 2006 4:35]
Hamid Badiozamani
No, I'm running on 4GB of memory and the cluster uses 2048MB of that. And I'm not familiar with LockPagesInMemory and therefore I'm not using it. Has the product been tested with NPTL on Linux? Are others able to run the cluster using NPTL?
[3 Jan 2007 14:26]
Hartmut Holzgraefe
Hi, what kind of CPU are you running this on? And what distribution are you using? I'd like to reproduce the exact glibc upgrade situation for this bug, as we've not seen any similar NPTL-related issues yet ...
[3 Jan 2007 18:01]
Hamid Badiozamani
Sure, the distribution I'm using is Gentoo. All machines used are identical. The cluster capacity usage hovers around 88%. There are 8 data nodes total.

To reproduce, start up the cluster with the following libc:

GNU C Library stable release version 2.3.6, by Roland McGrath et al.
Copyright (C) 2005 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compiled by GNU CC version 3.3.6 (Gentoo 3.3.6, ssp-3.3.6-1.0, pie-8.7.8).
Compiled on a Linux 2.6.11 system on 2006-08-25.
Available extensions:
    GNU libio by Per Bothner
    crypt add-on version 2.1 by Michael Glad and others
    linuxthreads-0.10 by Xavier Leroy
    The C stubs add-on version 2.1.2.
    GNU Libidn by Simon Josefsson
    BIND-8.2.3-T5B
    libthread_db work sponsored by Alpha Processor Inc
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

Then upgrade to this libc:

GNU C Library development release version 2.4, by Roland McGrath et al.
Copyright (C) 2006 Free Software Foundation, Inc.
This is free software; see the source for copying conditions.
There is NO warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Compiled by GNU CC version 4.1.1 (Gentoo 4.1.1).
Compiled on a Linux 2.6.17 system on 2006-10-07.
Available extensions:
    The C stubs add-on version 2.1.2.
    crypt add-on version 2.1 by Michael Glad and others
    GNU Libidn by Simon Josefsson
    GNU libio by Per Bothner
    NIS(YP)/NIS+ NSS modules 0.19 by Thorsten Kukuk
    Native POSIX Threads Library by Ulrich Drepper et al
    Support for some architectures added on, not maintained in glibc core.
    BIND-8.2.3-T5B
Thread-local storage support included.
For bug reporting instructions, please see:
<http://www.gnu.org/software/libc/bugs.html>.

From /proc/meminfo:

MemTotal: 4091800 kB

From /proc/cpuinfo:

processor : 0
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 1992.263
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 3991.79

processor : 1
vendor_id : AuthenticAMD
cpu family : 15
model : 5
model name : AMD Opteron(tm) Processor 246
stepping : 10
cpu MHz : 1992.263
cache size : 1024 KB
fdiv_bug : no
hlt_bug : no
f00f_bug : no
coma_bug : no
fpu : yes
fpu_exception : yes
cpuid level : 1
wp : yes
flags : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx mmxext lm 3dnowext 3dnow
bogomips : 3983.84

From config.ini:

[NDBD DEFAULT]
NoOfReplicas=2
DataMemory=2040M
IndexMemory=512M
MaxNoOfOrderedIndexes=250
MaxNoOfUniqueHashIndexes=250
MaxNoOfConcurrentOperations=250000
MaxNoOfConcurrentTransactions=8192
MaxNoOfConcurrentIndexOperations=65536
DataDir=/opt/mysql/ndb

[TCP DEFAULT]
portnumber=2202
SendBufferMemory=4M
ReceiveBufferMemory=2M

From ndb_mgm:

ndb_mgm> show
Connected to Management Server at: xxx.xxx.com:1186
Cluster Configuration
---------------------
[ndbd(NDB)] 8 node(s)
id=12 @192.168.0.210 (Version: 5.0.27, Nodegroup: 0, Master)
id=13 @192.168.0.209 (Version: 5.0.27, Nodegroup: 0)
id=14 @192.168.0.212 (Version: 5.0.27, Nodegroup: 1)
id=15 @192.168.0.215 (Version: 5.0.27, Nodegroup: 1)
id=16 @192.168.0.217 (Version: 5.0.27, Nodegroup: 2)
id=17 @192.168.0.218 (Version: 5.0.27, Nodegroup: 2)
id=18 @192.168.0.219 (Version: 5.0.27, Nodegroup: 3)
id=19 @192.168.0.220 (Version: 5.0.27, Nodegroup: 3)

[ndb_mgmd(MGM)] 1 node(s)
id=2 @192.168.0.211 (Version: 5.0.27)

[mysqld(API)] 5 node(s)
id=22 @192.168.0.209 (Version: 5.0.27)
id=24 @192.168.0.211 (Version: 5.0.27)
id=26 @192.168.0.213 (Version: 5.0.26)
id=27 @192.168.0.214 (Version: 5.0.26)
id=29 @192.168.0.216 (Version: 5.0.26)
[23 May 2007 9:32]
Valeriy Kravchuk
Please try to repeat with a newer version, 5.0.41. If the problem persists, please send the results of:

getconf GNU_LIBC_VERSION
getconf GNU_LIBPTHREAD_VERSION
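The same two values can also be read from a script, which may be handy when collecting them from several machines. This is only a minimal sketch in Python, assuming a glibc-based Linux system where os.confstr exposes the CS_GNU_LIBC_VERSION and CS_GNU_LIBPTHREAD_VERSION names. The function name gnu_library_versions is illustrative and is not part of MySQL or glibc.

```python
import os

# confstr names that glibc-based systems are assumed to expose;
# getconf takes the same names without the "CS_" prefix.
NAMES = ("CS_GNU_LIBC_VERSION", "CS_GNU_LIBPTHREAD_VERSION")


def gnu_library_versions():
    """Return {name: value or None} for the C library and the thread library."""
    # os.confstr_names is missing on some platforms, so fall back to an empty map.
    confstr_names = getattr(os, "confstr_names", {})
    versions = {}
    for name in NAMES:
        value = None
        if name in confstr_names:
            try:
                value = os.confstr(name)
            except (OSError, ValueError):
                # The name is known but the system cannot answer the query.
                value = None
        versions[name] = value
    return versions


if __name__ == "__main__":
    for name, value in gnu_library_versions().items():
        print(name, "=", value if value is not None else "(not available)")
```

A value of None means the platform does not provide that name, in which case the getconf commands above remain the reference.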
[23 Jun 2007 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[24 Jun 2007 3:00]
Hamid Badiozamani
Yes, the problem still persists.

**********************************
Time: Saturday 23 June 2007 - 19:58:24
Status: Temporary error, restart node
Message: WatchDog terminate, internal error or massive overload on the machine running this node (Internal error, programming error or missing error message, please report a bug)
Error: 6050
Error data: Job Handling
Error object: WatchDog.cpp
Program: ndbd
Pid: 24533
Trace: /opt/mysql/ndb/ndb_13_trace.log.2
Version: Version 5.0.40
***EOM***

Here are the results of the LIBC and LIBPTHREAD versions:

~ # getconf GNU_LIBC_VERSION
glibc 2.4
~ # getconf GNU_LIBPTHREAD_VERSION
NPTL 2.4
[19 Nov 2007 11:30]
Bogdan Kecman
I cannot repeat this on glibc/NPTL 2.6 or 2.7, using either 5.0.41 or 5.0.45. I believe the solution is to upgrade the system to a newer glibc.
[19 Nov 2007 11:36]
Bogdan Kecman
Cannot repeat on Gentoo with glibc 2.6 and MySQL 5.0.41 or 5.0.45.
Cannot repeat on Fedora 7 with glibc 2.6 and MySQL 5.0.41 or 5.0.45.
Cannot repeat on Fedora 8 with glibc 2.7 and MySQL 5.0.41 or 5.0.45.
[19 Nov 2007 18:09]
Hamid Badiozamani
Happens when one of the nodes is using the new glibc (NPTL) and others are using the old glibc (linuxthreads). Solution is to use a homogeneous GLIBC across all machines.
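For anyone who wants to check for this condition before starting a cluster, here is a rough sketch in Python. It assumes the output of getconf GNU_LIBPTHREAD_VERSION has already been collected from each data node by some other means. The function names and the sample mapping of node ids to version strings are hypothetical: only "NPTL 2.4" on node 13 and the linuxthreads-0.10 add-on name appear in this report, and the exact string an old LinuxThreads glibc prints is assumed to start with "linuxthreads".

```python
def thread_library(pthread_version):
    """Classify a GNU_LIBPTHREAD_VERSION string as 'nptl', 'linuxthreads' or 'unknown'."""
    text = pthread_version.strip().lower()
    if text.startswith("nptl"):
        return "nptl"
    if text.startswith("linuxthreads"):
        return "linuxthreads"
    return "unknown"


def mixed_thread_libraries(reported):
    """Group node ids by thread library.

    Returns the groups only when more than one library is in use,
    otherwise an empty dict (the cluster is homogeneous).
    """
    groups = {}
    for node_id, version in reported.items():
        groups.setdefault(thread_library(version), []).append(node_id)
    return groups if len(groups) > 1 else {}


# Hypothetical sample: node ids borrowed from this report, strings for illustration only.
reported = {12: "linuxthreads-0.10", 13: "NPTL 2.4", 14: "linuxthreads-0.10"}

if __name__ == "__main__":
    mixed = mixed_thread_libraries(reported)
    if mixed:
        print("Mixed thread libraries across data nodes:", mixed)
    else:
        print("All data nodes report the same thread library.")
```

The check compares only the thread library family, not the exact glibc version, since the failure described here was between LinuxThreads and NPTL nodes.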