Bug #18621 | Incorrect NF handling in SUMA if > 2 replicas | | |
---|---|---|---|
Submitted: | 29 Mar 2006 16:48 | Modified: | 19 Nov 2008 12:58 |
Reporter: | Jonathan Miller | Status: | Closed |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S1 (Critical) |
Version: | 5.1.9, 5.1.10 | OS: | Linux (Linux 32 Bit OS) |
Assigned to: | Jonas Oreland | CPU Architecture: | Any |
[29 Mar 2006 16:48]
Jonathan Miller
[29 Mar 2006 16:50]
Jonathan Miller
Note, this same crash happened on all 5 mysqld processes.
[29 Mar 2006 16:51]
Jonathan Miller
$ ~/jmiller/builds/bin/resolve_stack_dump -s /tmp/mysqld.sym -n 07.st
0x81cfe00 handle_segfault + 438
0xc82420 (?)
(nil)
0x8e822a (?)
0x8edbed (?)
0x8eed8d (?)
0x8f0492 (?)
0x841cee2 _Znaj + 38
0x836502b _ZN6VectorI13Gci_containerE9push_backERKS0_ + 83
0x83651b5 _ZN6VectorI13Gci_containerE4fillEjRS0_ + 35
0x8362fab _Z19find_bucket_chainedP6VectorI13Gci_containerEy + 153
0x8365254 _Z11find_bucketP6VectorI13Gci_containerEy + 140
0x83634f1 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 91
0x8341a78 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3806
0x8342152 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 42
0x836854b _ZN17TransporterFacade8for_eachEP12NdbApiSignalP16LinearSectionPtr + 135
0x83695b2 _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 1198
0x83c0074 _ZN19TransporterRegistry6unpackEPjjt7IOState + 652
0x838f410 _ZN19TransporterRegistry14performReceiveEv + 324
0x8367b60 _ZN17TransporterFacade17threadMainReceiveEv + 224
0x8367c11 runReceiveResponse_C + 31
0x83ac020 ndb_thread_wrapper + 104
0x9fdb80 (?)
0x9559ce (?)
[11 Apr 2006 7:15]
Tomas Ulin
Set to showstopper; see also bug #18905.
[13 Apr 2006 11:53]
Tomas Ulin
This is a 4-replica-specific bug without high customer impact; removing the showstopper flag.
[18 Apr 2006 18:18]
Jonathan Miller
Another one, close in time to the other one that I posted.
060418 2:51:46 [ERROR] NDB: CREATE DATABASE atae: error Can't create database 'atae'; database exists 1007 1 1
*** glibc detected *** /home/ndbdev/jmiller/builds/libexec/mysqld: double free or corruption (fasttop): 0xb45503c0 ***
======= Backtrace: =========
/lib/libc.so.6[0x5f3124]
/lib/libc.so.6(__libc_free+0x77)[0x5f365f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZdaPv+0x17)[0x8420fd3]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN17EventBufData_listD1Ev+0x36)[0x836836a]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13Gci_containerD1Ev+0x22)[0x8368642]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN6VectorI13Gci_containerE9push_backERKS0_+0x13a)[0x8368bd2]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN6VectorI13Gci_containerE4fillEjRS0_+0x23)[0x8368c75]
/home/ndbdev/jmiller/builds/libexec/mysqld[0x8366a6b]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z11find_bucketP6VectorI13Gci_containerEy+0x8c)[0x8368d14]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbEventBuffer11insertDataLEP21NdbEventOperationImplPK12SubTableDataP16LinearSectionPtr+0x59)[0x8366bd9]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr+0xfcf)[0x83455bd]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr+0x2a)[0x8345ba6]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z7executePvP12SignalHeaderhPjP16LinearSectionPtr+0xc9)[0x836cc8d]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN19TransporterRegistry6unpackEPjjt7IOState+0x28c)[0x83c40c8]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN19TransporterRegistry14performReceiveEv+0x144)[0x839307c]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN17TransporterFacade13external_pollEj+0x70)[0x836b74e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN9PollGuard14wait_for_inputEi+0xeb)[0x836db17]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb25waitCompletedTransactionsEiiP9PollGuard+0x6c)[0x83442fc]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb10poll_transEiiP9PollGuard+0x56)[0x834440e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb11sendPollNdbEiii+0x66)[0x83444d8]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbTransaction14executeNoBlobsENS_8ExecTypeENS_11AbortOptionEi+0x78)[0x8350dc2]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbTransaction7executeENS_8ExecTypeENS_11AbortOptionEi+0x49)[0x8350ea1]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z20execute_no_commit_ieP13ha_ndbclusterP14NdbTransaction+0x1f)[0x833305f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13ha_ndbcluster22read_multi_range_firstEPP18st_key_multi_rangeS1_jbP17st_handler_buffer+0x77f)[0x832df8d]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN18QUICK_RANGE_SELECT8get_nextEv+0x219)[0x8291f07]
/home/ndbdev/jmiller/builds/libexec/mysqld[0x829b452]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z12mysql_updateP3THDP13st_table_listR4ListI4ItemES6_PS4_jP8st_orderm15enum_duplicatesb+0x12c2)[0x8253bcc]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z21mysql_execute_commandP3THD+0x244b)[0x81ebb5f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x11)[0x830712f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x128)[0x8306f52]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13sp_instr_stmt7executeEP3THDPj+0x111)[0x8309665]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN7sp_head7executeEP3THD+0x2a7)[0x8304931]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x49c)[0x83057c8]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z21mysql_execute_commandP3THD+0x628a)[0x81ef99e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z11mysql_parseP3THDPcj+0x217)[0x81f16e1]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x746)[0x81f1f3e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z10do_commandP3THD+0x104)[0x81f3080]
/home/ndbdev/jmiller/builds/libexec/mysqld(handle_one_connection+0x2d5)[0x81f3437]
/lib/libpthread.so.0[0x702b80]
/lib/libc.so.6(__clone+0x5e)[0x65a9ce]
[21 Apr 2006 9:06]
Jonas Oreland
Lowering priority because this is a 4-replica issue. After discussion with Omer/Jeb on IRC, we concluded to keep this as a 4-replica bug report so I can close it when I fix it.
[21 Apr 2006 13:13]
Tomas Ulin
Several memory-corrupting bugs that are the likely cause of this have been fixed; retesting is needed once those are merged into the main tree.
[22 Apr 2006 8:15]
Jonas Oreland
This is not fixed. I'll change the title to reflect the bug better. When a node fails, SUMA sends an incorrect SUB_GCP_COMPLETE_REP, which cannot be handled by the event API.
[7 Jun 2007 19:08]
Stephen Cravey
I'm experiencing issues similar to bug #27665, which is noted as a duplicate of #18621. I have 3 replicas on 5.1.17. Is there a fix for this yet (14 months later)? I was sold on Cluster based heavily on its ability to handle more than 2 replicas. Thank you for your work.
[16 Jun 2008 10:09]
Jon Stephens
NOTE:
1. After discussing this issue with Tomas and others, I've updated the docs to indicate that NoOfReplicas > 2 should not currently be used. (See http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-pa....)
2. NoOfReplicas = 2 is sufficient to provide a reasonable guarantee of high availability.
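For reference, a minimal sketch of where this setting lives, assuming a standard cluster config.ini read by the management server; the surrounding file layout is an assumption and is not taken from this report:

```ini
; config.ini on the management node (illustrative fragment only)
[ndbd default]
; Number of copies of each table fragment kept by the data nodes.
; Values greater than 2 were advised against at the time of this note.
NoOfReplicas=2
```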
[7 Nov 2008 14:45]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from:
http://lists.mysql.com/commits/58187
3057 Jonas Oreland 2008-11-07
ndb - bug#18621 - fix >2 replicas wrt suma/ndbeventoperation
[7 Nov 2008 14:47]
Bugs System
Pushed into 5.1.29-ndb-6.4.0 (revid:jonas@mysql.com-20081107145021-jvba01a2u6uzlhkm) (version source revid:jonas@mysql.com-20081107145021-jvba01a2u6uzlhkm) (pib:5)
[7 Nov 2008 14:48]
Jonas Oreland
Please note: this will not be fixed in versions earlier than 6.4.
[18 Nov 2008 17:48]
Jonas Oreland
As described verbally: no, we still don't have any serious testing on >2 replicas, but at least now we don't have any *known* bugs.
[19 Nov 2008 12:58]
Jon Stephens
Documented in the NDB-6.4.0 changelog as follows: A data node failure when NoOfReplicas was greater than 2 caused all cluster SQL nodes to crash.