Bug #18621 Incorrect NF handling in SUMA if > 2 replicas
Submitted: 29 Mar 2006 16:48 Modified: 19 Nov 2008 12:58
Reporter: Jonathan Miller
Status: Closed
Category: MySQL Cluster: Cluster (NDB) storage engine Severity: S1 (Critical)
Version: 5.1.9, 5.1.10 OS: Linux (Linux 32 Bit OS)
Assigned to: Jonas Oreland
Triage: Triaged: D2 (Serious)

[29 Mar 2006 16:48] Jonathan Miller
Description:
This may be related to http://bugs.mysql.com/bug.php?id=18595
$> perl ./cid_ndb_dd2.pl ndb07 3306 root BLANK
Database Created.
Table Space Created.
DBD::mysql::st execute failed: Can't create table 'TESTER2.t1' (errno: 155) at ./cid_ndb_dd2.pl line 139.
Create Table Error: Can't create table 'TESTER2.t1' (errno: 155) at ./cid_ndb_dd2.pl line 139.
[ndbdev@ndb07 misc]$ *** glibc detected *** /home/ndbdev/jmiller/builds/libexec/mysqld: corrupted double-linked list: 0x09df7de0 ***
======= Backtrace: =========
/lib/libc.so.6[0x8edbed]
/lib/libc.so.6[0x8eed8d]
/lib/libc.so.6(malloc+0x74)[0x8f0492]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Znaj+0x26)[0x841cee2]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN6VectorI13Gci_containerE9push_backERKS0_+0x53)[0x836502b]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN6VectorI13Gci_containerE4fillEjRS0_+0x23)[0x83651b5]
/home/ndbdev/jmiller/builds/libexec/mysqld[0x8362fab]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z11find_bucketP6VectorI13Gci_containerEy+0x8c)[0x8365254]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep+0x5b)[0x83634f1]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr+0xede)[0x8341a78]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr+0x2a)[0x8342152]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN17TransporterFacade8for_eachEP12NdbApiSignalP16LinearSectionPtr+0x87)[0x836854b]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z7executePvP12SignalHeaderhPjP16LinearSectionPtr+0x4ae)[0x83695b2]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN19TransporterRegistry6unpackEPjjt7IOState+0x28c)[0x83c0074]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN19TransporterRegistry14performReceiveEv+0x144)[0x838f410]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN17TransporterFacade17threadMainReceiveEv+0xe0)[0x8367b60]
/home/ndbdev/jmiller/builds/libexec/mysqld(runReceiveResponse_C+0x1f)[0x8367c11]
/home/ndbdev/jmiller/builds/libexec/mysqld[0x83ac020]
/lib/libpthread.so.0[0x9fdb80]
/lib/libc.so.6(__clone+0x5e)[0x9559ce]
======= Memory map: ========
00824000-00829000 r-xp 00000000 03:02 395576     /lib/libcrypt-2.3.5.so
00829000-0082a000 r-xp 00004000 03:02 395576     /lib/libcrypt-2.3.5.so
0082a000-0082b000 rwxp 00005000 03:02 395576     /lib/libcrypt-2.3.5.so
0082b000-00852000 rwxp 0082b000 00:00 0
0086d000-00887000 r-xp 00000000 03:02 391390     /lib/ld-2.3.5.so
00887000-00888000 r-xp 00019000 03:02 391390     /lib/ld-2.3.5.so
00888000-00889000 rwxp 0001a000 03:02 391390     /lib/ld-2.3.5.so
0088b000-009ae000 r-xp 00000000 03:02 391391     /lib/libc-2.3.5.so
009ae000-009b0000 r-xp 00123000 03:02 391391     /lib/libc-2.3.5.so
009b0000-009b2000 rwxp 00125000 03:02 391391     /lib/libc-2.3.5.so
009b2000-009b4000 rwxp 009b2000 00:00 0
009b6000-009b8000 r-xp 00000000 03:02 395565     /lib/libdl-2.3.5.so
009b8000-009b9000 r-xp 00001000 03:02 395565     /lib/libdl-2.3.5.so
009b9000-009ba000 rwxp 00002000 03:02 395565     /lib/libdl-2.3.5.so
009bc000-009df000 r-xp 00000000 03:02 395566     /lib/libm-2.3.5.so
009df000-009e0000 r-xp 00022000 03:02 395566     /lib/libm-2.3.5.so
009e0000-009e1000 rwxp 00023000 03:02 395566     /lib/libm-2.3.5.so
009e3000-009f5000 r-xp 00000000 03:02 1637750    /usr/lib/libz.so.1.2.2.2
009f5000-009f6000 rwxp 00011000 03:02 1637750    /usr/lib/libz.so.1.2.2.2
009f8000-00a06000 r-xp 00000000 03:02 391424     /lib/libpthread-2.3.5.so
00a06000-00a07000 r-xp 0000d000 03:02 391424     /lib/libpthread-2.3.5.so
00a07000-00a08000 rwxp 0000e000 03:02 391424     /lib/libpthread-2.3.5.so
00a08000-00a0a000 rwxp 00a08000 00:00 0
00aa1000-00ab0000 r-xp 00000000 03:02 391362     /lib/libresolv-2.3.5.so
00ab0000-00ab1000 r-xp 0000e000 03:02 391362     /lib/libresolv-2.3.5.so
00ab1000-00ab2000 rwxp 0000f000 03:02 391362     /lib/libresolv-2.3.5.so
00ab2000-00ab4000 rwxp 00ab2000 00:00 0
00c82000-00c83000 r-xp 00c82000 00:00 0
00dd0000-00dd9000 r-xp 00000000 03:02 391348     /lib/libnss_files-2.3.5.so
00dd9000-00dda000 r-xp 00008000 03:02 391348     /lib/libnss_files-2.3.5.so
00dda000-00ddb000 rwxp 00009000 03:02 391348     /lib/libnss_files-2.3.5.so
00de8000-00df1000 r-xp 00000000 03:02 395567     /lib/libgcc_s-4.0.1-20050727.so.1
00df1000-00df2000 rwxp 00009000 03:02 395567     /lib/libgcc_s-4.0.1-20050727.so.1
00fc2000-00fc6000 r-xp 00000000 03:02 391345     /lib/libnss_dns-2.3.5.so
00fc6000-00fc7000 r-xp 00003000 03:02 391345     /lib/libnss_dns-2.3.5.so
00fc7000-00fc8000 rwxp 00004000 03:02 391345     /lib/libnss_dns-2.3.5.so
04d5b000-04d6d000 r-xp 00000000 03:02 391420     /lib/libnsl-2.3.5.so
04d6d000-04d6e000 r-xp 00011000 03:02 391420     /lib/libnsl-2.3.5.so
04d6e000-04d6f000 rwxp 00012000 03:02 391420     /lib/libnsl-2.3.5.so
04d6f000-04d71000 rwxp 04d6f000 00:00 0
08048000-084e7000 r-xp 00000000 03:02 1436757    /home/ndbdev/jmiller/builds/libexec/mysqld
084e7000-08532000 rw-p 0049e000 03:02 1436757    /home/ndbdev/jmiller/builds/libexec/mysqld
08532000-0853b000 rw-p 08532000 00:00 0
09c96000-09e34000 rw-p 09c96000 00:00 0          [heap]
b4d00000-b4d21000 rw-p b4d00000 00:00 0
b4d21000-b4e00000 ---p b4d21000 00:00 0
b4e0d000-b4e40000 rw-p b4e0d000 00:00 0
b4e40000-b4e41000 ---p b4e40000 00:00 0
b4e41000-b4e71000 rw-p b4e41000 00:00 0
b4e72000-b4e73000 ---p b4e72000 00:00 0
b4e73000-b4ed6000 rw-p b4e73000 00:00 0
b4ed6000-b4ed7000 ---p b4ed6000 00:00 0
b4ed7000-b5708000 rw-p b4ed7000 00:00 0
b5708000-b5709000 ---p b5708000 00:00 0
b5709000-b573a000 rw-p b5709000 00:00 0
b573a000-b573b000 ---p b573a000 00:00 0
b573b000-b576b000 rw-p b573b000 00:00 0
b576b000-b576c000 ---p b576b000 00:00 0
b576c000-b5773000 rw-p b576c000 00:00 0
b5773000-b5774000 ---p b5773000 00:00 0
b5774000-b577b000 rw-p b5774000 00:00 0
b577b000-b577c000 ---p b577b000 00:00 0
b577c000-b5783000 rw-p b577c000 00:00 0
b5783000-b5784000 ---p b5783000 00:00 0
b5784000-b578b000 rw-p b5784000 00:00 0
b578b000-b578c000 ---p b578b000 00:00 0
b578c000-b7f9a000 rw-p b578c000 00:00 0
b7fa5000-b7fa6000 r--p 00000000 03:07 31719486   /space/var/mysql/general_log.CSV
bf891000-bf8a6000 rw-p bf891000 00:00 0          [stack]

How to repeat:
Not sure, produced from stress testing.

Note: No other tests were running during this test.
[29 Mar 2006 16:50] Jonathan Miller
Note, this same crash happened on all 5 mysqld processes.
[29 Mar 2006 16:51] Jonathan Miller
$ ~/jmiller/builds/bin/resolve_stack_dump -s /tmp/mysqld.sym -n 07.st
0x81cfe00 handle_segfault + 438
0xc82420 (?)
(nil)
0x8e822a (?)
0x8edbed (?)
0x8eed8d (?)
0x8f0492 (?)
0x841cee2 _Znaj + 38
0x836502b _ZN6VectorI13Gci_containerE9push_backERKS0_ + 83
0x83651b5 _ZN6VectorI13Gci_containerE4fillEjRS0_ + 35
0x8362fab _Z19find_bucket_chainedP6VectorI13Gci_containerEy + 153
0x8365254 _Z11find_bucketP6VectorI13Gci_containerEy + 140
0x83634f1 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 91
0x8341a78 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3806
0x8342152 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 42
0x836854b _ZN17TransporterFacade8for_eachEP12NdbApiSignalP16LinearSectionPtr + 135
0x83695b2 _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 1198
0x83c0074 _ZN19TransporterRegistry6unpackEPjjt7IOState + 652
0x838f410 _ZN19TransporterRegistry14performReceiveEv + 324
0x8367b60 _ZN17TransporterFacade17threadMainReceiveEv + 224
0x8367c11 runReceiveResponse_C + 31
0x83ac020 ndb_thread_wrapper + 104
0x9fdb80 (?)
0x9559ce (?)
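
The symbol names in the traces above are mangled C++ names. As a reading aid, here is a minimal sketch of demangling them programmatically. It assumes a GCC- or Clang-compatible toolchain that provides `abi::__cxa_demangle` in `<cxxabi.h>`; the helper name `demangle` is hypothetical and is not part of MySQL or resolve_stack_dump.

```cpp
#include <cxxabi.h>   // abi::__cxa_demangle (GCC/Clang, Itanium C++ ABI)
#include <cstdlib>    // std::free
#include <string>

// Hypothetical helper: returns the readable form of an Itanium-ABI mangled
// symbol, or the input unchanged if it cannot be demangled.
std::string demangle(const char* mangled) {
    int status = 0;
    char* readable = abi::__cxa_demangle(mangled, nullptr, nullptr, &status);
    if (status != 0 || readable == nullptr) {
        return mangled;
    }
    std::string result(readable);
    std::free(readable);  // the buffer is malloc()-allocated by __cxa_demangle
    return result;
}
```

With this helper, `_ZN6VectorI13Gci_containerE9push_backERKS0_` reads as `Vector<Gci_container>::push_back(Gci_container const&)` and `_ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep` as `NdbEventBuffer::execSUB_GCP_COMPLETE_REP(SubGcpCompleteRep const*)`. In other words, the first trace shows the allocator detecting heap corruption while the event buffer grows its vector of `Gci_container` buckets in response to a SUB_GCP_COMPLETE_REP signal.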
[11 Apr 2006 7:15] Tomas Ulin
set to showstopper

see also 18905
[13 Apr 2006 11:53] Tomas Ulin
this is a 4-replica specific bug, not high customer impact, removing show stopper flag
[18 Apr 2006 18:18] Jonathan Miller
Another one, close in time to the other one that I posted.
060418  2:51:46 [ERROR] NDB: CREATE DATABASE atae: error Can't create database 'atae'; database exists 1007 1 1
*** glibc detected *** /home/ndbdev/jmiller/builds/libexec/mysqld: double free or corruption (fasttop): 0xb45503c0 ***
======= Backtrace: =========
/lib/libc.so.6[0x5f3124]
/lib/libc.so.6(__libc_free+0x77)[0x5f365f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZdaPv+0x17)[0x8420fd3]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN17EventBufData_listD1Ev+0x36)[0x836836a]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13Gci_containerD1Ev+0x22)[0x8368642]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN6VectorI13Gci_containerE9push_backERKS0_+0x13a)[0x8368bd2]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN6VectorI13Gci_containerE4fillEjRS0_+0x23)[0x8368c75]
/home/ndbdev/jmiller/builds/libexec/mysqld[0x8366a6b]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z11find_bucketP6VectorI13Gci_containerEy+0x8c)[0x8368d14]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbEventBuffer11insertDataLEP21NdbEventOperationImplPK12SubTableDataP16LinearSectionPtr+0x59)[0x8366bd9]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr+0xfcf)[0x83455bd]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr+0x2a)[0x8345ba6]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z7executePvP12SignalHeaderhPjP16LinearSectionPtr+0xc9)[0x836cc8d]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN19TransporterRegistry6unpackEPjjt7IOState+0x28c)[0x83c40c8]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN19TransporterRegistry14performReceiveEv+0x144)[0x839307c]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN17TransporterFacade13external_pollEj+0x70)[0x836b74e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN9PollGuard14wait_for_inputEi+0xeb)[0x836db17]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb25waitCompletedTransactionsEiiP9PollGuard+0x6c)[0x83442fc]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb10poll_transEiiP9PollGuard+0x56)[0x834440e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN3Ndb11sendPollNdbEiii+0x66)[0x83444d8]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbTransaction14executeNoBlobsENS_8ExecTypeENS_11AbortOptionEi+0x78)[0x8350dc2]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN14NdbTransaction7executeENS_8ExecTypeENS_11AbortOptionEi+0x49)[0x8350ea1]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z20execute_no_commit_ieP13ha_ndbclusterP14NdbTransaction+0x1f)[0x833305f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13ha_ndbcluster22read_multi_range_firstEPP18st_key_multi_rangeS1_jbP17st_handler_buffer+0x77f)[0x832df8d]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN18QUICK_RANGE_SELECT8get_nextEv+0x219)[0x8291f07]
/home/ndbdev/jmiller/builds/libexec/mysqld[0x829b452]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z12mysql_updateP3THDP13st_table_listR4ListI4ItemES6_PS4_jP8st_orderm15enum_duplicatesb+0x12c2)[0x8253bcc]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z21mysql_execute_commandP3THD+0x244b)[0x81ebb5f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13sp_instr_stmt9exec_coreEP3THDPj+0x11)[0x830712f]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13sp_lex_keeper23reset_lex_and_exec_coreEP3THDPjbP8sp_instr+0x128)[0x8306f52]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN13sp_instr_stmt7executeEP3THDPj+0x111)[0x8309665]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN7sp_head7executeEP3THD+0x2a7)[0x8304931]
/home/ndbdev/jmiller/builds/libexec/mysqld(_ZN7sp_head17execute_procedureEP3THDP4ListI4ItemE+0x49c)[0x83057c8]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z21mysql_execute_commandP3THD+0x628a)[0x81ef99e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z11mysql_parseP3THDPcj+0x217)[0x81f16e1]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z16dispatch_command19enum_server_commandP3THDPcj+0x746)[0x81f1f3e]
/home/ndbdev/jmiller/builds/libexec/mysqld(_Z10do_commandP3THD+0x104)[0x81f3080]
/home/ndbdev/jmiller/builds/libexec/mysqld(handle_one_connection+0x2d5)[0x81f3437]
/lib/libpthread.so.0[0x702b80]
/lib/libc.so.6(__clone+0x5e)[0x65a9ce]
[21 Apr 2006 9:06] Jonas Oreland
Lowering priority, since this is specific to 4 replicas.
After discussion with Omer/Jeb on IRC, we concluded to keep this as a 4-replica bug report
  so I can close it when I fix it.
[21 Apr 2006 13:13] Tomas Ulin
Several memory-corrupting bugs that are the likely cause of this have been fixed; retesting is needed once those are merged into the main tree.
[22 Apr 2006 8:15] Jonas Oreland
This is not fixed.
I'll change the title to reflect the bug better.
When a node fails, Suma sends an incorrect SUB_GCP_COMPLETE_REP;
  this cannot be handled by the event API.
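
The comment above does not say in what way the signal is incorrect. Purely as an illustration of why a consumer that counts completion reports per epoch is sensitive to bad reports, here is a toy model. It assumes, for the sake of the example only, that the bad signal shows up as a surplus report for an epoch that is already complete. `EpochTracker`, `ReportResult`, and `on_complete_report` are hypothetical names; this is not the NDB event API code.

```cpp
#include <cstdint>
#include <map>
#include <set>

// Hypothetical, simplified model; none of these names come from NDB.
enum class ReportResult { Pending, EpochComplete, Unexpected };

class EpochTracker {
public:
    // reports_per_epoch: how many completion reports the consumer expects
    // before it treats an epoch as complete (an assumption of this model).
    explicit EpochTracker(unsigned reports_per_epoch)
        : reports_per_epoch_(reports_per_epoch) {}

    ReportResult on_complete_report(std::uint64_t epoch) {
        // A report for an already completed epoch is rejected explicitly
        // instead of being allowed to underflow a counter.
        if (reports_per_epoch_ == 0 || finished_.count(epoch) != 0) {
            return ReportResult::Unexpected;
        }
        auto it = outstanding_.find(epoch);
        if (it == outstanding_.end()) {
            it = outstanding_.emplace(epoch, reports_per_epoch_).first;
        }
        if (--it->second > 0) {
            return ReportResult::Pending;
        }
        outstanding_.erase(it);
        finished_.insert(epoch);
        return ReportResult::EpochComplete;
    }

private:
    unsigned reports_per_epoch_;
    std::map<std::uint64_t, unsigned> outstanding_;  // epoch -> reports still expected
    std::set<std::uint64_t> finished_;               // epochs already completed
};
```

In this model, a tracker built with `EpochTracker(2)` returns `Pending` and then `EpochComplete` for the first two reports on an epoch, and `Unexpected` for a third. A consumer without such a check would be left with inconsistent bookkeeping for that epoch.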
[7 Jun 2007 19:08] Stephen Cravey
I'm experiencing issues similar to 27665, which is noted as a dup of 18621. I have 3 replicas on 5.1.17. Is there a fix for this yet (14 months later)? I was sold on cluster based heavily on its ability to handle more than 2 replicas.

Thank you for your work.
[16 Jun 2008 10:09] Jon Stephens
NOTE: 

1. After discussing this issue with Tomas and others, I've updated the docs to indicate that NoOfReplicas > 2 should not currently be used. (See http://dev.mysql.com/doc/refman/5.1/en/mysql-cluster-ndbd-definition.html#mysql-cluster-pa....)

2. NoOfReplicas = 2 is sufficient to provide a reasonable guarantee of high availability.
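
For reference, the recommendation above corresponds to a data node setting along these lines. This is a sketch of a `config.ini` fragment; only the `NoOfReplicas` parameter comes from the discussion, and the rest of a real cluster configuration is omitted.

```ini
# config.ini (management server), fragment only
[ndbd default]
# Keep two replicas per node group; values above 2 are affected by this bug.
NoOfReplicas=2
```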
[7 Nov 2008 14:45] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/58187

3057 Jonas Oreland	2008-11-07
      ndb - bug#18621 - fix >2 replicas wrt suma/ndbeventoperation
[7 Nov 2008 14:47] Bugs System
Pushed into 5.1.29-ndb-6.4.0  (revid:jonas@mysql.com-20081107145021-jvba01a2u6uzlhkm) (version source revid:jonas@mysql.com-20081107145021-jvba01a2u6uzlhkm) (pib:5)
[7 Nov 2008 14:48] Jonas Oreland
Please note, this will not be fixed in versions earlier than 6.4.
[18 Nov 2008 17:48] Jonas Oreland
as described verbally: no,
we still don't have any serious testing on >2 replicas,
but at least now we don't have any *known* bugs
[19 Nov 2008 12:58] Jon Stephens
Documented in the NDB-6.4.0 changelog as follows:

        A data node failure when NoOfReplicas was greater than 2 caused all
        cluster SQL nodes to crash.