Bug #22578 cluster crash on simple query/spontaneously
Submitted: 22 Sep 2006 3:39 Modified: 26 Oct 2006 16:43
Reporter: Matt Wlazlo
Status: No Feedback
Category:MySQL Cluster: Cluster (NDB) storage engine Severity:S3 (Non-critical)
Version:5.1.11 OS:Linux (Linux 2.6.17-xenU)
Assigned to: CPU Architecture:Any
Tags: cluster

[22 Sep 2006 3:39] Matt Wlazlo
Description:
Hi,

I've made a cluster using the beta mysql from tarball mysql-5.1.11-beta-linux-i686-glibc23.tar.gz.

I'm getting crashes that seem almost spontaneous. Sometimes it seems stable, but a simple query such as select * from t; seems to set it off. Mostly, though, it crashes on its own :-)

mysqld is crashing on both of the SQL nodes. As suggested, I've resolved the stack traces (note that it seems to be crashing in a few different places; not sure).

--- stack1 ---
0x81d02a8 handle_segfault + 356
0xa24420 (?)
(nil)
0x84302a8 _ZN14NdbEventBuffer25complete_outof_order_gcisEv + 0
0x841d3a4 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3464
0x841c231 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
0x847203a _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 982
0x847ad4c _ZN19TransporterRegistry6unpackEPjjt7IOState + 956
0x847959b _ZN19TransporterRegistry14performReceiveEv + 447
0x8472575 _ZN17TransporterFacade17threadMainReceiveEv + 269
0x847245f runReceiveResponse_C + 27
0x8462584 ndb_thread_wrapper + 76
0xa02341 (?)
0x8284ee (?)
--- /stack1 ---

--- stack2 ---
0x81da8fe handle_segfault + 368
0x812420 (?)
(nil)
0x846ab82 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 0
0x846ac38 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 182
0x84566a5 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 2713
0x84557fd _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
0x84bb4cb _ZN17TransporterFacade8for_eachEP12NdbApiSignalP16LinearSectionPtr + 169
0x84ba5fd _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 1057
0x8495881 _ZN19TransporterRegistry6unpackEPjjt7IOState + 481
0x8493d2e _ZN19TransporterRegistry14performReceiveEv + 194
0x84baafc _ZN17TransporterFacade17threadMainReceiveEv + 180
0x84baa3d runReceiveResponse_C + 27
0x84acde5 ndb_thread_wrapper + 130
0x764341 (?)
0x3414ee (?)
--- /stack2 ---

My mgm configuration is:
[NDBD DEFAULT]
NoOfReplicas=3
DataMemory=600MB
IndexMemory=200MB
MaxNoOfOrderedIndexes=250
MaxNoOfUniqueHashIndexes=250
MaxNoOfAttributes=4000

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]

[TCP DEFAULT]

# Management Server
[NDB_MGMD]
Id=1
HostName=10.30.3.89             # Hostname or IP address of MGM node

[NDB_MGMD]
Id=2
HostName=10.30.3.90             # Hostname or IP address of MGM node

# Storage Engines
[NDBD]
Id=3
HostName=10.30.3.93             # the IP of the FIRST SERVER
DataDir=/var/lib/mysql-cluster

[NDBD]
Id=4
HostName=10.30.3.94             # the IP of the SECOND SERVER
DataDir=/var/lib/mysql-cluster

# 2 MySQL Clients
[MYSQLD]
Id=5
HostName=10.30.3.93

[MYSQLD]
Id=6
HostName=10.30.3.94

[NDBD]
HostName=10.30.3.95
DataDir=/var/lib/mysql-cluster
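
One way to double-check what a configuration like the one above actually declares is to count its sections mechanically. The following is only a minimal sketch: the helper names (`parse_sections`, `summarize`) are made up for illustration, and the sample text is an abbreviated copy of the configuration posted here. It keeps repeated `[NDBD]` sections separate and reports the number of data nodes, the `NoOfReplicas` value, and any node section without an explicit `Id`.

```python
from collections import Counter

def parse_sections(text):
    """Return a list of (section_name, {key: value}) in file order.

    Repeated section names such as [NDBD] are kept as separate entries.
    """
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].strip().upper(), {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections

def summarize(text):
    """Count data nodes and list node sections that carry no Id."""
    sections = parse_sections(text)
    counts = Counter(name for name, _ in sections)
    replicas = None
    for name, opts in sections:
        if name == "NDBD DEFAULT" and "NoOfReplicas" in opts:
            replicas = int(opts["NoOfReplicas"])
    node_kinds = ("NDB_MGMD", "NDBD", "MYSQLD")
    without_id = [opts.get("HostName") for name, opts in sections
                  if name in node_kinds and "Id" not in opts]
    return {"data_nodes": counts["NDBD"],
            "replicas": replicas,
            "nodes_without_id": without_id}

# Abbreviated copy of the configuration from this report.
CONFIG = """
[NDBD DEFAULT]
NoOfReplicas=3

[NDB_MGMD]
Id=1
HostName=10.30.3.89             # Hostname or IP address of MGM node

[NDB_MGMD]
Id=2
HostName=10.30.3.90

[NDBD]
Id=3
HostName=10.30.3.93

[NDBD]
Id=4
HostName=10.30.3.94

[MYSQLD]
Id=5
HostName=10.30.3.93

[MYSQLD]
Id=6
HostName=10.30.3.94

[NDBD]
HostName=10.30.3.95
"""

summary = summarize(CONFIG)
print(summary)
```

For the configuration in this report, the summary shows three data nodes, `NoOfReplicas=3`, and one `[NDBD]` section (10.30.3.95) with no explicit `Id`.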

ndbd's log contains a lot of:
--- ndbd1 ---
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 2
sent SUBSCRIBE(11) to node 6, req_nodeid: 6  senderData: 36
sent SUBSCRIBE(11) to node 5, req_nodeid: 6  senderData: 36
sent SUBSCRIBE(11) to node 6, req_nodeid: 5  senderData: 36
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 1
sent UNSUBSCRIBE(12) to node 5, req_nodeid: 6  senderData: 36
sent UNSUBSCRIBE(12) to node 6, req_nodeid: 5  senderData: 36
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 1
sent SUBSCRIBE(11) to node 6, req_nodeid: 6  senderData: 36
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 2
sent SUBSCRIBE(11) to node 5, req_nodeid: 5  senderData: 36
--- /ndbd1 ---

Interestingly, this only seems to happen on node 3.

A snippet of the log from mysqld is attached. No interaction with mysqld was required to generate it.

Ideas anyone?
I'm a bit of a MySQL newbie, so apologies in advance if I've missed something.

Cheers,
Matt.

How to repeat:
The above configuration seems to break things. I've not done anything special other than create a table in test:

create table t(i int) ENGINE=NDBCLUSTER;

Suggested fix:
No idea where to start looking as yet.
[22 Sep 2006 3:39] Matt Wlazlo
Mysqld log

Attachment: mysqld.log (text/x-log), 4.62 KiB.

[22 Sep 2006 11:04] Hartmut Holzgraefe
c++filt-resolved stack traces

Stack #1:

(nil)
0x84302a8 NdbEventBuffer::complete_outof_order_gcis() + 0
0x841d3a4 Ndb::handleReceivedSignal(NdbApiSignal*, LinearSectionPtr*) + 3464
0x841c231 Ndb::executeMessage(void*, NdbApiSignal*, LinearSectionPtr*) + 33
0x847203a execute(void*, SignalHeader*, unsigned char, unsigned int*, LinearSectionPtr*) + 982
0x847ad4c TransporterRegistry::unpack(unsigned int*, unsigned int, unsigned short, IOState) + 956
0x847959b TransporterRegistry::performReceive() + 447
0x8472575 TransporterFacade::threadMainReceive() + 269
0x847245f runReceiveResponse_C + 27
0x8462584 ndb_thread_wrapper + 76
0xa02341 (?)

Stack #2:

(nil)
0x846ab82 NdbEventBuffer::execSUB_GCP_COMPLETE_REP(SubGcpCompleteRep const*) + 0
0x846ac38 NdbEventBuffer::execSUB_GCP_COMPLETE_REP(SubGcpCompleteRep const*) + 182
0x84566a5 Ndb::handleReceivedSignal(NdbApiSignal*, LinearSectionPtr*) + 2713
0x84557fd Ndb::executeMessage(void*, NdbApiSignal*, LinearSectionPtr*) + 33
0x84bb4cb TransporterFacade::for_each(NdbApiSignal*, LinearSectionPtr*) + 169
0x84ba5fd execute(void*, SignalHeader*, unsigned char, unsigned int*, LinearSectionPtr*) + 1057
0x8495881 TransporterRegistry::unpack(unsigned int*, unsigned int, unsigned short, IOState) + 481
0x8493d2e TransporterRegistry::performReceive() + 194
0x84baafc TransporterFacade::threadMainReceive() + 180
0x84baa3d runReceiveResponse_C + 27
0x84acde5 ndb_thread_wrapper + 130
0x764341 (?)
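
Both traces run through the same receive-thread path before ending up in NdbEventBuffer. A rough sketch of how that comparison could be automated is below; the function names (`parse_frames`, `common_symbols`) are illustrative only, and the sample input is a few mangled frames copied from the original report. Frames that were not resolved, such as `0xa02341 (?)`, are skipped.

```python
import re

# One resolved frame looks like: "<address> <symbol> + <offset>"
FRAME = re.compile(r"^(0x[0-9a-fA-F]+)\s+(.+?)\s+\+\s+(\d+)$")

def parse_frames(trace):
    """Return (address, symbol, offset) tuples for every resolved frame."""
    frames = []
    for line in trace.splitlines():
        m = FRAME.match(line.strip())
        if m:
            frames.append((int(m.group(1), 16), m.group(2), int(m.group(3))))
    return frames

def common_symbols(a, b):
    """Symbols that occur in both traces, in the order they appear in a."""
    names_b = {name for _, name, _ in b}
    return [name for _, name, _ in a if name in names_b]

STACK1 = """
0x84302a8 _ZN14NdbEventBuffer25complete_outof_order_gcisEv + 0
0x841d3a4 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3464
0x841c231 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
0xa02341 (?)
"""

STACK2 = """
0x846ab82 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 0
0x84566a5 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 2713
0x84557fd _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
"""

frames1 = parse_frames(STACK1)
frames2 = parse_frames(STACK2)
shared = common_symbols(frames1, frames2)
for name in shared:
    print(name)
```

The same regular expression also matches the demangled form shown above, since it only relies on the trailing `+ <offset>`.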

Error log snippets:

INVALID SUB_GCP_COMPLETE_REP
gci: 4194
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;

INVALID SUB_GCP_COMPLETE_REP
gci: 4209
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;
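
To collect these records from a longer mysqld error log, something along the following lines could be used. This is a sketch under assumptions: the record layout is inferred solely from the two snippets above, `parse_reports` is a hypothetical helper name, and the meaning of the individual fields is not stated anywhere in this report, so values are kept exactly as printed.

```python
import re

FIELD = re.compile(r"^([a-z ]+):\s*(\S+)$")

def parse_reports(log_text):
    """Extract the key/value lines that follow each INVALID SUB_GCP_COMPLETE_REP."""
    reports = []
    current = None
    for line in log_text.splitlines():
        line = line.strip()
        if line == "INVALID SUB_GCP_COMPLETE_REP":
            current = {}
            reports.append(current)
        elif current is not None:
            m = FIELD.match(line)
            if m:
                # Keep the value as printed; the log does not say which
                # base the sender field uses.
                current[m.group(1)] = m.group(2)
            else:
                current = None  # any other line ends the record
    return reports

LOG = """
INVALID SUB_GCP_COMPLETE_REP
gci: 4194
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;

INVALID SUB_GCP_COMPLETE_REP
gci: 4209
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;
"""

reports = parse_reports(LOG)
for r in reports:
    print(r["gci"], r["count"], r["bucket count"], r["nodes"])
```

In both snippets from this report, `count` is 3 while `bucket count` is 2.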
[26 Sep 2006 16:43] Jonas Oreland
This is most certainly replication related.
  (but replication infrastructure is also used for distributing
   ddl among mysqld)

If you can turn off replication, this will most likely go away.

Otherwise, I think this has been fixed in upcoming 5.1.12...

Questions:
* In what order do you start the cluster and mysqld?
* What does "Linux 2.6.17-xenU" mean? Are you running under Xen?
  (we have to my knowledge never tried this...)

/Jonas
[26 Oct 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".