MySQL Bugs: #22578: cluster crash on simple query/spontaneously

Bug #22578	cluster crash on simple query/spontaneously
Submitted:	22 Sep 2006 3:39	Modified:	26 Oct 2006 16:43
Reporter:	Matt Wlazlo
Status:	No Feedback
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S3 (Non-critical)
Version:	5.1.11	OS:	Linux (Linux 2.6.17-xenU)
Assigned to:		CPU Architecture:	Any
Tags:	cluster

Description:
Hi,

I've set up a cluster using the beta MySQL from the tarball mysql-5.1.11-beta-linux-i686-glibc23.tar.gz.

I'm getting crashes that seem almost spontaneous. Sometimes it seems stable, but a simple query such as select * from t; seems to set it off. Mostly, though, it crashes on its own :-)

mysqld is crashing on both of the SQL nodes. As suggested, I've resolved the stack trace (note that it seems to be crashing in a few different places, though I'm not sure).

--- stack1 ---
0x81d02a8 handle_segfault + 356
0xa24420 (?)
(nil)
0x84302a8 _ZN14NdbEventBuffer25complete_outof_order_gcisEv + 0
0x841d3a4 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3464
0x841c231 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
0x847203a _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 982
0x847ad4c _ZN19TransporterRegistry6unpackEPjjt7IOState + 956
0x847959b _ZN19TransporterRegistry14performReceiveEv + 447
0x8472575 _ZN17TransporterFacade17threadMainReceiveEv + 269
0x847245f runReceiveResponse_C + 27
0x8462584 ndb_thread_wrapper + 76
0xa02341 (?)
0x8284ee (?)
--- /stack1 ---

--- stack2 ---
0x81da8fe handle_segfault + 368
0x812420 (?)
(nil)
0x846ab82 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 0
0x846ac38 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 182
0x84566a5 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 2713
0x84557fd _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
0x84bb4cb _ZN17TransporterFacade8for_eachEP12NdbApiSignalP16LinearSectionPtr + 169
0x84ba5fd _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 1057
0x8495881 _ZN19TransporterRegistry6unpackEPjjt7IOState + 481
0x8493d2e _ZN19TransporterRegistry14performReceiveEv + 194
0x84baafc _ZN17TransporterFacade17threadMainReceiveEv + 180
0x84baa3d runReceiveResponse_C + 27
0x84acde5 ndb_thread_wrapper + 130
0x764341 (?)
0x3414ee (?)
--- /stack2 ---
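
To check whether the two traces really crash "in a few different places", it can help to compare them frame by frame. The following is only a sketch: it assumes frame lines of the form `<address> <symbol> + <offset>` as shown above, uses just the top few frames of each trace as sample data, and the helper names `parse_frames` and `shared_symbols` are made up for illustration.

```python
import re

# A resolved frame looks like "<address> <symbol> + <offset>".
# "(nil)" and unresolved "(?)" lines do not match and are skipped.
FRAME_RE = re.compile(r"^(0x[0-9a-f]+)\s+(\S+)\s+\+\s+(\d+)$")

# Top frames of the two traces from this report (abbreviated).
STACK1 = """
0x81d02a8 handle_segfault + 356
0xa24420 (?)
(nil)
0x84302a8 _ZN14NdbEventBuffer25complete_outof_order_gcisEv + 0
0x841d3a4 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3464
0x841c231 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
"""

STACK2 = """
0x81da8fe handle_segfault + 368
0x812420 (?)
(nil)
0x846ab82 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 0
0x846ac38 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 182
0x84566a5 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 2713
0x84557fd _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
"""


def parse_frames(text):
    """Return a list of (address, symbol, offset) for every resolved frame."""
    frames = []
    for line in text.strip().splitlines():
        match = FRAME_RE.match(line.strip())
        if match:
            address, symbol, offset = match.groups()
            frames.append((int(address, 16), symbol, int(offset)))
    return frames


def shared_symbols(first, second):
    """Symbols present in both stacks, in the order they appear in the first."""
    in_second = {symbol for _, symbol, _ in second}
    shared = []
    for _, symbol, _ in first:
        if symbol in in_second and symbol not in shared:
            shared.append(symbol)
    return shared


for symbol in shared_symbols(parse_frames(STACK1), parse_frames(STACK2)):
    print(symbol)
```

On this sample, both traces pass through `Ndb::handleReceivedSignal` and `Ndb::executeMessage` and differ only in which `NdbEventBuffer` method is on top, so the two crashes may well share a cause.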

My mgm configuration is:
[NDBD DEFAULT]
NoOfReplicas=3
DataMemory=600MB
IndexMemory=200MB
MaxNoOfOrderedIndexes=250
MaxNoOfUniqueHashIndexes=250
MaxNoOfAttributes=4000

[MYSQLD DEFAULT]

[NDB_MGMD DEFAULT]

[TCP DEFAULT]

# Management Server
[NDB_MGMD]
Id=1
HostName=10.30.3.89             # Hostname or IP address of MGM node

[NDB_MGMD]
Id=2
HostName=10.30.3.90             # Hostname or IP address of MGM node

# Storage Engines
[NDBD]
Id=3
HostName=10.30.3.93             # the IP of the FIRST SERVER
DataDir=/var/lib/mysql-cluster

[NDBD]
Id=4
HostName=10.30.3.94             # the IP of the SECOND SERVER
DataDir=/var/lib/mysql-cluster

# 2 MySQL Clients
[MYSQLD]
Id=5
HostName=10.30.3.93

[MYSQLD]
Id=6
HostName=10.30.3.94

[NDBD]
HostName=10.30.3.95
DataDir=/var/lib/mysql-cluster
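
A quick sanity check of a configuration like the one above can be scripted. This is a rough sketch, not an official tool: `parse_config` and `check_data_nodes` are hypothetical helpers, the sample text is an abbreviated copy of the file above, and the only rule checked is that the number of [NDBD] sections is evenly divisible by NoOfReplicas. It also lists data nodes that have no explicit Id, without judging whether that is a problem.

```python
# Abbreviated copy of the configuration from this report.
CONFIG = """
[NDBD DEFAULT]
NoOfReplicas=3

[NDBD]
Id=3
HostName=10.30.3.93             # the IP of the FIRST SERVER

[NDBD]
Id=4
HostName=10.30.3.94             # the IP of the SECOND SERVER

[MYSQLD]
Id=5
HostName=10.30.3.93

[NDBD]
HostName=10.30.3.95
"""


def parse_config(text):
    """Parse the text into a list of (section_name, options) pairs.

    Section names repeat ([NDBD] appears once per data node), so a list is
    used rather than a dict keyed by section name.
    """
    sections = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        if line.startswith("[") and line.endswith("]"):
            sections.append((line[1:-1].strip().upper(), {}))
        elif "=" in line and sections:
            key, value = line.split("=", 1)
            sections[-1][1][key.strip()] = value.strip()
    return sections


def check_data_nodes(sections):
    """Return (data node count, NoOfReplicas, divisible?, hosts without Id)."""
    replicas = None
    for name, options in sections:
        if name == "NDBD DEFAULT" and "NoOfReplicas" in options:
            replicas = int(options["NoOfReplicas"])
    data_nodes = [options for name, options in sections if name == "NDBD"]
    divisible = replicas is not None and len(data_nodes) % replicas == 0
    without_id = [o.get("HostName") for o in data_nodes if "Id" not in o]
    return len(data_nodes), replicas, divisible, without_id


print(check_data_nodes(parse_config(CONFIG)))
```

For this file the count works out (three data nodes, three replicas), and the data node on 10.30.3.95 is the only one without an explicit Id.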

ndbd's log contains a lot of:
--- ndbd1 ---
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 2
sent SUBSCRIBE(11) to node 6, req_nodeid: 6  senderData: 36
sent SUBSCRIBE(11) to node 5, req_nodeid: 6  senderData: 36
sent SUBSCRIBE(11) to node 6, req_nodeid: 5  senderData: 36
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 1
sent UNSUBSCRIBE(12) to node 5, req_nodeid: 6  senderData: 36
sent UNSUBSCRIBE(12) to node 6, req_nodeid: 5  senderData: 36
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 1
sent SUBSCRIBE(11) to node 6, req_nodeid: 6  senderData: 36
reportAllSubscribers  subPtr.i: 0  subPtr.p->n_subscribers: 2
sent SUBSCRIBE(11) to node 5, req_nodeid: 5  senderData: 36
--- /ndbd1 ---

Interestingly, this only seems to happen on node 3.

A snippet of the log from mysqld is attached. No interaction with mysqld was required to generate it.

Ideas anyone?
I'm a bit of a MySQL newbie, so apologies in advance if I've missed something.

Cheers,
Matt.

How to repeat:
The above configuration seems to break things. I've not done anything special other than create a table in test:

create table t(i int) ENGINE=NDBCLUSTER;

Suggested fix:
No idea where to start looking as yet.

Mysqld log

Attachment: mysqld.log (text/x-log), 4.62 KiB.

C++filt resolved stack traces

Stack #1:

(nil)
0x84302a8 NdbEventBuffer::complete_outof_order_gcis() + 0
0x841d3a4 Ndb::handleReceivedSignal(NdbApiSignal*, LinearSectionPtr*) + 3464
0x841c231 Ndb::executeMessage(void*, NdbApiSignal*, LinearSectionPtr*) + 33
0x847203a execute(void*, SignalHeader*, unsigned char, unsigned int*, LinearSectionPtr*) + 982
0x847ad4c TransporterRegistry::unpack(unsigned int*, unsigned int, unsigned short, IOState) + 956
0x847959b TransporterRegistry::performReceive() + 447
0x8472575 TransporterFacade::threadMainReceive() + 269
0x847245f runReceiveResponse_C + 27
0x8462584 ndb_thread_wrapper + 76
0xa02341 (?)

Stack #2:

(nil)
0x846ab82 NdbEventBuffer::execSUB_GCP_COMPLETE_REP(SubGcpCompleteRep const*) + 0
0x846ac38 NdbEventBuffer::execSUB_GCP_COMPLETE_REP(SubGcpCompleteRep const*) + 182
0x84566a5 Ndb::handleReceivedSignal(NdbApiSignal*, LinearSectionPtr*) + 2713
0x84557fd Ndb::executeMessage(void*, NdbApiSignal*, LinearSectionPtr*) + 33
0x84bb4cb TransporterFacade::for_each(NdbApiSignal*, LinearSectionPtr*) + 169
0x84ba5fd execute(void*, SignalHeader*, unsigned char, unsigned int*, LinearSectionPtr*) + 1057
0x8495881 TransporterRegistry::unpack(unsigned int*, unsigned int, unsigned short, IOState) + 481
0x8493d2e TransporterRegistry::performReceive() + 194
0x84baafc TransporterFacade::threadMainReceive() + 180
0x84baa3d runReceiveResponse_C + 27
0x84acde5 ndb_thread_wrapper + 130
0x764341 (?)

Error log snippets:

INVALID SUB_GCP_COMPLETE_REP
gci: 4194
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;

INVALID SUB_GCP_COMPLETE_REP
gci: 4209
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;
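
To see what the crash reports have in common, the snippets above can be parsed into records. This is a minimal sketch that assumes each report starts with the line `INVALID SUB_GCP_COMPLETE_REP` followed by `key: value` lines; the function names are illustrative, and the values are kept as plain strings because the report does not say how fields such as `sender` are encoded.

```python
# The two error log snippets from this report.
LOG = """
INVALID SUB_GCP_COMPLETE_REP
gci: 4194
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;

INVALID SUB_GCP_COMPLETE_REP
gci: 4209
sender: 1010004
count: 3
bucket count: 2
nodes: 3
mysqld got signal 6;
"""


def parse_reports(text):
    """Return one dict per INVALID SUB_GCP_COMPLETE_REP block."""
    reports, current = [], None
    for raw in text.splitlines():
        line = raw.strip()
        if line == "INVALID SUB_GCP_COMPLETE_REP":
            current = {}
            reports.append(current)
        elif current is not None and ":" in line:
            key, value = line.split(":", 1)
            current[key.strip()] = value.strip()
        else:
            current = None  # a blank or unrelated line ends the block
    return reports


def constant_fields(reports):
    """Names of the fields whose value is identical in every report."""
    if not reports:
        return []
    first = reports[0]
    return sorted(k for k in first if all(r.get(k) == first[k] for r in reports))


reports = parse_reports(LOG)
print(constant_fields(reports))
```

In both snippets only `gci` changes; `sender`, `count`, `bucket count` and `nodes` are identical, with `count` and `nodes` at 3 against a `bucket count` of 2.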

This is almost certainly replication-related
  (but the replication infrastructure is also used for distributing
   DDL among the mysqld nodes).

If you can turn off replication, this will most likely go away.

Otherwise, I think this has been fixed in the upcoming 5.1.12...

Questions:
* In what order do you start the cluster and mysqld?
* What does "Linux 2.6.17-xenU" mean? Are you running under Xen?
  (To my knowledge, we have never tried this...)

/Jonas

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".