Bug #27665 mysqld crashes when any data node is shut down
Submitted: 5 Apr 2007 13:54
Modified: 10 Apr 2007 21:46
Reporter: Bhupinder Singh
Status: Duplicate
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S1 (Critical)
Version: 5.1.16
OS: Linux (Redhat EL 4)
Assigned to:
CPU Architecture: Any
Tags: mysqldcrash ndbd shutdown

[5 Apr 2007 13:54] Bhupinder Singh
Description:
Hi 

We are trying to set up the following:
Management Node -- 1
SQL Nodes -- 4
Data Nodes -- 4
Replicas -- 4

The mysqld processes on all SQL nodes crash and restart repeatedly if any of the data nodes crashes or is shut down using ndb_mgm.

I have tested the same config with replicas=2 and replicas=1, and both of these work fine.
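
For context, NDB splits the data nodes into node groups of NoOfReplicas nodes each, so the number of node groups is the data node count divided by NoOfReplicas. The sketch below works through that arithmetic for the three configurations mentioned here. The function name is hypothetical, and the code only illustrates the layout; it does not explain the crash.

```python
def node_groups(data_nodes: int, replicas: int) -> int:
    """Return the number of NDB node groups for a given layout.

    Hypothetical helper: the data node count must be an exact
    multiple of NoOfReplicas, otherwise the layout is invalid.
    """
    if replicas < 1 or data_nodes % replicas != 0:
        raise ValueError("data node count must be a multiple of NoOfReplicas")
    return data_nodes // replicas


if __name__ == "__main__":
    # Four data nodes, as in this report, with each replica setting tried.
    for replicas in (1, 2, 4):
        print(f"replicas={replicas}: {node_groups(4, replicas)} node group(s)")
```

With four data nodes, replicas=4 puts every node into a single node group, while replicas=2 gives two groups and replicas=1 gives four.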

Here is the error log:

Number of processes running now: 0
070405 15:14:17  mysqld restarted
070405 15:14:17  InnoDB: Started; log sequence number 0 46409
070405 15:14:24 [Note] Starting MySQL Cluster Binlog Thread
070405 15:14:24 [Note] Recovering after a crash using mysql-bin
070405 15:14:24 [Note] Starting crash recovery...
070405 15:14:24 [Note] Crash recovery finished.
070405 15:14:24 [Note] /home1/app/mysql/5.1.16/bin/mysqld: ready for connections.
Version: '5.1.16-beta-log'  socket: '/tmp/mysql.sock'  port: 3306  MySQL Community Server (GPL)
070405 15:14:24 [Note] SCHEDULER: Loaded 0 events
INVALID SUB_GCP_COMPLETE_REP
gci: 269
sender: 1010003
count: 7
bucket count: 4294967295
nodes: 4
mysqld got signal 6;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help diagnose
the problem, but since we have already crashed, something is definitely wrong
and this may fail.

key_buffer_size=16777216
read_buffer_size=258048
max_used_connections=0
max_connections=151
threads_connected=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_connections = 131746 K
bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

thd: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
Cannot determine thread, fp=0xb7f96bf8, backtrace may not be correct.
Stack range sanity check OK, backtrace follows:
0x81f2770
0xffffe410
0xb7e03ea3
0x8447481
0x8434544
0x84333c9
0x84908fa
0x8499a24
0x8498247
0x8490e35
0x8490d1f
0x8480a6c
0xb7f6d34b
0xb7e9765e
New value of fp=(nil) failed sanity check, terminating stack trace!
Please read http://dev.mysql.com/doc/mysql/en/using-stack-trace.html and follow instructions on how to resolve the stack trace. Resolved
stack trace is much more helpful in diagnosing the problem, so please do
resolve it
The manual page at http://www.mysql.com/doc/en/Crashing.html contains
information that should help you find out what is causing the crash.
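
One detail in the log above stands out: the value in "bucket count: 4294967295" is exactly 2^32 - 1, which is what an unsigned 32-bit integer holds after -1 is stored in it or after 0 is decremented. The sketch below only demonstrates that arithmetic. Whether the NDB API code arrives at the value this way is an assumption, not something the log confirms.

```python
import ctypes

LOGGED_BUCKET_COUNT = 4294967295  # value printed in the error log

# Storing -1 in an unsigned 32-bit integer wraps around to 2**32 - 1.
wrapped = ctypes.c_uint32(-1).value

# Decrementing 0 modulo 2**32 gives the same result.
underflowed = (0 - 1) % 2**32

print(wrapped == LOGGED_BUCKET_COUNT, underflowed == LOGGED_BUCKET_COUNT)
print(hex(LOGGED_BUCKET_COUNT))
```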

=======================================================================
Here is the Stack Trace:

mysql@ORM-1:~/5.1.16/stack> resolve_stack_dump -s mysqld.sym -n mysqld.stack
0x81f2770 handle_segfault + 356
0xffffe410 _end + -141731648
0xb7e03ea3 _end + -1351765165
0x8447481 _ZN14NdbEventBuffer24execSUB_GCP_COMPLETE_REPEPK17SubGcpCompleteRep + 701
0x8434544 _ZN3Ndb20handleReceivedSignalEP12NdbApiSignalP16LinearSectionPtr + 3400
0x84333c9 _ZN3Ndb14executeMessageEPvP12NdbApiSignalP16LinearSectionPtr + 33
0x84908fa _Z7executePvP12SignalHeaderhPjP16LinearSectionPtr + 982
0x8499a24 _ZN19TransporterRegistry6unpackEPjjt7IOState + 956
0x8498247 _ZN19TransporterRegistry14performReceiveEv + 447
0x8490e35 _ZN17TransporterFacade17threadMainReceiveEv + 269
0x8490d1f runReceiveResponse_C + 27
0x8480a6c ndb_thread_wrapper + 76
0xb7f6d34b _end + -1350285317
0xb7e9765e _end + -1351161074

========================================================================

The config files look like this:

Configuring the Storage and SQL Nodes

-- my.cnf for "DATA" node and "SQL" node 

vi /etc/my.cnf

# Options for mysqld process:
[MYSQLD]
ndbcluster                      # run NDB engine
ndb-connectstring=172.16.15.89  # location of MGM node

# Options for ndbd process:
[MYSQL_CLUSTER]
ndb-connectstring=172.16.15.89  # location of MGM node

[client]
port=3306
socket=/tmp/mysql.sock

[mysqld]
port=3306
socket=/tmp/mysql.sock
key_buffer_size=16M
max_allowed_packet=8M

[mysqldump]
quick
-- 

---------------------------------------------------------------

---------------------------------------------------------------

Configuring the Management Node

mkdir /home1/app/mysql/5.1.16/mgmt_data
cd /home1/app/mysql/5.1.16/mgmt_data
vi mgmt_config.ini

# Options affecting ndbd processes on all data nodes:
[NDBD DEFAULT]
NoOfReplicas=2    # Number of replicas
DataMemory=80M    # How much memory to allocate for data storage
IndexMemory=18M   # How much memory to allocate for index storage
                  # For DataMemory and IndexMemory, we have used the
                  # default values. Since the "world" database takes up
                  # only about 500KB, this should be more than enough for
                  # this example Cluster setup.

# TCP/IP options:
[TCP DEFAULT]
portnumber=2202   # This is the default; however, you can use any
                  # port that is free for all the hosts in cluster
                  # Note: It is recommended beginning with MySQL 5.0 that
                  # you do not specify the portnumber at all and simply allow
                  # the default value to be used instead

# Management process options:
[NDB_MGMD]
# Hostname or IP address of MGM node
hostname=172.16.15.89           
# Directory for MGM node logfiles
datadir=/home1/app/mysql/5.1.16/mgmt_data  

# Options for data node "NDB_0":
[NDBD]
                                # (one [NDBD] section per data node)
hostname=172.16.15.70           # Hostname or IP address
# Directory for this data node's datafiles
datadir=/home1/app/mysql/5.1.16/data   

# Options for data node "NDB_1":
[NDBD]
                                # (one [NDBD] section per data node)
hostname=172.16.15.71           # Hostname or IP address
# Directory for this data node's datafiles
datadir=/home1/app/mysql/5.1.16/data   

# Options for data node "NDB_2":
[NDBD]
                                # (one [NDBD] section per data node)
hostname=172.16.15.72           # Hostname or IP address
# Directory for this data node's datafiles
datadir=/home1/app/mysql/5.1.16/data

# Options for data node "NDB_3":
[NDBD]
                                # (one [NDBD] section per data node)
hostname=172.16.15.73           # Hostname or IP address
# Directory for this data node's datafiles
datadir=/home1/app/mysql/5.1.16/data

# SQL node options:
[MYSQLD]
hostname=172.16.15.89           # Hostname or IP address

# SQL node options:
[MYSQLD]
hostname=172.16.15.87           # Hostname or IP address

# SQL node options:
[MYSQLD]
hostname=172.16.15.88           # Hostname or IP address

# SQL node options:
[MYSQLD]
hostname=172.16.15.90           # Hostname or IP address
                                
======================================================================

How to repeat:
1. Start up the management node, followed by the four data nodes, followed by the SQL nodes.
2. Shut down any data node using
      ndb_mgm> <id> stop

    or manually kill the ndbd process using kill -9 <process id>
3. Observe the mysqld processes; they get restarted repeatedly.
[6 Apr 2007 20:33] Bhupinder Singh
Hi 

I mentioned the wrong OS by mistake; it is RHEL 4.
[10 Apr 2007 15:03] Hartmut Holzgraefe
Verified with current 5.1bk; I failed to produce a core file for analysis, though
(for unknown reasons). The demangled stack trace from the original report looks like this:

0x81f2770 handle_segfault + 356
0xffffe410 _end + -141731648
0xb7e03ea3 _end + -1351765165
0x8447481 NdbEventBuffer::execSUB_GCP_COMPLETE_REP(SubGcpCompleteRep const*) + 701
0x8434544 Ndb::handleReceivedSignal(NdbApiSignal*, LinearSectionPtr*) + 3400
0x84333c9 Ndb::executeMessage(void*, NdbApiSignal*, LinearSectionPtr*) + 33
0x84908fa execute(void*, SignalHeader*, unsigned char, unsigned int*, LinearSectionPtr*) + 982
0x8499a24 TransporterRegistry::unpack(unsigned int*, unsigned int, unsigned short, IOState) + 956
0x8498247 TransporterRegistry::performReceive() + 447
0x8490e35 TransporterFacade::threadMainReceive() + 269
0x8490d1f runReceiveResponse_C + 27
0x8480a6c ndb_thread_wrapper + 76
0xb7f6d34b _end + -1350285317
0xb7e9765e _end + -1351161074
[10 Apr 2007 21:46] Tomas Ulin
Duplicate of Bug #18621; the ndbcluster binlog is currently not functional for > 2 replicas.
[10 Apr 2007 21:48] Tomas Ulin
Actually, since mysqld now always uses events, the cluster is not functional at all with > 2 replicas.
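
Given the comments above, a quick check of the management config can flag a NoOfReplicas value above 2 before the cluster is started. This is a minimal sketch using Python's standard configparser module. The function name and the sample text are illustrative, and it assumes the value is set in the [NDBD DEFAULT] section, as in the config shown in the report.

```python
import configparser


def replicas_over_limit(config_text: str, limit: int = 2) -> bool:
    """Return True if NoOfReplicas in [NDBD DEFAULT] exceeds `limit`.

    Hypothetical helper; the limit of 2 reflects the comments in this report.
    A config that does not set NoOfReplicas there is reported as False.
    """
    parser = configparser.ConfigParser(
        strict=False,                    # the management config repeats [NDBD] and [MYSQLD]
        inline_comment_prefixes=("#",),  # allow trailing "# ..." comments
    )
    parser.read_string(config_text)
    # configparser lower-cases option names, so the lookup is case-insensitive.
    replicas = parser.getint("NDBD DEFAULT", "NoOfReplicas", fallback=None)
    return replicas is not None and replicas > limit


SAMPLE = """
[NDBD DEFAULT]
NoOfReplicas=4    # Number of replicas

[NDBD]
hostname=172.16.15.70

[NDBD]
hostname=172.16.15.71
"""

print(replicas_over_limit(SAMPLE))
```

To check a real file, read its contents and pass the text to the function; the strict=False setting is what lets the repeated [NDBD] and [MYSQLD] sections parse without an error.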