Bug #72034  MySQL Cluster: mysqld crashes with segmentation fault when joining cluster
Submitted: 13 Mar 2014 15:22  Modified: 16 Mar 2014 20:03
Reporter: Mario Beck
Status: Verified
Category: MySQL Cluster: Cluster (NDB) storage engine
Severity: S2 (Serious)
Version: 7.3.4
OS: Linux (OL 6)
CPU Architecture: Any

[13 Mar 2014 15:22] Mario Beck
Description:
When setting up MySQL Cluster 7.3.4 on Oracle Linux, the mysqld process crashes when joining the cluster.
If mgmd and ndbd are already running, mysqld crashes immediately after starting.
If mgmd and ndbd are not running, mysqld runs fine; once ndbd is started, mysqld crashes immediately with a segmentation fault.
"How to repeat" includes my.cnf, config-local.ini, and a command log, followed by the output of ndb_1_cluster.log after the crash.

This test was done on a one-node cluster (NoOfReplicas=1), but it happens on a two-node cluster as well.
The install package was mysql-cluster-gpl-7.3.4-linux-glibc2.5-x86_64 (generic Linux tarball).
The datadir was set up with scripts/mysql_install_db first.

How to repeat:
[root@olga64 cluster]# cat my.cnf 
[mysqld]
basedir	= /opt/cluster/
port		= 3306
socket		= /tmp/cluster.sock
skip-external-locking
key_buffer_size = 16K
max_allowed_packet = 1M
table_open_cache = 4
sort_buffer_size = 64K
read_buffer_size = 256K
read_rnd_buffer_size = 256K
net_buffer_length = 2K
thread_stack = 128K
log-error = /var/log/mysqld.error.log
ndbcluster
datadir=/opt/cluster/data
ndb-connectstring=192.168.0.101

[mysql_cluster]
ndb-connectstring=192.168.0.101

[root@olga64 cluster]# cat config-local.ini 
[ndb_mgmd]
hostname=192.168.0.101
NodeId=1

[ndbd default]
diskless=1
noofreplicas=1
DataMemory=2M
IndexMemory=1M
DiskPageBufferMemory=4M
StringMemory=5
MaxNoOfConcurrentOperations=1K
MaxNoOfConcurrentTransactions=500
SharedGlobalMemory=500K
LongMessageBuffer=512K
MaxParallelScansPerFragment=16
MaxNoOfAttributes=100
MaxNoOfTables=20
MaxNoOfOrderedIndexes=20

[ndbd]
hostname=192.168.0.101
datadir=/usr/local/mysql-cluster/
NodeId=3

[mysqld]
NodeId=50

[root@olga64 cluster]# bin/ndb_mgmd -f config-local.ini --initial
MySQL Cluster Management Server mysql-5.6.15 ndb-7.3.4

[root@olga64 cluster]# bin/ndbd --initial
2014-01-26 00:56:35 [ndbd] INFO     -- Angel connected to 'localhost:1186'
2014-01-26 00:56:35 [ndbd] INFO     -- Angel allocated nodeid: 3

[root@olga64 cluster]# bin/ndb_mgm -e show
Connected to Management Server at: localhost:1186
Cluster Configuration
---------------------
[ndbd(NDB)]	1 node(s)
id=3	@127.0.0.1  (mysql-5.6.15 ndb-7.3.4, Nodegroup: 0, *)

[ndb_mgmd(MGM)]	1 node(s)
id=1	@127.0.0.1  (mysql-5.6.15 ndb-7.3.4)

[mysqld(API)]	1 node(s)
id=50 (not connected, accepting connect from any host)

[root@olga64 cluster]# bin/mysqld --defaults-file=my.cnf --user=mysql
2014-01-26 00:57:02 0 [Warning] TIMESTAMP with implicit DEFAULT value is deprecated. Please use --explicit_defaults_for_timestamp server option (see documentation for more details).
2014-01-26 00:57:02 30135 [Note] Plugin 'FEDERATED' is disabled.
2014-01-26 00:57:02 30135 [Note] InnoDB: The InnoDB memory heap is disabled
2014-01-26 00:57:02 30135 [Note] InnoDB: Mutexes and rw_locks use GCC atomic builtins
2014-01-26 00:57:02 30135 [Note] InnoDB: Compressed tables use zlib 1.2.3
2014-01-26 00:57:02 30135 [Note] InnoDB: Using Linux native AIO
2014-01-26 00:57:02 30135 [Note] InnoDB: Not using CPU crc32 instructions
2014-01-26 00:57:02 30135 [Note] InnoDB: Initializing buffer pool, size = 128.0M
2014-01-26 00:57:02 30135 [Note] InnoDB: Completed initialization of buffer pool
2014-01-26 00:57:02 30135 [Note] InnoDB: Highest supported file format is Barracuda.
2014-01-26 00:57:02 30135 [Note] InnoDB: 128 rollback segment(s) are active.
2014-01-26 00:57:02 30135 [Note] InnoDB: Waiting for purge to start
2014-01-26 00:57:02 30135 [Note] InnoDB: 5.6.15 started; log sequence number 1625987
2014-01-26 00:57:02 30135 [Note] NDB: Changed global value of binlog_format from STATEMENT to MIXED
2014-01-26 00:57:02 30135 [Note] NDB: NodeID is 50, management server '192.168.0.101:1186'
2014-01-26 00:57:02 30135 [Note] NDB[0]: NodeID: 50, all storage nodes connected
2014-01-26 00:57:02 30135 [Warning] NDB: server id set to zero - changes logged to bin log with server id zero will be logged with another server id by slave mysqlds
2014-01-26 00:57:02 30135 [Note] Starting Cluster Binlog Thread
2014-01-26 00:57:02 30135 [Warning] No existing UUID has been found, so we assume that this is the first time that this server has been started. Generating a new UUID: 62fc6e22-861c-11e3-b080-080027c51a36.
2014-01-26 00:57:02 30135 [Note] Server hostname (bind-address): '*'; port: 3306
2014-01-26 00:57:02 30135 [Note] IPv6 is available.
2014-01-26 00:57:02 30135 [Note]   - '::' resolves to '::';
2014-01-26 00:57:02 30135 [Note] Server socket created on IP: '::'.
2014-01-26 00:57:02 30135 [Note] Event Scheduler: Loaded 0 events
2014-01-26 00:57:02 30135 [Note] bin/mysqld: ready for connections.
Version: '5.6.15-ndb-7.3.4-cluster-gpl'  socket: '/tmp/cluster.sock'  port: 3306  MySQL Cluster Community Server (GPL)
Segmentation fault
^^^^^^^^
^This happens immediately

[root@olga64 cluster]# cat ndb_1_cluster.log 
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Got initial configuration from 'config-local.ini', will try to set it when all ndb_mgmd(s) started
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Id: 1, Command port: *:1186
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Node 1: Node 1 Connected
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- MySQL Cluster Management Server mysql-5.6.15 ndb-7.3.4 started
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Node 1 connected
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Starting initial configuration change
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Configuration 1 commited
2014-01-26 00:56:23 [MgmtSrvr] INFO     -- Config change completed! New generation: 1
2014-01-26 00:56:35 [MgmtSrvr] INFO     -- Nodeid 3 allocated for NDB at 127.0.0.1
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 1: Node 3 Connected
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Buffering maximum epochs 100
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Start phase 0 completed 
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Initial start with nodes 3 [ missing:  no-wait:  ]
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: CM_REGCONF president = 3, own Node = 3, our dynamic id = 0/1
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Start phase 1 completed 
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Local redo log file initialization status:
#Total files: 64, Completed: 0
#Total MBytes: 1024, Completed: 0
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Local redo log file initialization completed:
#Total files: 64, Completed: 64
#Total MBytes: 1024, Completed: 1024
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Start phase 2 completed (initial start)
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Start phase 3 completed (initial start)
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: Start phase 4 completed (initial start)
2014-01-26 00:56:36 [MgmtSrvr] INFO     -- Node 3: GCP Monitor: unlimited lags allowed
2014-01-26 00:56:37 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 1 started. Keep GCI = 1 oldest restorable GCI = 1
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Local checkpoint 1 completed
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 5 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 6 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: President restarts arbitration thread [state=1]
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 7 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 8 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 9 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 100 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Start phase 101 completed (initial start)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Started (mysql-5.6.15 ndb-7.3.4)
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Node 1: API mysql-5.6.15 ndb-7.3.4
2014-01-26 00:56:38 [MgmtSrvr] INFO     -- Node 3: Started arbitrator node 1 [ticket=75ab00010631005c]
2014-01-26 00:56:39 [MgmtSrvr] INFO     -- Node 3: Communication to Node 50 opened
2014-01-26 00:57:02 [MgmtSrvr] INFO     -- Nodeid 50 allocated for API at 192.168.0.101
2014-01-26 00:57:02 [MgmtSrvr] INFO     -- Node 50: mysqld --server-id=0
2014-01-26 00:57:02 [MgmtSrvr] INFO     -- Node 3: Node 50 Connected
2014-01-26 00:57:02 [MgmtSrvr] INFO     -- Node 3: Node 50: API mysql-5.6.15 ndb-7.3.4
2014-01-26 00:57:03 [MgmtSrvr] ALERT    -- Node 3: Node 50 Disconnected
2014-01-26 00:57:03 [MgmtSrvr] INFO     -- Node 3: Communication to Node 50 closed
2014-01-26 00:57:03 [MgmtSrvr] INFO     -- Node 3: DICT: api:0x80070032 trans:0x6b8b4567 takeover completed
2014-01-26 00:57:06 [MgmtSrvr] INFO     -- Node 3: Communication to Node 50 opened
[root@olga64 cluster]#
Suggested fix:
Do not crash with segmentation fault. ;-)
[14 Mar 2014 10:28] MySQL Verification Team
Hello Mario,

Thank you for the bug report.
Verified as described.

Thanks,
Umesh
[16 Mar 2014 20:03] Mario Beck
This problem does not occur when using the RHEL/OL RPM package.
It seems only the generic Linux tar has this issue.
I also tried 7.3.3, same problem.
[2 Jul 2014 8:45] z z
Any update? I also ran into this problem. It did not even show an error when it crashed; this nasty bug wasted a whole day before I found out what the problem was.

I am using mysql-cluster-gpl-7.3.5-linux-glibc2.5-x86_64

If I shut down the management node, I can make mysqld work.

How to resolve this problem?
[14 Jan 2015 14:51] Jan Wedvik
Posted by developer:
 
When trying to start the cluster with thread_stack=128K (in my.cnf), I get the following core dump:

Program terminated with signal 11, Segmentation fault.
#0  0x00000000007ad785 in build_table_filename (
    buff=0x7f9e25ef93a0 "\210\207", bufflen=511, db=0x7f9e25efbbf0 "mysql", 
    table_name=0x7f9e25efb9e0 "ndb_schema", ext=0xe65d48 ".ARZ", flags=0, 
    was_truncated=0x7f9e25ef9678)
    at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/sql_table.cc:562
562	/export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/sql_table.cc: No such file or directory.
	in /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/sql_table.cc
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.132.el6.x86_64 libaio-0.3.107-10.el6.x86_64 libgcc-4.4.7-4.el6.x86_64 libstdc++-4.4.7-4.el6.x86_64 nss-softokn-freebl-3.14.3-9.el6.x86_64
(gdb) where
#0  0x00000000007ad785 in build_table_filename (buff=0x7f9e25ef93a0 "\210\207", bufflen=511, db=0x7f9e25efbbf0 "mysql", table_name=0x7f9e25efb9e0 "ndb_schema", ext=0xe65d48 ".ARZ", flags=0, was_truncated=0x7f9e25ef9678) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/sql_table.cc:562
#1  0x0000000000cd371b in build_table_filename (hton=<value optimized out>, thd=<value optimized out>, db=0x200 <Address 0x200 out of bounds>, name=0x7f9e25efb9e0 "ndb_schema", frmblob=0x7f9e25efa750, frmlen=0x7f9e25efa748) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/sql_table.h:165
#2  archive_discover (hton=<value optimized out>, thd=<value optimized out>, db=0x200 <Address 0x200 out of bounds>, name=0x7f9e25efb9e0 "ndb_schema", frmblob=0x7f9e25efa750, frmlen=0x7f9e25efa748) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/storage/archive/ha_archive.cc:256
#3  0x00000000006010c3 in discover_handlerton (thd=0x7f9dfc000990, plugin=<value optimized out>, arg=<value optimized out>) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/handler.cc:5207
#4  0x0000000000775c24 in plugin_foreach_with_mask (thd=0x7f9dfc000990, func=0x601090 <discover_handlerton(THD*, plugin_ref, void*)>, type=13, state_mask=<value optimized out>, arg=0x7f9e25ef9800) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/sql_plugin.cc:2162
#5  0x000000000060288d in ha_discover (thd=0x7f9dfc000990, db=<value optimized out>, name=<value optimized out>, frmblob=<value optimized out>, frmlen=<value optimized out>) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/handler.cc:5227
#6  0x0000000000607e1b in ha_create_table_from_engine (thd=0x7f9dfc000990, db=0x7f9e25efbbf0 "mysql", name=0x7f9e25efb9e0 "ndb_schema") at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/handler.cc:4804
#7  0x0000000000bb7595 in ndb_create_table_from_engine (thd=0x7f9dfc000990, db=<value optimized out>, table_name=0xe3c2d2 "ndb_schema") at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/ha_ndbcluster.cc:11791
#8  0x0000000000bd2acd in ndb_binlog_setup (thd=0x7f9dfc000990) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/ha_ndbcluster_binlog.cc:1573
#9  0x0000000000bb0e6c in Ndb_util_thread::do_run (this=<value optimized out>) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/ha_ndbcluster.cc:15650
#10 0x0000000000be85e1 in Ndb_component::run_impl (this=0x15c60a0) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/ndb_component.cc:92
#11 0x0000000000be86a9 in Ndb_component_run_C (arg=0x15c60a0) at /export/home/pb2/build/sb_0-13403650-1412865504.46/mysql-cluster-gpl-7.3.7/sql/ndb_component.cc:51
#12 0x00007f9e3ab7e9d1 in start_thread () from /lib64/libpthread.so.0
#13 0x00007f9e398e7b6d in clone () from /lib64/libc.so.6

Apparently, the stack was exhausted: 

(gdb) p $rsp
$22 = (void *) 0x7f9e25eece30
(gdb) frame 13
#13 0x00007f9e398e7b6d in clone () from /lib64/libc.so.6
(gdb) p $rsp
$23 = (void *) 0x7f9e25f0bfe0
(gdb) p (0x7f9e25f0bfe0 - 0x7f9e25eece30)/1024
$25 = 124
(gdb) 

The workaround is to leave thread_stack at the default 256K. Suggested fix: increase the minimum allowed value of thread_stack to 256K. Alternatively, modify some of the functions on the call stack above so that they use less stack space.
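In terms of the reporter's my.cnf, the workaround amounts to a one-line change (a sketch; everything else in the file stays as posted):

```ini
[mysqld]
# ... rest of the reporter's my.cnf unchanged ...
# 128K is too small for the NDB binlog setup path shown in the
# backtrace; remove the line to get the default, or set it explicitly:
thread_stack = 256K
```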
[15 Jan 2015 12:11] Jan Wedvik
Posted by developer:
 
In the core dump described above, archive_discover() uses about 49K of stack space, while ndb_binlog_setup() uses 63K. For archive_discover(), this can be explained by the local variables that it has. ndb_binlog_setup() only has a few bytes of local variables, but in this build there is extensive inlining of functions called by ndb_binlog_setup(). That means that the stack frame for ndb_binlog_setup() must have space for local variables of those inlined functions too, and this is probably the cause of the large stack footprint. If this assumption is correct, a fix could be to limit stack usage in inlineable functions called from ndb_binlog_setup(), or to somehow reduce inlining.