MySQL Bugs: #67033: Mysqld got signal 11 during NDB binlog setup after cluster failure

Bug #67033	Mysqld got signal 11 during NDB binlog setup after cluster failure
Submitted:	1 Oct 2012 11:07	Modified:	6 Jun 2016 14:25
Reporter:	Przemysław Ołtarzewski	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Cluster: Cluster (NDB) storage engine	Severity:	S1 (Critical)
Version:	7.2.7	OS:	Solaris (5.10)
Assigned to:	MySQL Verification Team	CPU Architecture:	Any
Tags:	binlog, crash, Failure, mysqld, ndb, Signal 11, solaris

Description:
1. Environment

Hardware & OS:

4 machines with SunOS 5.10 Generic_142900-04 sun4v sparc SUNW,SPARC-Enterprise-T5120, 32 GB RAM

Cluster config:

4 data nodes
2 management servers
2 replicas
4 mysqld instances

Currently there is no load on NDB engine, cluster configuration is evaluated for stability before being used in production. However, mysqld instances are already connected to the cluster.

2. Scenario:

For no obvious reason, under normal load on mysqld, ndb cluster failure was reported. This behavior is repeatable, currently under analysis, and may be related to some network issues. Normally, this kind of failure has no effect on mysqld processes connected to the cluster.

Last week, however, the cluster failure resulted in mysqld crash, due to signal 11. Backtrace found in mysqld error file suggests that something went wrong during ndb binlog setup, which resulted in SIGSEGV.

After the crash, mysqld was successfully restarted by mysqld_safe, however the database was unavailable for some time in production environment which resulted in service outage.

3. Information from mysqld error log (<hostname>.err):

[...]

120926 21:53:35 [Note] NDB Binlog: Node: 11, down, Subscriber bitmask 00
120926 21:53:35 [Note] NDB Binlog: Node: 12, down, Subscriber bitmask 00
120926 21:53:35 [Note] NDB Binlog: Node: 21, down, Subscriber bitmask 00
120926 21:53:35 [Note] NDB Binlog: Node: 22, down, Subscriber bitmask 00
120926 21:53:35 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 803477/14.
120926 21:53:35 [Note] NDB Binlog: cluster failure for ./mysql/ndb_apply_status at epoch 803477/14.
120926 21:53:35 [Note] Restarting Cluster Binlog
120926 21:53:36 [Note] Restarting Cluster Binlog
120926 21:53:37 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema
19:53:37 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed,
something is definitely wrong and this may fail.

key_buffer_size=402653184
read_buffer_size=2097152
max_used_connections=848
max_threads=930
thread_count=760
connection_count=758
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4212929 K bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x11c722000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:my_print_stacktrace+0x2c
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:handle_fatal_signal+0x334
/lib/sparcv9/libc.so.1:0xd6590
/lib/sparcv9/libc.so.1:0xca0f4
/lib/sparcv9/libc.so.1:0xca2ec
/lib/sparcv9/libc.so.1:strlen+0x50 [ Signal 930 (?)]
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cbEndbcluster_create_binlog_setup6FpnDTHD_pnDNdb_pkcI55pnFTABLE__i_+0x3bc
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cNha_ndbclusterGcreate6MpkcpnFTABLE_pnYst_ha_create_information__i_+0x244
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cHhandlerJha_create6MpkcpnFTABLE_pnYst_ha_create_information__i_+0x20
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cbBha_create_table_from_engine6FpnDTHD_pkc3_i_+0x1a4
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cbCndb_create_table_from_engine6FpnDTHD_pkc3_i_+0x64
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cQndb_binlog_setup6FpnDTHD__b_+0xb8
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cPNdb_util_threadGdo_run6M_v_+0x5cc
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cNNdb_componentIrun_impl6M_v_+0x44
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:Ndb_component_run_C+0xc
/lib/sparcv9/libc.so.1:0xd6464
Please read http://dev.mysql.com/doc/refman/5.1/en/resolve-stack-dump.html
and follow instructions on how to resolve the stack trace.
Resolved stack trace is much more helpful in diagnosing the
problem, so please do resolve it

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 0
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

120926 21:53:45 mysqld_safe mysqld restarted

[...]

How to repeat:
Unable to repeat - observed for the first time, no obvious reason of mysqld crash.

Suggested fix:
No possible fix identified.

Another case encountered 2012-10-09 at 06:16 A.M.

Identical scenario, this time it was possible to provide a complete set of logs and traces.

For the following reasons:
- the logs archive size is greater than 500KB,
- oracle ftp policy is to delete any provided files after 21 days,
- estimated MySQL developer team reaction time based on past experiences may be greater than 21 days,

I attach separate archives for all cluster nodes:
1, 2 - management servers
11, 12, 21, 22 - data nodes
config.ini - cluster configuration file

Management server 1 logs

Attachment: ndb_1.tar.bz2 (application/bzip2, text), 146.61 KiB.

Management server 2 logs

Attachment: ndb_2.tar.bz2 (application/bzip2, text), 140.43 KiB.

Data node 11 logs

Attachment: ndb_11.tar.bz2 (application/bzip2, text), 148.68 KiB.

Data node 12 log files

Attachment: ndb_12.tar.bz2 (application/bzip2, text), 224.33 KiB.

Data node 21 logs

Attachment: ndb_21.tar.bz2 (application/bzip2, text), 294.89 KiB.

Data node 22 logs - part 1

Attachment: ndb_22_p1.tar.bz2 (application/bzip2, text), 485.79 KiB.

Data node 22 logs - part 2

Attachment: ndb_22_p2.tar.bz2 (application/bzip2, text), 206.12 KiB.

Cluster configuration file

Attachment: config.ini (application/octet-stream, text), 2.79 KiB.

For the sake of completeness, entries found in mysqld error log for the failure on 2012-10-09 06:16 provided below:

121009  6:15:58 [Note] NDB Binlog: cluster failure for ./mysql/ndb_schema at epoch 1304855/8.
121009  6:15:59 [Note] NDB Binlog: cluster failure for ./mysql/ndb_apply_status at epoch 1304855/8.
121009  6:15:59 [Note] Restarting Cluster Binlog
121009  6:15:59 [Note] NDB Binlog: DISCOVER TABLE Event: REPL$mysql/ndb_schema
04:15:59 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

key_buffer_size=402653184
read_buffer_size=2097152
max_used_connections=816
max_threads=930
thread_count=694
connection_count=690
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 4212929 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x11c722000
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:my_print_stacktrace+0x2c
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:handle_fatal_signal+0x334
/lib/sparcv9/libc.so.1:0xd6590
/lib/sparcv9/libc.so.1:0xca0f4
/lib/sparcv9/libc.so.1:0xca2ec
/lib/sparcv9/libc.so.1:strlen+0x50 [ Signal 930 (?)]
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cbEndbcluster_create_binlog_setup6FpnDTHD_pnDNdb_pkcI55pnFTABLE__i_+0x3bc
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cNha_ndbclusterGcreate6MpkcpnFTABLE_pnYst_ha_create_information__i_+0x244
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cHhandlerJha_create6MpkcpnFTABLE_pnYst_ha_create_information__i_+0x20
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cbBha_create_table_from_engine6FpnDTHD_pkc3_i_+0x1a4
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cbCndb_create_table_from_engine6FpnDTHD_pkc3_i_+0x64
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cQndb_binlog_setup6FpnDTHD__b_+0xb8
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cPNdb_util_threadGdo_run6M_v_+0x5cc
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:__1cNNdb_componentIrun_impl6M_v_+0x44
/inc/mysql-cluster-gpl-7.2.7-solaris10-sparc-64bit/bin/mysqld:Ndb_component_run_C+0xc
/lib/sparcv9/libc.so.1:0xd6464
Please read http://dev.mysql.com/doc/refman/5.1/en/resolve-stack-dump.html
and follow instructions on how to resolve the stack trace.
Resolved stack trace is much more helpful in diagnosing the
problem, so please do resolve it

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 0
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
121009 06:16:05 mysqld_safe mysqld restarted

cannot reproduce on modern versions, iirc this one was fixed post 7.2.16

I am still getting this error on version:  5.6.37-ndb-7.4.16-cluster-gpl-log

2017-12-04 22:20:01 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006949'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-04 22:25:01 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006950'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-05 22:20:01 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006973'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-05 22:25:01 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006974'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-05 22:30:10 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006975'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-05 22:35:10 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006976'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-05 22:40:10 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006977'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
2017-12-05 22:45:10 17979 [ERROR] NDB: Query 'DELETE FROM mysql.ndb_binlog_index WHERE File='/var/lib/mysql/binlogs/mysql-bin.006978'' failed, error: 1205: Lock wait timeout exceeded; try restarting transaction
21:45:10 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
We will try our best to scrape up some info that will hopefully help
diagnose the problem, but since we have already crashed, 
something is definitely wrong and this may fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=30
max_threads=151
thread_count=14
connection_count=13
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68108 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x7f8c6c0009b0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 7f8ca4e59910 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0x963a2b]
/usr/sbin/mysqld(handle_fatal_signal+0x491)[0x6efd01]
/lib64/libpthread.so.0(+0xf370)[0x7f8cce122370]
/lib64/libc.so.6(+0x86b71)[0x7f8cccd6fb71]
/usr/sbin/mysqld[0xbc0748]
/usr/sbin/mysqld(_ZN17Ndb_binlog_thread6do_runEv+0x364c)[0xbc8f5c]
/usr/sbin/mysqld(_ZN13Ndb_component8run_implEv+0x74)[0xbde474]
/usr/sbin/mysqld(Ndb_component_run_C+0x19)[0xbde499]
/lib64/libpthread.so.0(+0x7dc5)[0x7f8cce11adc5]
/lib64/libc.so.6(clone+0x6d)[0x7f8cccde076d]

Trying to get some variables.
Some pointers may be invalid and cause the dump to abort.
Query (0): is an invalid pointer
Connection ID (thread ID): 1
Status: NOT_KILLED

The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.