Bug #96918 MySQL crash - mysqld got signal 11
Submitted: 18 Sep 2019 5:52    Modified: 18 Sep 2019 10:09
Reporter: Prasad N
Status: Can't repeat           Impact on me: None
Category: MySQL Server: Replication    Severity: S2 (Serious)
Version: 5.7.25                OS: CentOS (6.8)
Assigned to: MySQL Verification Team   CPU Architecture: Any

[18 Sep 2019 5:52] Prasad N
Description:
I saw a MySQL crash on my semisync master. Here's what I found in the error logs.

2019-09-17T20:05:47.262763Z 207 [Note] Start binlog_dump to master_thread_id(207) slave_server(241), pos(, 4)
2019-09-17T20:05:47.262785Z 207 [Note] Start semi-sync binlog_dump to slave (server_id: 241), pos(, 4)
2019-09-17T20:10:53.232978Z 0 [ERROR] /usr/sbin/mysqld: Got timeout reading communication packets
20:10:53 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=28
max_threads=500
thread_count=2
connection_count=2
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 206883 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/usr/sbin/mysqld(my_print_stacktrace+0x35)[0xf559e5]
/usr/sbin/mysqld(handle_fatal_signal+0x4a4)[0x7d4bd4]
/lib64/libpthread.so.0(+0xf5e0)[0x7fd608b905e0]
/usr/sbin/mysqld(_ZN5yaSSL3SSL12bufferedDataEv+0x30)[0x12ec080]
/usr/sbin/mysqld(yaSSL_pending+0x22)[0x12ddfe2]
/usr/sbin/mysqld(vio_ssl_has_data+0x10)[0x12d3960]
/usr/lib64/mysql/plugin/semisync_master.so(_ZN12Ack_receiver3runEv+0x3c9)[0x7fd5ec047a59]
/usr/lib64/mysql/plugin/semisync_master.so(ack_receive_handler+0x19)[0x7fd5ec047e59]
/usr/sbin/mysqld(pfs_spawn_thread+0x174)[0xf73c64]
/lib64/libpthread.so.0(+0x7de5)[0x7fd608b88de5]
/lib64/libc.so.6(clone+0x6d)[0x7fd607648f1d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
2019-09-17T20:13:23.862416Z 0 [Warning] TIMESTAMP with implicit DEFAULT value 

How to repeat:
I don't have a repeatable test case.
This is a very simple semisynchronous setup with a master and 2 slaves.
SSL is enabled on the MySQL servers.
The crash happened when we manually stopped the I/O threads on one of the slaves (a generic sketch of this step follows below).
The issue did not recur when we repeated the same steps afterwards.
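
For reference, a rough sketch of this kind of step on a replica (illustrative statements for the default replication channel; not the exact commands used at the time):

-- On one of the semisync replicas: stop only the I/O thread, then restart it later.
STOP SLAVE IO_THREAD;
-- ... observe the master ...
START SLAVE IO_THREAD;
SHOW SLAVE STATUS\G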

Here is our my.cnf:

[mysqld]
port=3306
skip_name_resolve
bind_address=0.0.0.0
user=mysql
pid_file=/var/run/mysqld/mysqld.pid
socket=/var/lib/mysql/mysql.sock
server_id=245
require_secure_transport=ON
ssl=1
log_bin=/mysql_data/binlogs/mysql-bin
expire_logs_days=21
sync_binlog=1
max_connect_errors=1000000
max_allowed_packet=16777216
max_heap_table_size=33554432
max_connections=500
max_user_connections=400
thread_cache_size=50
open_files_limit=65535
table_open_cache=2048
table_definition_cache=2048
relay_log=/mysql_data/relaylogs/relay-log
gtid_mode=ON
enforce_gtid_consistency=ON
binlog_format=MIXED
log_slave_updates=true
slave_net_timeout=60
master_info_repository=TABLE
relay_log_info_repository=TABLE
sync_master_info=10000
sync_relay_log=10000
relay_log_recovery=1
slave_parallel_workers=30
slave_preserve_commit_order=1
slave_parallel_type=LOGICAL_CLOCK
rpl_semi_sync_master_timeout=1500000
rpl_semi_sync_master_wait_for_slave_count=1
rpl_semi_sync_master_wait_no_slave=ON
relay_log_space_limit=0
max_relay_log_size=0
max_binlog_size=104857600
datadir=/mysql_data/data
general_log=OFF
general_log_file=/var/log/mysql/general.log
log_error=/var/log/mysql/mysqld.log
default_storage_engine=innodb
innodb_log_files_in_group=2
innodb_flush_log_at_trx_commit=1
innodb_flush_method=O_DIRECT
innodb_file_per_table=ON
innodb_log_file_size=134217728
innodb_buffer_pool_size=134217728
innodb_io_capacity=200
innodb_adaptive_hash_index=ON
innodb_lock_wait_timeout=50
innodb_print_all_deadlocks=ON
log_queries_not_using_indexes=OFF
log_slow_admin_statements=OFF
log_throttle_queries_not_using_indexes=0
long_query_time=10
slow_query_log=ON
slow_query_log_file=/var/log/mysql/mysql-slowquery.log
symbolic_links=0
interactive_timeout=28800
div_precision_increment=4
sql_mode="ONLY_FULL_GROUP_BY,STRICT_ALL_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION"
lower_case_table_names=0
[18 Sep 2019 10:09] MySQL Verification Team
Hi,

I cannot reproduce this with the data available. From the log I'd say the crash is inside the yaSSL library. It may be something wrong with your SSL keys, something similar, or a yaSSL bug, but there is nothing I can do at this point without a repeatable test case.

Thanks
[28 Oct 2019 16:46] Paul Hemberger
We've seen this same crash several times on both Percona 5.7.24 and 5.7.25. Additionally, we're using OpenSSL, so the issue doesn't appear confined to something internal to yassl.

Our setup is similar: single master --> two semisync replicas. We don't have a repro, and it occurs without us manipulating the slave status or replication setup.

Attaching a few recent error logs + our my.cnf
[28 Oct 2019 16:48] Paul Hemberger
mysql semisync segfault, openssl

Attachment: mysql-semisync-segfault.txt (text/plain), 2.47 KiB.

[28 Oct 2019 16:49] Paul Hemberger
my.cnf

Attachment: phemberger-mycnf.txt (text/plain), 4.02 KiB.

[20 Dec 2019 15:47] Paul Hemberger
We've been able to reproduce this in our environment, though it can take a little time. We've found it affects clusters that are putting a lot of pressure on replication.

Our repro:

1. Create a cluster, 1 master --> 2 replicas, semisync + replication SSL enabled + RBR

2. Create a table that can contain some large rows. e.g.

CREATE TABLE `semisync_repro` (
  `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
  `jsonBlob` json NOT NULL,
  `updatedAt` BIGINT UNSIGNED NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;

3. Seed the table with data. We inserted 200K rows with random JSON values ranging between 16 KB and 64 KB.

4. Randomly pick rows and update the `updatedAt` column. We ran roughly 1,500 updates/s for many hours (a sketch of the workload follows the steps below).
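
A rough sketch of the workload statements against the table above (the JSON payload size and id range are illustrative, not the exact generator used):

-- Seed: rows with a sizeable JSON payload (illustrative value, roughly 32 KB).
INSERT INTO semisync_repro (jsonBlob, updatedAt)
VALUES (JSON_OBJECT('padding', REPEAT('x', 32768)), UNIX_TIMESTAMP());

-- Update loop: pick a random existing id and touch updatedAt (~1500 statements/s from the client).
SET @rid = FLOOR(1 + RAND() * 200000);
UPDATE semisync_repro SET updatedAt = UNIX_TIMESTAMP() WHERE id = @rid;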

We've run this trial several times. We can often trigger the semisync segfault after 3-6 hours, though it can take longer. In one case, we hit the issue 13 times over 36 hours.
[20 Dec 2019 15:48] Paul Hemberger
On the replica side, we see errors before the segfault like: 

2019-12-11T13:00:03.847978Z 63 [ERROR] Semi-sync slave net_flush() reply failed
... (tons of these reply failed lines)
2019-12-11T13:00:03.849833Z 63 [ERROR] Semi-sync slave net_flush() reply failed
...
2019-12-11T13:00:03.855020Z 63 [ERROR] Semi-sync slave net_flush() reply failed
2019-12-11T13:00:03.857813Z 63 [ERROR] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013)
2019-12-11T13:00:03.857826Z 63 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'vt-1636194900-bin.032114' at position 51369533 for channel ''
2019-12-11T13:00:03.857835Z 63 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2019-12-11T13:00:08.230723Z 64 [Note] Slave SQL for channel '': Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is partial transaction left on relay log because of restarting IO thread with auto-positioning protocol. Error_code: 0
2019-12-11T13:00:24.006101Z 63 [ERROR] Semi-sync slave net_flush() reply failed
2019-12-11T13:00:24.006139Z 63 [ERROR] Semi-sync slave net_flush() reply failed

On the master, we see errors:

2019-12-11T13:00:03.848368Z 225 [Note] Stop semi-sync binlog_dump to slave (server_id: 14670181)
2019-12-11T13:00:03.848414Z 225 [Note] Aborted connection 225 to db: 'unconnected' user: 'repl' host: 'hostname' (Failed on my_net_write())
2019-12-11T13:00:03.848421Z 0 [ERROR] /usr/sbin/mysqld: Got an error reading communication packets
2019-12-11T13:00:03.883311Z 27466 [Note] Start binlog_dump to master_thread_id(27466) slave_server(14670181), pos(, 4)
2019-12-11T13:00:03.883329Z 27466 [Note] Start semi-sync binlog_dump to slave (server_id: 14670181), pos(, 4)
2019-12-11T13:00:08.060436Z 27467 [Note] Bad handshake
2019-12-11T13:00:24.006578Z 27466 [Note] Stop semi-sync binlog_dump to slave (server_id: 14670181)
2019-12-11T13:00:24.006627Z 27466 [Note] Aborted connection 27466 to db: 'unconnected' user: 'repl' host: 'hostname' (Failed on my_net_write())
2019-12-11T13:00:24.006636Z 0 [ERROR] /usr/sbin/mysqld: Got an error reading communication packets
13:00:24 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
...
[20 Dec 2019 15:56] Paul Hemberger
Two additional notes, one good and one bad.

The bad -- we run semisync with 

rpl_semi_sync_master_timeout = 1000000000000000000
rpl_semi_sync_master_wait_no_slave = 1

which suggests that the master should never commit data that has not been acked by >=1 replica. We have found that sometimes during this crash, the failed master can come back up with writes that no other slave has received (!)
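
To watch for this, the semisync plugin's variables and counters on the master can be checked (generic queries, not taken from the logs above):

-- Plugin settings in effect on the master.
SHOW GLOBAL VARIABLES LIKE 'rpl_semi_sync_master%';

-- Rpl_semi_sync_master_status: ON while semisync is active, OFF after falling back to async.
-- Rpl_semi_sync_master_no_tx: commits that were not acknowledged by any replica.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master%';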

The good --

We have not been able to reproduce the failure with replication SSL disabled.
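
Disabling replication SSL for a comparison run can be done per replica along these lines (a generic sketch; other channel options omitted):

STOP SLAVE;
CHANGE MASTER TO MASTER_SSL = 0;
START SLAVE;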
[20 Dec 2019 15:57] Paul Hemberger
That's all we've got for now. Would love to know if there's been any movement toward a fix here; this one has been very painful for us.
[20 Dec 2019 17:48] MySQL Verification Team
Hi,

> We have not been able to reproduce the failure with replication SSL disabled.

Well, since the crash is inside the SSL library, if you manage to crash mysqld with SSL disabled it would not be this bug but a different one.

We cannot reproduce this; two teams have tried without luck. We have some guesses but have not been able to confirm them (a bad connection delivering corrupted packets, for one). Without a way to reproduce it, we don't have a starting point for a fix.