Bug #96918 | MySQL crash - mysqld got signal 11 | ||
---|---|---|---|
Submitted: | 18 Sep 2019 5:52 | Modified: | 18 Sep 2019 10:09 |
Reporter: | Prasad N | Email Updates: | |
Status: | Can't repeat | Impact on me: | |
Category: | MySQL Server: Replication | Severity: | S2 (Serious) |
Version: | 5.7.25 | OS: | CentOS (6.8) |
Assigned to: | MySQL Verification Team | CPU Architecture: | Any |
[18 Sep 2019 5:52]
Prasad N
[18 Sep 2019 10:09]
MySQL Verification Team
Hi, I cannot reproduce this using the data available. From the log I'd say the crash is inside the yaSSL library. It may be something wrong with your SSL keys or something similar, or a yaSSL bug, but there is nothing I can do at this point without a repeatable test case. Thanks
[28 Oct 2019 16:46]
Paul Hemberger
We've seen this same crash several times on both Percona 5.7.24 and 5.7.25. Additionally, we're using OpenSSL, so the issue doesn't appear confined to something internal to yaSSL. Our setup is similar: single master --> two semisync replicas. We don't have a repro, and it occurs without us manipulating the slave status or replication setup. Attaching a few recent error logs + our my.cnf
[28 Oct 2019 16:48]
Paul Hemberger
mysql semisync segfault, openssl
Attachment: mysql-semisync-segfault.txt (text/plain), 2.47 KiB.
[28 Oct 2019 16:49]
Paul Hemberger
my.cnf
Attachment: phemberger-mycnf.txt (text/plain), 4.02 KiB.
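The attached my.cnf is not reproduced inline here. As a minimal sketch of the replication-related pieces the setup above describes (semisync, replication over SSL, row-based replication), expressed as runtime SQL equivalents rather than the actual file contents; the plugin and variable names are the standard 5.7 semisync ones, while the host name and everything else is an illustrative assumption:

-- On the master (assumed setup, not taken from the attachment):
INSTALL PLUGIN rpl_semi_sync_master SONAME 'semisync_master.so';
SET GLOBAL rpl_semi_sync_master_enabled = 1;
SET GLOBAL binlog_format = 'ROW';

-- On each replica (assumed setup; credentials omitted):
INSTALL PLUGIN rpl_semi_sync_slave SONAME 'semisync_slave.so';
SET GLOBAL rpl_semi_sync_slave_enabled = 1;
CHANGE MASTER TO
  MASTER_HOST = 'master-host',    -- hypothetical host name
  MASTER_USER = 'repl',
  MASTER_SSL  = 1,                -- replication over SSL, as in the report
  MASTER_AUTO_POSITION = 1;
START SLAVE;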
[20 Dec 2019 15:47]
Paul Hemberger
We've been able to reproduce this in our environment, though it can take a little time. We've found it affects clusters that are putting a lot of pressure on replication. Our repro:

1. Create a cluster, 1 master --> 2 replicas, semisync + replication SSL enabled + RBR
2. Create a table that can contain some large rows, e.g.
   CREATE TABLE `semisync_repro` (
     `id` BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
     `jsonBlob` json NOT NULL,
     `updatedAt` BIGINT UNSIGNED NULL,
     PRIMARY KEY (`id`)
   ) ENGINE=InnoDB ROW_FORMAT=COMPRESSED KEY_BLOCK_SIZE=8;
3. Seed the table with a bunch of data. We inserted 200K rows with random JSON values ranging between 16kb-64kb
4. Randomly pick rows and update the `updatedAt` column. We did ~1500 updates/s and ran it for many hours.

We've run this trial several times. We can often trigger the semisync segfault after 3-6 hours, though it can take longer. In one case, we hit the issue 13 times over 36 hours.
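The actual driver script is not attached; as a rough sketch of the workload statements the steps above describe (the payload string and the id range of 200000 are illustrative assumptions tied to the seed size mentioned, not values from the report):

-- Seed (step 3): repeated ~200K times by a client script, with jsonBlob
-- payloads of roughly 16-64 KB of random JSON per row.
INSERT INTO semisync_repro (jsonBlob, updatedAt)
VALUES ('{"filler": "random JSON payload, 16-64 KB in practice"}', UNIX_TIMESTAMP());

-- Update load (step 4): issued at ~1500/s for several hours, each time
-- against a randomly chosen existing id.
UPDATE semisync_repro
   SET updatedAt = UNIX_TIMESTAMP()
 WHERE id = FLOOR(1 + RAND() * 200000);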
[20 Dec 2019 15:48]
Paul Hemberger
On the replica side, we see errors before the segfault like:

2019-12-11T13:00:03.847978Z 63 [ERROR] Semi-sync slave net_flush() reply failed
... (tons of these reply failed lines)
2019-12-11T13:00:03.849833Z 63 [ERROR] Semi-sync slave net_flush() reply failed
...
2019-12-11T13:00:03.855020Z 63 [ERROR] Semi-sync slave net_flush() reply failed
2019-12-11T13:00:03.857813Z 63 [ERROR] Error reading packet from server for channel '': Lost connection to MySQL server during query (server_errno=2013)
2019-12-11T13:00:03.857826Z 63 [Note] Slave I/O thread: Failed reading log event, reconnecting to retry, log 'vt-1636194900-bin.032114' at position 51369533 for channel ''
2019-12-11T13:00:03.857835Z 63 [Warning] Storing MySQL user name or password information in the master info repository is not secure and is therefore not recommended. Please consider using the USER and PASSWORD connection options for START SLAVE; see the 'START SLAVE Syntax' in the MySQL Manual for more information.
2019-12-11T13:00:08.230723Z 64 [Note] Slave SQL for channel '': Rolling back unfinished transaction (no COMMIT or ROLLBACK in relay log). A probable cause is partial transaction left on relay log because of restarting IO thread with auto-positioning protocol. Error_code: 0
2019-12-11T13:00:24.006101Z 63 [ERROR] Semi-sync slave net_flush() reply failed
2019-12-11T13:00:24.006139Z 63 [ERROR] Semi-sync slave net_flush() reply failed

On the master, we see errors:

2019-12-11T13:00:03.848368Z 225 [Note] Stop semi-sync binlog_dump to slave (server_id: 14670181)
2019-12-11T13:00:03.848414Z 225 [Note] Aborted connection 225 to db: 'unconnected' user: 'repl' host: 'hostname' (Failed on my_net_write())
2019-12-11T13:00:03.848421Z 0 [ERROR] /usr/sbin/mysqld: Got an error reading communication packets
2019-12-11T13:00:03.883311Z 27466 [Note] Start binlog_dump to master_thread_id(27466) slave_server(14670181), pos(, 4)
2019-12-11T13:00:03.883329Z 27466 [Note] Start semi-sync binlog_dump to slave (server_id: 14670181), pos(, 4)
2019-12-11T13:00:08.060436Z 27467 [Note] Bad handshake
2019-12-11T13:00:24.006578Z 27466 [Note] Stop semi-sync binlog_dump to slave (server_id: 14670181)
2019-12-11T13:00:24.006627Z 27466 [Note] Aborted connection 27466 to db: 'unconnected' user: 'repl' host: 'hostname' (Failed on my_net_write())
2019-12-11T13:00:24.006636Z 0 [ERROR] /usr/sbin/mysqld: Got an error reading communication packets

13:00:24 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary ...
[20 Dec 2019 15:56]
Paul Hemberger
Two additional notes, one bad and one good.

The bad -- we run semisync with

rpl_semi_sync_master_timeout = 1000000000000000000
rpl_semi_sync_master_wait_no_slave = 1

which suggests that the master should never commit data that has not been acked by >= 1 replica. We have found that sometimes during this crash, the failed master can come back up with writes that no other slave has received (!)

The good -- we have not been able to reproduce the failure with replication SSL disabled.
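A sketch of how the settings quoted above are applied and checked at runtime; the variable and status names are the standard 5.7 semisync master ones, and the specific values simply mirror the comment:

SET GLOBAL rpl_semi_sync_master_timeout = 1000000000000000000;  -- effectively never fall back to async
SET GLOBAL rpl_semi_sync_master_wait_no_slave = 1;              -- keep waiting even with no connected replica

-- With these in place the master is expected to block commits until at least
-- one replica has acknowledged the transaction.
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_status';   -- ON while semisync is active
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_clients';  -- number of semisync replicas connected
SHOW GLOBAL STATUS LIKE 'Rpl_semi_sync_master_no_tx';    -- commits not acked by any replica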
[20 Dec 2019 15:57]
Paul Hemberger
That's all we've got for now. Would love to know if there's been any movement toward a fix here; this one has been very painful for us.
[20 Dec 2019 17:48]
MySQL Verification Team
Hi,

> We have not been able to reproduce the failure with replication SSL disabled.

Well, since the crash is in the SSL library, if you manage to crash mysqld with SSL disabled it would not be this bug, but some other one. We cannot reproduce this; two teams tried with no luck. We have some guesses but were not able to confirm them (a bad connection with corrupted packets, for one). Without being able to reproduce it, we don't have a starting point to begin fixing it.