MySQL Bugs: #88653: Random crash on replication slave, with multiple master

Bug #88653	Random crash on replication slave, with multiple master
Submitted:	25 Nov 2017 8:31	Modified:	16 Mar 2018 6:51
Reporter:	Eric Fjøsne	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	5.7.21	OS:	Ubuntu (16.04.3)
Assigned to:		CPU Architecture:	Any
Tags:	innodb; replication; channel;

Description:
We are experiencing repeated and random crashes on a replication slave of multiple masters.

This server is replicating several other MySQL servers via multiple channels (4 to be exact). What we can observe is that at some point the server crashes, then restarts and launches innodb recovery processes. After this, we have again a running MySQL server, but the master positioning for all channels is rolled back to some point in the past ... but not the data. It is necessary to skip the records one by one for each replication channel, until we reach the point when the crash happened, so we can resume the replication.

Here is a dump of the relevant part of the file /var/log/mysql/error.log:

==========================================
BACKTRACE BEGIN
==========================================
2017-11-24T23:00:35.795018Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 6825ms. The settings might not be optimal. (flushed=3 and evicted=0, during the time.)
2017-11-25 05:32:25 0x7fa6ecbbc700  InnoDB: Assertion failure in thread 140354913027840 in file pars0pars.cc line 822
InnoDB: Failing assertion: sym_node->table != NULL
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
04:32:25 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=8
max_threads=214
thread_count=10
connection_count=2
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 101418 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xe8a93b]
/usr/sbin/mysqld(handle_fatal_signal+0x489)[0x786749]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fce1300a390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fce123c3428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fce123c502a]
/usr/sbin/mysqld[0x75c31a]
/usr/sbin/mysqld(_Z21pars_insert_statementP10sym_node_tPvP10sel_node_t+0x3a8)[0x1024088]
/usr/sbin/mysqld(_Z7yyparsev+0x1227)[0x12455d7]
/usr/sbin/mysqld(_Z8pars_sqlP11pars_info_tPKc+0x9e)[0x10258de]
/usr/sbin/mysqld(_Z13fts_parse_sqlP11fts_table_tP11pars_info_tPKc+0x190)[0x12251e0]
/usr/sbin/mysqld(_Z14fts_write_nodeP5trx_tPP10que_fork_tP11fts_table_tP12fts_string_tP10fts_node_t+0x292)[0x11ffa52]
/usr/sbin/mysqld[0x12031d8]
/usr/sbin/mysqld(_Z14fts_sync_tableP12dict_table_tbbb+0x329)[0x1209219]
/usr/sbin/mysqld(_Z23fts_optimize_sync_tablem+0x42)[0x1210b62]
/usr/sbin/mysqld(_Z19fts_optimize_threadPv+0x57c)[0x121a0bc]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fce130006ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fce124953dd]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
==========================================
BACKTRACE END
==========================================

Server activity and configuration description:
- The instance of MySQL is running on Ubuntu Server 16.04.3 LTS 64 bits with all updates applied. Complete version of the package is: 5.7.20-0ubuntu0.16.04.1.
- The server has 160Gb of RAM, so I believe we can conclude this is not a problem of memory
- That this server is not running any activity other than those replications, at all.

What we can observe:
- The backtrace is the same for all crashes we experienced.
- Not sure whether this is linked, but we very frequently get the "[Note] Error reading relay log event for channel 'XXXX': slave SQL thread was killed - [Note] Slave I/O thread killed while reading event for channel 'XXXX'" in the /var/log/error.log file.
- We always get the ***exact same trace*** any time we experience a crash.

What we already did:
- InnoDB recovery has already been performed, automatically after the server crash and manually afterwards. Ended with no error.
- We double checked that the hardware wasn't faulty (checked the whole filesystem consistency via fsck + confirmed on the server that all was ok). All was OK.
- We checked the consistency of all databases and tables via mysqlcheck. No issue here either.
- We restored the whole database, by dropping all databases and restoring a snapshot from the masters + reconfiguring replications with the correct master positioning matching the snapshots. Replications resumed successfully.

What we are experiencing:
- Despite our attempts at solving this and approaching this from different angles, the crashes keeps randomly happening ... with this exact backtrace.

Any suggestion is more than welcome.

Thanks in advance for your help,

Eric

How to repeat:
This bug happens randomly and I have yet to find a way to reproduce it, unfortunately ...

Suggested fix:
After the crash, the server automatically restarts, but the master positioning is rolled back to some point in the past, but not the data. It is necessary to skip the records one by one for each channel, until we reach the point when the crash happened and the replication actually resumes.

This is not a fix per se, but this is how we fixed it so far.

Hi everyone,

There is also an open ticket on the Ubuntu Launchpad apparently, that matches exactly the same backtrace:
https://bugs.launchpad.net/ubuntu/+source/mysql-5.7/+bug/1729536

Best regards,

Eric

this seems to be a duplicate of
https://bugs.mysql.com/bug.php?id=88844

tables with fulltext indexes being rebuilt lead to this issue.

Unfortunately, we had big hopes but we experienced again this exact bug, with the exact same trace for version 5.1.21.

+-----------------------------+
| VERSION()                   |
+-----------------------------+
| 5.7.21-0ubuntu0.16.04.1-log |
+-----------------------------+

Trace found in error.log:

2018-03-16 05:53:38 0x7f8e4a4bf700  InnoDB: Assertion failure in thread 140249108576000 in file pars0pars.cc line 822
InnoDB: Failing assertion: sym_node->table != NULL
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
04:53:38 UTC - mysqld got signal 6 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=16777216
read_buffer_size=131072
max_used_connections=0
max_threads=214
thread_count=7
connection_count=0
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 101418 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x30000
/usr/sbin/mysqld(my_print_stacktrace+0x3b)[0xe8f29b]
/usr/sbin/mysqld(handle_fatal_signal+0x489)[0x787029]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x11390)[0x7fb5700bc390]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x38)[0x7fb56f475428]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x16a)[0x7fb56f47702a]
/usr/sbin/mysqld[0x75ca3e]
/usr/sbin/mysqld(_Z21pars_insert_statementP10sym_node_tPvP10sel_node_t+0x3a8)[0xf62a08]
/usr/sbin/mysqld(_Z7yyparsev+0x1227)[0x11839a7]
/usr/sbin/mysqld(_Z8pars_sqlP11pars_info_tPKc+0x9e)[0xf6425e]
/usr/sbin/mysqld(_Z13fts_parse_sqlP11fts_table_tP11pars_info_tPKc+0x190)[0x11635b0]
/usr/sbin/mysqld(_Z14fts_write_nodeP5trx_tPP10que_fork_tP11fts_table_tP12fts_string_tP10fts_node_t+0x292)[0x113ddb2]
/usr/sbin/mysqld[0x1141568]
/usr/sbin/mysqld(_Z14fts_sync_tableP12dict_table_tbbb+0x329)[0x11475b9]
/usr/sbin/mysqld(_Z23fts_optimize_sync_tablem+0x42)[0x114ef02]
/usr/sbin/mysqld(_Z19fts_optimize_threadPv+0x57c)[0x115845c]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x76ba)[0x7fb5700b26ba]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d)[0x7fb56f54741d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

We recovered the replication by skipping the duplicate entries that were apparently already processed.

Also, I tried to access the link https://bugs.mysql.com/bug.php?id=88844 provided by Shane, but apparently access is restricted. I would gladly have posted some feedback there, which would seem more effective than posting on a thread considered as duplicate.

Can I be granted access to this bug report ? 

Thanks,

Eric