MySQL Bugs: #90342: MySQL parallel replication automatically restarts

Bug #90342	MySQL parallel replication automatically restarts
Submitted:	9 Apr 2018 2:32	Modified:	21 Jan 2019 7:47
Reporter:	xin yang	Email Updates:
Status:	No Feedback	Impact on me:	None
Category:	MySQL Server: Replication	Severity:	S2 (Serious)
Version:	5.7.16-log	OS:	Red Hat (redhat 6.8 Santiago)
Assigned to:	MySQL Verification Team	CPU Architecture:	x86 (x86_64)
Tags:	i/o thread, ibuffer, MySQL5.7.16, parallel replication, restart

Description:
The background I/O thread invokes the bug that occurs when the ibuffer cursor is reset.Here's the details.

slave  related configuration
	  innodb_buffer_pool_instances: 8
	       innodb_buffer_pool_size: 8.0G
	         innodb_file_per_table: ON
	          innodb_log_file_size: 128.0M
	     innodb_log_files_in_group: 2
	innodb_flush_log_at_trx_commit: 1
	                   sync_binlog: 1
	mysql> show variables like '%slave_para%';
	+------------------------+---------------+
	| Variable_name          | Value         |
	+------------------------+---------------+
	| slave_parallel_type    | LOGICAL_CLOCK |
	| slave_parallel_workers | 16            |
	+------------------------+---------------+

slave error_log
	2018-03-21T03:09:55.027096+08:00 1 [Note] Multi-threaded slave: Coordinator has waited 312251 times hitting slave_pending_jobs_size_max; current event size = 8176.
	2018-03-21T03:09:55.030636+08:00 1 [Note] Multi-threaded slave: Coordinator has waited 312261 times hitting slave_pending_jobs_size_max; current event size = 8176.
	2018-03-21T03:09:55.034202+08:00 1 [Note] Multi-threaded slave: Coordinator has waited 312271 times hitting slave_pending_jobs_size_max; current event size = 8176.
	2018-03-21T03:10:03.687119+08:00 0 [ERROR] InnoDB: ibuf cursor restoration fails!. ibuf record inserted to page 2464:1799
	2018-03-21T03:10:03.687154+08:00 0 [ERROR] InnoDB: Submit a detailed bug report to http://bugs.mysql.com
	PHYSICAL RECORD: n_fields 6; 1-byte offsets; info bits 0
	 0: len 4; hex 0000098a; asc     ;;
	 1: len 1; hex 00; asc  ;;
	 2: len 4; hex 00079cd7; asc     ;;
	 3: len 16; hex 000001010dfe000300210c0f00608021; asc          !   ` !;;
	 4: len 1; hex 31; asc 1;;
	 5: len 30; hex 363630313230313830323232303035313332353338343338363234373131; asc 660120180222005132538438624711; (total 31 bytes);
	PHYSICAL RECORD: n_fields 6; 1-byte offsets; info bits 0
	 0: len 4; hex 0000098a; asc     ;;
	 1: len 1; hex 00; asc  ;;
	 2: len 4; hex 00079cd7; asc     ;;
	 3: len 16; hex 000001010dfe000300210c0f00608021; asc          !   ` !;;
	 4: len 1; hex 31; asc 1;;
	 5: len 30; hex 363630313230313830323232303035313332353338343338363234373131; asc 660120180222005132538438624711; (total 31 bytes);
	DATA TUPLE: 3 fields;
	 0: len 4; hex 000009a0; asc     ;;
	 1: len 1; hex 00; asc  ;;
	 2: len 4; hex 00000707; asc     ;;
	PHYSICAL RECORD: n_fields 7; 1-byte offsets; info bits 32
	 0: len 4; hex 000009a0; asc     ;;
	 1: len 1; hex 00; asc  ;;
	 2: len 4; hex 00000707; asc     ;;
	 3: len 22; hex 01ca01010c0f000c0021860a00030000860800088000; asc          !            ;;
	 4: len 4; hex 4c494c49; asc LILI;;
	 5: len 3; hex 8fc474; asc   t;;
	 6: len 8; hex 8000000000350b6e; asc      5 n;;
	2018-03-21T03:10:03.702262+08:00 0 [ERROR] [FATAL] InnoDB: Failed to restore ibuf position.
	2018-03-21 03:10:03 0x7f4d239f7700  InnoDB: Assertion failure in thread 139969286862592 in file ut0ut.cc line 916
	InnoDB: We intentionally generate a memory trap.
	InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
	InnoDB: If you get repeated assertion failures or crashes, even
	InnoDB: immediately after the mysqld startup, there may be
	InnoDB: corruption in the InnoDB tablespace. Please refer to
	InnoDB: http://dev.mysql.com/doc/refman/5.7/en/forcing-innodb-recovery.html
	InnoDB: about forcing recovery.
	19:10:03 UTC - mysqld got signal 6 ;
	This could be because you hit a bug. It is also possible that this binary
	or one of the libraries it was linked against is corrupt, improperly built,
	or misconfigured. This error can also be caused by malfunctioning hardware.
	Attempting to collect some information that could help diagnose the problem.
	As this is a crash and something is definitely wrong, the information
	collection process might fail.

	key_buffer_size=16777216
	read_buffer_size=2097152
	max_used_connections=79
	max_threads=3000
	thread_count=30
	connection_count=12
	It is possible that mysqld could use up to 
	key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 12344415 K  bytes of memory
	Hope that's ok; if not, decrease some variables in the equation.

	Thread pointer: 0x0
	Attempting backtrace. You can use the following information to find out
	where mysqld died. If you see no messages after this, something went
	terribly wrong...
	stack_bottom = 0 thread_stack 0x40000
	/usr/local/mysql/bin/mysqld(my_print_stacktrace+0x35)[0xf1dff5]
	/usr/local/mysql/bin/mysqld(handle_fatal_signal+0x4a4)[0x79d3b4]
	/lib64/libpthread.so.0[0x358c00f7e0]
	/lib64/libc.so.6(gsignal+0x35)[0x358bc325e5]
	/lib64/libc.so.6(abort+0x175)[0x358bc33dc5]
	/usr/local/mysql/bin/mysqld[0x1159d65]
	/usr/local/mysql/bin/mysqld(_ZN2ib5fatalD1Ev+0xb3)[0x115e8b3]
	/usr/local/mysql/bin/mysqld[0x102b4d1]
	/usr/local/mysql/bin/mysqld[0x102f531]
	/usr/local/mysql/bin/mysqld(_Z29ibuf_merge_or_delete_for_pageP11buf_block_tRK9page_id_tPK11page_size_tm+0xb09)[0x1033b29]
	/usr/local/mysql/bin/mysqld(_Z20buf_page_io_completeP10buf_page_tb+0x4e1)[0x11a59a1]
	/usr/local/mysql/bin/mysqld(_Z12fil_aio_waitm+0x15b)[0x1200afb]
	/usr/local/mysql/bin/mysqld(io_handler_thread+0xc8)[0x110db48]
	/lib64/libpthread.so.0[0x358c007aa1]
	/lib64/libc.so.6(clone+0x6d)[0x358bce8aad]
	The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
	information that should help you find out what is causing the crash.

How to repeat:
The test environment
	slave
	Centos 7.4
	mysql5.7.19
	mysql> show variables like '%slave_para%';
	+------------------------+---------------+
	| Variable_name | Value |
	+------------------------+---------------+
	| slave _parallel_ type | LOGICAL _CLOCK |
	| slave_parallel _workers | 8 |
	+------------------------+---------------+

	select count(*) from sbtest1;
	500000

master perform
	UPDATE sbtest1 SET c=REPEAT(UUID(),2) where id<100000;

slave error_log

	Note] Multi-threaded slave: Coordinator has waited 208341 times hitting slave_pending_jobs_size_max; current event size = 8044
	Note] Multi-threaded slave: Coordinator has waited 208351 times hitting slave_pending_jobs_size_max; current event size = 8044
	Note] Multi-threaded slave: Coordinator has waited 208361 times hitting slave_pending_jobs_size_max; current event size = 8044
	Note] Multi-threaded slave: Coordinator has waited 208371 times hitting slave_pending_jobs_size_max; current event size = 8044

  	There are a lot of logs in similar production environments from the library, but there is no simulation of automatic restart from the library.

Suggested fix:
        Parallel replication from the library should not occur automatically restart, although the master has a large transaction.

stack trace

	nm -D -n /usr/local/mysql/bin/mysqld>/tmp/mysqld.sym

	resolve_stack_dump -s /tmp/mysqld.sym -n /tmp/err0418.log |c++filt | less

	0xf1dff5 my_print_stacktrace + 53
	0x79d3b4 handle_fatal_signal + 1188
	0x358c00f7e0 _end + -1978652160
	0x358bc325e5 _end + -1982703611
	0x358bc33dc5 _end + -1982697499
	0x1159d65 ut_dbg_assertion_failed(char const*, char const*, unsigned long) + 170
	0x115e8b3 ib::fatal::~fatal() + 179
	0x102b4d1 ibuf_print(_IO_FILE*) + 881
	0x102f531 ibuf_update_free_bits_low(buf_block_t const*, unsigned long, mtr_t*) + 3905
	0x1033b29 ibuf_merge_or_delete_for_page(buf_block_t*, page_id_t const&, page_size_t const*, unsigned long) + 2825
	0x11a59a1 buf_page_io_complete(buf_page_t*, bool) + 1249
	0x1200afb fil_aio_wait(unsigned long) + 347
	0x110db48 io_handler_thread + 200
	0x358c007aa1 _end + -1978684223
	0x358bce8aad _end + -1981956915

Thank you for the bug report. Please always when reporting a bug please test first the latest released version now it's 5.7.22. Try it thanks.

I am seeing the same exact stack trace and error message regarding the ibuf cursor position on 5.7.22

hi,Owen Owen
I want to know, are you reappearing yourself, or, in fact, have you met ibuf cursor position on 5.7.22. thanks!

Hi,
I spent quite a while testing this with 5.7.24 and 8.0.13 and I can't reproduce it. 
 - have you had this happen to you with latest available versions?
 - can you share more data about how this happens to you as I'm not able to reproduce this reliably (nor at all)

thanks
Bogdan

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".