MySQL Bugs: #109016: Slave crashed when master restart

Bug #109016	Slave crashed when master restart
Submitted:	7 Nov 2022 11:51	Modified:	8 Nov 2022 15:55
Reporter:	Weijie Kong
Status:	Verified
Category:	MySQL Server: Replication	Severity:	S3 (Non-critical)
Version:	5.7.38	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Phenomenon:
A debug-build slave may crash while replicating and applying binlogs from the master if the master restarts.
Slave-config:
GTID mode is on, MTS is on, debug build.
Master-config:
ROW format binlog is used.
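For reference, the settings above might correspond to my.cnf fragments like the following. This is only a sketch: the worker count, server IDs, and binlog base name are assumed values that the report does not state, and GTID mode on the master is assumed as well. binlog_rows_query_log_events is listed because Rows_query events are only written to the binlog when it is enabled.

```ini
# Slave my.cnf (debug build)
[mysqld]
server_id                = 2          # assumed value
gtid_mode                = ON
enforce_gtid_consistency = ON
slave_parallel_workers   = 4          # assumed value; the report only says MTS is on

# Master my.cnf
[mysqld]
server_id                    = 1          # assumed value
log_bin                      = mysql-bin  # assumed base name
gtid_mode                    = ON         # assumed; the report only states this for the slave
enforce_gtid_consistency     = ON
binlog_format                = ROW
binlog_rows_query_log_events = ON         # required for Rows_query events to be logged
```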
Stack:
#0  0x00007f977d3f2387 in raise () from /lib64/libc.so.6
#1  0x00007f977d3f3a78 in abort () from /lib64/libc.so.6
#2  0x00007f977d3eb1a6 in __assert_fail_base () from /lib64/libc.so.6
#3  0x00007f977d3eb252 in __assert_fail () from /lib64/libc.so.6
#4  0x00000000017c8236 in Rows_query_log_event::do_apply_event (this=0x7f96bc237970, rli=
   0x7f96bc0244c0) at /soft/mysql/source/mysql/sql/log_event.cc:13372
#5  0x00000000017d0fc0 in Log_event::do_apply_event_worker (this=0x7f96bc237970,
   w=0x7f96bc0244c0) at /soft/mysql/source/mysql/sql/log_event.cc:792
#6  0x0000000001847b22 in Slave_worker::slave_worker_exec_event (this=0x7f96bc0244c0,
   ev=0x7f96bc237970) at /soft/mysql/source/mysql/sql/rpl_rli_pdb.cc:1866
#7  0x0000000001849de7 in slave_worker_exec_job_group (worker=0x7f96bc0244c0, rli=0x4e15ea0)
   at /soft/mysql/source/mysql/sql/rpl_rli_pdb.cc:2705
#8  0x00000000018225fe in handle_slave_worker (arg=0x7f96bc0244c0)
   at /soft/mysql/source/mysql/sql/rpl_slave.cc:6266
#9  0x0000000001d3461e in pfs_spawn_thread (arg=0x7f96bc02d2d0)
   at /soft/mysql/source/mysql/storage/perfschema/pfs.cc:2197
#10 0x00007f977f179ea5 in start_thread () from /lib64/libpthread.so.0
#11 0x00007f977d4bab0d in clone () from /lib64/libc.so.6
Reason:
A master restart may cause the slave I/O thread to reconnect, leaving a partial transaction in the relay log on the slave. When GTID mode is on, the binlog for this transaction is fetched again from the beginning and re-applied from the beginning too. If the partial transaction ends in a partial statement (that is, a statement that does not have its complete Rows_query/Table_map/[Update_rows|Delete_rows|Write_rows] sequence), the slave crashes in Rows_query_log_event::do_apply_event when re-applying from the beginning. This is because the pointer rli->rows_query_ev is reset only after a statement has finished (in function rows_event_stmt_cleanup); applying a partial statement never reaches that point, so rli->rows_query_ev is not reset.
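The sequence described above can be illustrated with a small standalone model. This is a sketch, not server code: the Ev and Rli types and the apply_event, apply_all, and scenario_crashes functions are hypothetical stand-ins. It also assumes that the failing debug check is "rli->rows_query_ev must be null when a Rows_query event is applied", which is what this report implies but does not quote.

```cpp
#include <cassert>
#include <vector>

// Hypothetical, simplified event kinds; not the server's Log_event classes.
enum class Ev { Gtid, Begin, RowsQuery, TableMap, WriteRows, Rollback };

// Models only the one Relay_log_info field that matters here.
struct Rli {
    const Ev* rows_query_ev = nullptr;
};

// Returns false where the modeled debug assertion would abort the worker.
// Assumption: the check is "rows_query_ev is null when a Rows_query arrives".
bool apply_event(Rli& rli, const Ev& ev) {
    switch (ev) {
    case Ev::RowsQuery:
        if (rli.rows_query_ev != nullptr) return false;
        rli.rows_query_ev = &ev;
        return true;
    case Ev::WriteRows:
        // Statement end: stands in for rows_event_stmt_cleanup.
        rli.rows_query_ev = nullptr;
        return true;
    default:
        // Rollback lands here too: it leaves the pointer untouched.
        return true;
    }
}

// Applies events in order and stops at the first modeled assertion failure.
bool apply_all(Rli& rli, const std::vector<Ev>& events) {
    for (const Ev& ev : events) {
        if (!apply_event(rli, ev)) return false;
    }
    return true;
}

// Replays the scenario from the report against the model.
bool scenario_crashes() {
    Rli rli;
    // Relay log content cut off by the master restart, then rolled back.
    const std::vector<Ev> partial = {Ev::Gtid, Ev::Begin, Ev::RowsQuery,
                                     Ev::Rollback};
    // The same transaction fetched again from its beginning.
    const std::vector<Ev> refetched = {Ev::Gtid, Ev::Begin, Ev::RowsQuery,
                                       Ev::TableMap, Ev::WriteRows};
    const bool partial_ok = apply_all(rli, partial);
    assert(partial_ok && "the truncated prefix applies without tripping the check");
    (void)partial_ok;
    // The second Rows_query finds the stale pointer and trips the check.
    return !apply_all(rli, refetched);
}
```

In this model a complete statement always clears the pointer at its last rows event, so only a statement that is cut off after its Rows_query event leaves stale state behind.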

How to repeat:
(1) Put a workload on the master.
(2) Connect a debug-build slave to the master.
(3) Kill or shut down the master before the slave I/O thread catches up with it.
(4) Repeat step 3 until the last relay log on the slave contains both a partial transaction and a partial statement (for example, one that has only a Rows_query event).
(5) Start the master.
(6) Wait for the slave to re-apply the partial statement and crash.

Suggested fix:
Call rli->cleanup_context in Query_log_event::do_apply_event if the query is "Rollback"; this code path is executed when an unfinished transaction is rolled back.
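The effect of this suggestion can be sketched with a small standalone model. This is not a patch against the server: Ev, Rli, apply_event, and replay are hypothetical names, the cleanup_on_rollback flag exists only to compare the two behaviors, and the model assumes that cleanup_context clears rows_query_ev and that the failing debug check requires the pointer to be null when a Rows_query event is applied.

```cpp
#include <vector>

// Hypothetical, simplified event kinds; not the server's Log_event classes.
enum class Ev { Gtid, Begin, RowsQuery, TableMap, WriteRows, Rollback };

struct Rli {
    const Ev* rows_query_ev = nullptr;
    // Stand-in for rli->cleanup_context; here it only clears the pointer
    // (assumption: the real function clears it as part of its cleanup).
    void cleanup_context() { rows_query_ev = nullptr; }
};

// Returns false where the modeled debug assertion would abort the worker.
bool apply_event(Rli& rli, const Ev& ev, bool cleanup_on_rollback) {
    switch (ev) {
    case Ev::RowsQuery:
        if (rli.rows_query_ev != nullptr) return false;
        rli.rows_query_ev = &ev;
        return true;
    case Ev::WriteRows:
        // Statement end: stands in for rows_event_stmt_cleanup.
        rli.rows_query_ev = nullptr;
        return true;
    case Ev::Rollback:
        // The suggested fix: clean up when the unfinished transaction
        // is rolled back.
        if (cleanup_on_rollback) rli.cleanup_context();
        return true;
    default:
        return true;
    }
}

// Replays a partial transaction, its rollback, and the refetched transaction.
// Returns true if every event applies without tripping the modeled check.
bool replay(bool cleanup_on_rollback) {
    Rli rli;
    const std::vector<Ev> relay_log = {
        Ev::Gtid, Ev::Begin, Ev::RowsQuery,   // cut off by the master restart
        Ev::Rollback,                         // unfinished transaction rolled back
        Ev::Gtid, Ev::Begin, Ev::RowsQuery,   // same transaction, fetched again
        Ev::TableMap, Ev::WriteRows};
    for (const Ev& ev : relay_log) {
        if (!apply_event(rli, ev, cleanup_on_rollback)) return false;
    }
    return true;
}
```

In this model, replay(false) stops at the second Rows_query event, while replay(true) applies the whole sequence.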

Workload on master was produced by sysbench Write_Only.

Thanks for the report