MySQL Bugs: #89534: Crash during innodb recovery when working with encryption

Bug #89534	Crash during innodb recovery when working with encryption
Submitted:	5 Feb 2018 13:06	Modified:	6 Feb 2018 13:42
Reporter:	Robert Golebiowski	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S1 (Critical)
Version:	5.7.21	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
There is a crash due to possible mistake in InnoDB crash recovery design. The problem occurs when crash recovery is working with encrypted table, but possibly is not limited only to encrypted tables. This problem is visible in innodb.table_encrypt_kill testcase crash, however, this testcase crash rarely. 

How to repeat:
The crash can be observed in the following scenario:

- Start server with keyring plugin loaded in.
- Create an encrypted table t1, run bunch of inserts, updates, deletes on it – like innodb.table_encrypt_kill testcase does.
- Restart the server without keyring.
- We should receive warnings that t1 cannot be loaded by innodb – due to lack of master key.
- It can so happen that some of the records from t1 will be marked to delete in undo tablespace.
- Purge threads will try to access table t1, however, t1 will not be loaded into fil_system→spaces. There will be crash on this statement :
mark_for_truncate = fil_space_get_by_id(id)->is_being_truncated;
(see the stack trace below)

The stack trace:

08:50:09 #0  0x00007f6f9c2f80c1 in __pthread_kill (threadid=<optimized out>, signo=11) at ../nptl/sysdeps/unix/sysv/linux/pthread_kill.c:61
08:50:09 #1  0x00000000018ce795 in my_write_core (sig=11) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/mysys/stacktrace.c:249
08:50:09 #2  0x0000000000ef3285 in handle_fatal_signal (sig=11) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/sql/signal_handler.cc:223
08:50:09 #3  <signal handler called>
08:50:09 #4  0x0000000001ca968f in fil_space_is_being_truncated (id=24) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/fil/fil0fil.cc:1636
08:50:09 #5  0x0000000001b011c4 in row_purge_parse_undo_rec (node=0x3a74050, undo_rec=0x7f6f64010ac0 "\025\220\016", updated_extern=0x7f6f7affcdbe, thr=0x3a73f88) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/row/row0purge.cc:875
08:50:09 #6  0x0000000001b0191b in row_purge (node=0x3a74050, undo_rec=0x7f6f64010ac0 "\025\220\016", thr=0x3a73f88) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/row/row0purge.cc:1046
08:50:09 #7  0x0000000001b01b8d in row_purge_step (thr=0x3a73f88) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/row/row0purge.cc:1128
08:50:09 #8  0x0000000001a85ba0 in que_thr_step (thr=0x3a73f88) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/que/que0que.cc:1066
08:50:09 #9  0x0000000001a85da2 in que_run_threads_low (thr=0x3a73f88) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/que/que0que.cc:1128
08:50:09 #10 0x0000000001a85f57 in que_run_threads (thr=0x3a73f88) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/que/que0que.cc:1168
08:50:09 #11 0x0000000001b5317b in srv_task_execute () at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/srv/srv0srv.cc:2842
08:50:09 #12 0x0000000001b533ca in srv_worker_thread (arg=0x0) at /mnt/workspace/mysql-5.7-param/BUILD_TYPE/debug/Host/min-jessie-x64/storage/innobase/srv/srv0srv.cc:2904
08:50:09 #13 0x00007f6f9c2f3064 in start_thread (arg=0x7f6f7affd700) at pthread_create.c:309
08:50:09 #14 0x00007f6f9a53e62d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
Testcase innodb.table_encrypt_kill

Suggested fix:
Skipping undo record that belongs to encrypted table that cannot be decrypted seems dangerous as undo tablespace should get truncated. I am not sure how deep the implication of skipping undo record could be.

Maybe recovery should stop if keyring for encrypted table is not available and only be done if some –force_recover option is used – discarding undo records for encrypted tables.

Hello Robert,

Thank you for the report.

Thanks,
Umesh

5.7.21 - test results

Attachment: 89534.results (application/octet-stream, text), 30.15 KiB.

This looks quite serious. Any updates on this bug?

we can adding check for return value of 'fil_space_get_by_id(id)' before we use it.

suggest fix as following:

fil_space_is_being_truncated(
	ulint id)
{
	bool	mark_for_truncate;
	mutex_enter(&fil_system->mutex);
	const fil_space_t* space = fil_space_get_by_id(id);
	mark_for_truncate = space && space->is_being_truncated;
	mutex_exit(&fil_system->mutex);
	return(mark_for_truncate);
}