MySQL Bugs: #90742: SIGHUP cause mysql server crash.

Bug #90742	SIGHUP cause mysql server crash.
Submitted:	4 May 2018 3:42	Modified:	3 Oct 2018 1:47
Reporter:	Seunguck Lee	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Connection Handling	Severity:	S1 (Critical)
Version:	5.7.22, 8.0.11	OS:	CentOS (Linux mysql_server 3.10.0-693.21.1.el7.x86_64 #1 SMP Wed Mar 7 19:03:37 UTC 2018 x86_64 x86_)
Assigned to:		CPU Architecture:	x86
Tags:	crash, HUP, SIGHUP

Description:
MySQL server receive a SIGHUP signal, it will flush log files and refresh several cache including thread cache.

But sometimes,  creating new thread(handle_connection() --> init_new_thd()) and refreshing thread cache(I am not sure) cause segmentation fault.

----- mysqld error log -----
Events status:
LLA = Last Locked At  LUA = Last Unlocked At
WOC = Waiting On Condition  DL = Data Locked

Event scheduler status:
State      : INITIALIZED
Thread id  : 0
LLA        : n/a:0
LUA        : n/a:0
WOC        : NO
Workers    : 0
Executed   : 0
Data locked: NO

Event queue status:
Element count   : 0
Data locked     : NO
Attempting lock : NO
LLA             : init_queue:96
LUA             : init_queue:104
WOC             : NO
Next activation : never
2018-05-04T03:31:29.534461-00:00 0 [Warning] 'user' entry 'root@localhost' ignored in --skip-name-resolve mode.
2018-05-04T03:31:29.534489-00:00 0 [Warning] 'user' entry 'mysql.sys@localhost' ignored in --skip-name-resolve mode.
2018-05-04T03:31:29.534511-00:00 0 [Warning] 'db' entry 'sys mysql.sys@localhost' ignored in --skip-name-resolve mode.
2018-05-04T03:31:29.534526-00:00 0 [Warning] 'proxies_priv' entry '@ root@localhost' ignored in --skip-name-resolve mode.
2018-05-04T03:31:29.534818-00:00 0 [Warning] 'tables_priv' entry 'sys_config mysql.sys@localhost' ignored in --skip-name-resolve mode.
[Warning]
[Warning]
[Warning]
[Warning]
03:31:29 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=33554432
read_buffer_size=131072
max_used_connections=21
max_threads=6000
thread_count=16
connection_count=20
It is possible that mysqld could use up to
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 1073018 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
./bin/mysqld(my_print_stacktrace+0x2c)[0xec10bc]
./bin/mysqld(handle_fatal_signal+0x451)[0x7b32f1]
/lib64/libpthread.so.0(+0xf5e0)[0x7f84c57d85e0]
./bin/mysqld(handle_connection+0x141)[0xd5ebd1]
./bin/mysqld(pfs_spawn_thread+0x1b4)[0x1223d74]
/lib64/libpthread.so.0(+0x7e25)[0x7f84c57d0e25]
/lib64/libc.so.6(clone+0x6d)[0x7f84c428d34d]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.
----- mysqld error log -----

How to repeat:
>> TERMINAL-1
$ mysqlslap -h mysql_server -uroot --delimiter=";" --query="select 1;" --concurrency=10 --iterations=100000000 --no-drop

>> TERMINAL-2
$ while true :; do kill -SIGHUP `pidof mysqld`; sleep 1; done

==> Looks like this is timing issue, so it's not reproduced immediately. But it will happen in a several minutes.

Hello Seunguck Lee,

Thank you for the report and feedback.

Thanks,
Umesh

I am able to reproduce this bug with only using concurrency=1 with mysqldslap and MySQL 5.7.22.

>> TERMINAL-1:
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_22$ mysqld_pid=$(pidof mysqld); while date; do kill -SIGHUP "$mysqld_pid" || break; sleep 1; done
Tue Aug 14 12:38:35 UTC 2018
Tue Aug 14 12:38:36 UTC 2018
Tue Aug 14 12:38:37 UTC 2018
Tue Aug 14 12:38:38 UTC 2018
Tue Aug 14 12:38:39 UTC 2018
Tue Aug 14 12:38:40 UTC 2018
bash: kill: (20599) - No such process

>> TERMINAL-2:
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_22$ ~/opt/mysql/5.7.22/bin/mysqlslap --defaults-file=./my.sandbox.cnf --query="select 1;" --concurrency=1 --iterations=100000000
Tue Aug 14 12:38:37 UTC 2018
^C

>> MySQL Error Logs:
[...]
Version: '5.7.22'  socket: '/tmp/mysql_sandbox5722.sock'  port: 5722  MySQL Community Server (GPL)
[...]
Event queue status:
Element count   : 0
Data locked     : NO
Attempting lock : NO
LLA             : init_queue:96
LUA             : init_queue:104
WOC             : NO
Next activation : never
[Warning] Server shutdown in progress
12:38:39 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=3
max_threads=151
thread_count=1
connection_count=3
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68195 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something wentterribly wrong...
stack_bottom = 0 thread_stack 0x40000
/home/jgagne/opt/mysql/5.7.22/bin/mysqld(my_print_stacktrace+0x35)[0xf4b6d5]
/home/jgagne/opt/mysql/5.7.22/bin/mysqld(handle_fatal_signal+0x4a4)[0x7d0d74]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f7009d550c0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt15_List_node_base4hookEPS_+0xf)[0x7f700901167f]
/home/jgagne/opt/mysql/5.7.22/bin/mysqld(_ZN29Per_thread_connection_handler40check_idle_thread_and_enqueue_connectionEP12Channel_info+0xc1)[0xded491]
/home/jgagne/opt/mysql/5.7.22/bin/mysqld(_ZN29Per_thread_connection_handler14add_connectionEP12Channel_info+0x13)[0xdeddc3]
/home/jgagne/opt/mysql/5.7.22/bin/mysqld(_ZN26Connection_handler_manager22process_new_connectionEP12Channel_info+0x66)[0x7ef186]
/home/jgagne/opt/mysql/5.7.22/bin/mysqld(_Z11mysqld_mainiPPc+0x1464)[0x7cb394]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f70086d72e1]
/home/jgagne/opt/mysql/5.7.22/bin/mysqld[0x7c0ed9]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Same with MySQL 5.7.23:

>> TERMINAL-1:
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_23$ mysqld_pid=$(pidof mysqld); while date; do kill -SIGHUP "$mysqld_pid" || break; sleep 1; done
Tue Aug 14 12:48:24 UTC 2018
Tue Aug 14 12:48:26 UTC 2018
Tue Aug 14 12:48:27 UTC 2018
Tue Aug 14 12:48:28 UTC 2018
Tue Aug 14 12:48:29 UTC 2018
Tue Aug 14 12:48:30 UTC 2018
Tue Aug 14 12:48:31 UTC 2018
bash: kill: (20196) - No such process

>> TERMINAL-2:
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_23$ date; ~/opt/mysql/5.7.23/bin/mysqlslap --defaults-file=./my.sandbox.cnf --query="select 1;" --concurrency=1 --iterations=100000000
Tue Aug 14 12:48:26 UTC 2018
^C

>> MySQL Error Logs:
[...]
Version: '5.7.23'  socket: '/tmp/mysql_sandbox5724.sock'  port: 5724  MySQL Community Server (GPL)
[...]
Event queue status:
Element count   : 0
Data locked     : NO
Attempting lock : NO
LLA             : init_queue:96
LUA             : init_queue:104
WOC             : NO
Next activation : never
[Warning] Server shutdown in progress
12:48:30 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=5
max_threads=151
thread_count=2
connection_count=4
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68195 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(my_print_stacktrace+0x35)[0xf4e5e5]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(handle_fatal_signal+0x4a4)[0x7d1b64]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7f39a7f690c0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt15_List_node_base4hookEPS_+0xf)[0x7f39a722567f]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_ZN29Per_thread_connection_handler40check_idle_thread_and_enqueue_connectionEP12Channel_info+0xc1)[0xdeeda1]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_ZN29Per_thread_connection_handler14add_connectionEP12Channel_info+0x13)[0xdef6d3]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_ZN26Connection_handler_manager22process_new_connectionEP12Channel_info+0x66)[0x7f02a6]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_Z11mysqld_mainiPPc+0x146d)[0x7cc17d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7f39a68eb2e1]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld[0x7c1cb9]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

I am also able to reproduce without mysqlslap, with MySQL 5.7.23, and only with the mysql client.  Strangely, the crash does not happen at the 1st instance of the mysql client disconnection, but at the 2nd.  Also note the "Server shutdown in progress" in the error logs that looks related to the 1st disconnection.  Without the reconnection (\r), the problem is not reproduced.

>> TERMINAL-1:
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_23$ mysqld_pid=$(pidof mysqld); while date; do kill -SIGHUP "$mysqld_pid" || break; sleep 1; done
Tue Aug 14 13:04:31 UTC 2018
Tue Aug 14 13:04:32 UTC 2018
Tue Aug 14 13:04:33 UTC 2018
Tue Aug 14 13:04:34 UTC 2018
Tue Aug 14 13:04:35 UTC 2018
Tue Aug 14 13:04:36 UTC 2018
Tue Aug 14 13:04:37 UTC 2018
Tue Aug 14 13:04:38 UTC 2018
Tue Aug 14 13:04:39 UTC 2018
bash: kill: (32313) - No such process

>> TERMINAL-2:
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_23$ date; yes "select connection_id(); \\r" | head -n 100000 | ./use -N | tail
Tue Aug 14 13:04:33 UTC 2018
ERROR 2013 (HY000) at line 1494: Lost connection to MySQL server at 'reading initial communication packet', system error: 95
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
jgagne@jgagne-mysql-tests-01:~/sandboxes/msb_5_7_23$ date; yes "select connection_id(); \\r" | head -n 100000 | ./use -N | tail
Tue Aug 14 13:04:38 UTC 2018
ERROR 2013 (HY000) at line 2: Lost connection to MySQL server at 'reading initial communication packet', system error: 95
1496
1497

>> MySQL Error Logs:
[...]
Version: '5.7.23'  socket: '/tmp/mysql_sandbox5724.sock'  port: 5724  MySQL Community Server (GPL)
[...]
[Warning] Server shutdown in progress
[...]
Event queue status:
Element count   : 0
Data locked     : NO
Attempting lock : NO
LLA             : init_queue:96
LUA             : init_queue:104
WOC             : NO
Next activation : never
13:04:38 UTC - mysqld got signal 11 ;
This could be because you hit a bug. It is also possible that this binary
or one of the libraries it was linked against is corrupt, improperly built,
or misconfigured. This error can also be caused by malfunctioning hardware.
Attempting to collect some information that could help diagnose the problem.
As this is a crash and something is definitely wrong, the information
collection process might fail.

key_buffer_size=8388608
read_buffer_size=131072
max_used_connections=2
max_threads=151
thread_count=0
connection_count=2
It is possible that mysqld could use up to 
key_buffer_size + (read_buffer_size + sort_buffer_size)*max_threads = 68195 K  bytes of memory
Hope that's ok; if not, decrease some variables in the equation.

Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(my_print_stacktrace+0x35)[0xf4e5e5]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(handle_fatal_signal+0x4a4)[0x7d1b64]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x110c0)[0x7fdc5b3460c0]
/usr/lib/x86_64-linux-gnu/libstdc++.so.6(_ZNSt15_List_node_base4hookEPS_+0xf)[0x7fdc5a60267f]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_ZN29Per_thread_connection_handler40check_idle_thread_and_enqueue_connectionEP12Channel_info+0xc1)[0xdeeda1]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_ZN29Per_thread_connection_handler14add_connectionEP12Channel_info+0x13)[0xdef6d3]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_ZN26Connection_handler_manager22process_new_connectionEP12Channel_info+0x66)[0x7f02a6]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld(_Z11mysqld_mainiPPc+0x146d)[0x7cc17d]
/lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf1)[0x7fdc59cc82e1]
/home/jgagne/opt/mysql/5.7.23/bin/mysqld[0x7c1cb9]
The manual page at http://dev.mysql.com/doc/mysql/en/crashing.html contains
information that should help you find out what is causing the crash.

Two more things (sorry for the spam):

1) after the 1st disconnection, if I try to shutdown mysqld, the shutdown process hang (it looks like there is something corrupted in mysqld state).

2) if I disable the thread cache before the test (set global thread_cache_size = 0;), I am not able to reproduce the problem.

So a workaround to this could be to disable the thread cache.

Ok, after reading the code, I have a better understanding of what is happening.

When there are threads available in the the Thread Cache and on a connection establishment, a channel_info is pushed on waiting_channel_info_list, wake_pthread is incremented and the blocked threads are woken up via COND_thread_cache (all this is in check_idle_thread_and_enqueue_connection of sql/conn_handler/connection_handler_per_thread.cc).

But if above is happening while the Thread Cache is being purged (kill_blocked_pthreads_flag > 0), the woken-up thread will not consume the channel_info but will exit (in block_until_new_connection). This will leave waiting_channel_info_list with connections to be handled without threads in the cache.

After all blocked threads are on their way to exit, kill_blocked_pthreads will close all enqueued channel_info. This is what is causing the 1st "Lost connection" and the line in the logs about the shutdown in my reproduction instruction with mysql client. Closing this connection is obviously wrong, but this is not the source of the crash.

The source of the crash is that after all connection are closed in kill_blocked_pthreads, wake_pthread is still greater than 0 but there are no items in waiting_channel_info_list. Once a running thread will go to the cache (in block_until_new_connection), because wake_pthread is greater than zero, it will think there is work to be done and will pop waiting_channel_info_list. However, as the list has been emptied in kill_blocked_pthreads, this is getting a bogus pointer and causing the crash.

Actually, it is a little more complicated than explained in my last comment.

The wake_pthread greater than one after the purge is right.  But it is not the 1st disconnection after the purge that crashes things (my reproduction was showing that it is the 3rd connection).

So the 1st disconnection puts a thread in the cache and this thread consumes wake_pthread.  But as the list front is NULL, it will exits.  However, before exiting, it will pop the empty list, which corrupts the data structure.

The next connection does not crash as the Thread Cache is empty, so a connection is created.  Its disconnection will however put a thread in the cache.

On the next connection, as there is a thread in the cache, the channel_info will be pushed on a corrupted data structure (corrupted by the 1st disconnection), which cause the SIGSEGV.

Posted by developer:
 
Fixed in 8.0.14.

Server mishandling of SIGHUP signals could result in a server exit.

This has been closed as fixed in 8.0.14, but it also affect 5.7.

Can we hope for a fix in 5.7 ?

Posted by developer:
 
Fixed in 5.7.25.