Bug #90365 mysqld hangs / slave io thread hang in system lock
Submitted: 10 Apr 2018 10:07   Modified: 10 Apr 2018 11:53
Reporter: Georgi Iovchev       Email Updates:
Status: Can't repeat           Impact on me: None
Category: MySQL Server         Severity: S1 (Critical)
Version: 5.6.27                OS: CentOS (7.4)
Assigned to:                   CPU Architecture: Any (3.10.0-693.17.1.el7.x86_64)

[10 Apr 2018 10:07] Georgi Iovchev
Description:
We have an issue in our production environment with MySQL 5.6.27 running on CentOS 7. At random intervals the mysqld process hangs and becomes unresponsive. The issue happens on different servers in our environment. The last time it happened on a non-critical, delayed, slave-only machine with no application connections, only monitoring.

The servers are VMware virtual machine guests running CentOS 7 x64 and MySQL 5.6.27 Community Edition.
When mysqld is in that state I can connect to the server and select from information_schema and performance_schema, but if I query any other database the session hangs. Executing SHOW ENGINE INNODB STATUS or SHOW SLAVE STATUS also hangs the session.
There is nothing in the error log file.
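
To illustrate, a session opened while the server is in this state behaves roughly as follows (the exact statements are a sketch; db1 and ApiActivityLog are the names visible in the process list below):

-- Still responsive:
SELECT * FROM information_schema.processlist;
-- Hang indefinitely (any query touching a user database, plus the engine/slave status commands):
SELECT 1 FROM db1.ApiActivityLog LIMIT 1;
SHOW ENGINE INNODB STATUS;
SHOW SLAVE STATUS;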

Process list looks like this:
ID	USER	HOST	DB	COMMAND	TIME	STATE	INFO
1	system user		NULL	Connect	2501013	System lock	NULL
2	system user		db1	Connect	326681	updating	UPDATE ApiActivityLog SET responseTimestamp = '2018-04-06 15:51:04.009', status = 'SUCCESS', responseBody = ... WHERE (id = 468622878) 
216036	dashboard	192.168.122.121:52394	NULL	Query	283455	init	SHOW SLAVE STATUS
216035	dashboard	192.168.122.121:52392	NULL	Query	283456	init	SHOW SLAVE STATUS
216037	sc_monitor_user	192.168.114.17:43960	NULL	Query	283422	init	SHOW SLAVE STATUS
216039	sc_monitor_user	192.168.114.17:44514	NULL	Query	283356	init	SHOW SLAVE STATUS
...
239303	root	localhost	NULL	Killed	834	init	show engine innodb status
239345	sc_monitor_user	192.168.114.17:45458	NULL	Killed	0	login	NULL
239346	dashboard	192.168.122.121:59290	NULL	Killed	0	login	NULL
239347	dashboard	192.168.122.121:59332	NULL	Killed	0	login	NULL
...
239407	root	localhost	NULL	Query	0	executing	select * from information_schema.processlist

My guess is that this is due to the slave IO thread hanging in the "System lock" state.
Looking at the file timestamps, I see that the last modification time of the ibdata file, the InnoDB logs, the binlogs and the relay logs is the time when mysqld hung.
It looks like all mysqld I/O activity suddenly stopped.

After submitting the bug I will attach files with all the information gathered: the full process list, a gdb backtrace, lsof output and the results of some performance_schema queries.
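
The performance_schema and information_schema snapshots in those attachments come from plain SELECTs along these lines (the exact column lists may have differed):

SELECT * FROM information_schema.processlist;
SELECT * FROM performance_schema.threads;
SELECT * FROM performance_schema.file_instances;
SELECT * FROM performance_schema.events_statements_current;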

How to repeat:
-

Suggested fix:
-
[10 Apr 2018 10:09] Georgi Iovchev
processlist

Attachment: processlist.txt (text/plain), 8.12 KiB.

[10 Apr 2018 10:11] Georgi Iovchev
performance_schema.threads

Attachment: threads.txt (text/plain), 8.33 KiB.

[10 Apr 2018 10:12] Georgi Iovchev
performance_schema.file_instances

Attachment: file_instances.txt (text/plain), 51.01 KiB.

[10 Apr 2018 10:12] Georgi Iovchev
performance_schema.events_statements_current

Attachment: events_statements_current.txt (text/plain), 12.89 KiB.

[10 Apr 2018 10:14] Georgi Iovchev
lsof

Attachment: lsof.txt (text/plain), 43.25 KiB.

[10 Apr 2018 10:14] Georgi Iovchev
gdb backtrace threads

Attachment: gdb_bt.txt (text/plain), 640.62 KiB.

[10 Apr 2018 10:23] MySQL Verification Team
Thank you for taking the time to report a problem.  Unfortunately you
are not using a current version of the product you reported a problem
with (current version is 5.6.39) -- the problem might already be fixed. Please download a new version from http://www.mysql.com/downloads/.

Also, there is no test case provided in the bug report and hence there
is nothing we can verify here.  If you are able to reproduce the bug
with one of the latest versions, please attach the exact reproducible
test case and change the version on this bug report to the version you
tested and change the status back to "Open".  Again, thank you for your
continued support of MySQL.
[10 Apr 2018 11:53] Georgi Iovchev
The problem cannot be reproduced - it happens at random, once a month or two.
I have already upgraded some of the instances, but I cannot be sure whether this fixed the issue because, as I said, it is absolutely random.