Bug #83079 long semaphore wait often resulting in crash, DICT_SYS mutex held by statistics
Submitted: 21 Sep 2016 13:16 Modified: 24 Apr 2017 4:45
Reporter: Mikhail Izioumtchenko (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S2 (Serious)
Version:5.7 OS:Any
Assigned to: CPU Architecture:Any

[21 Sep 2016 13:16] Mikhail Izioumtchenko
Description:
mysqld starts reporting 
2016-09-13T06:00:28.278692Z 0 [Warning] InnoDB: A long semaphore wait:
with a particular wait pattern, see below.

This can go on for a few hours and then may end up with an abort 
on InnoDB: Error: semaphore wait has lasted > 600 seconds
or suddenly dissipate. 
Either way performance is degraded.

Happens for 5.6 and 5.7 alike.
The wait pattern is like this (5.7.13):

--Thread 140220420429568 has waited at ha_innodb.cc line 5510 for 175.00 seconds the semaphore:
Mutex at 0x96eeca28, Mutex DICT_SYS created dict0dict.cc:1173, lock var 1
--Thread 140235482846976 has waited at ha_innodb.cc line 5510 for 55.00 seconds the semaphore:
Mutex at 0x96eeca28, Mutex DICT_SYS created dict0dict.cc:1173, lock var 1
...
  many more like this
...
--Thread 140237898499840 has waited at srv0srv.cc line 1968 for 101.00 seconds the semaphore:
X-lock on RW-latch at 0x96eec988 created in file dict0dict.cc line 1184
a writer (thread id 140237877520128) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 853
Last time write locked in file /export/home/pb2/build/sb_0-19387208-1466445627.43/rpm/BUILD/mysql-5.7.13-SR12884268831/mysql-5.7.13-SR12884268831/storage/innobase/dict/dict0stats.cc line 2366

That last item would imply that the problem is in acquiring dictionary mutex and rw latch for the duration of dict_stats_save() so should only happen with persistent statistics enabled.

The workaround (unacceptable) then would be to not use persistent InnoDB statistics.

How to repeat:
no repeatable testcase but happens often, I have some 10 examples in 09/2016 and about 100 in the entire 2016, this only counts crashes.

Suggested fix:
try not to acquire dictionary mutex and x-lock on RW dictionary latch
in dict_stats_save()
[21 Sep 2016 13:23] Mikhail Izioumtchenko
DICT_SYS mutex is used for common things such as opening the table.
So if dict_stats_save() runs into problems such as having to wait for something
in the buffer pool everything will stop.
I think there is no reason for dict_stats_save() to hold anything much.
Maybe something like this can be implemented:
* client threads queue statistics update requests to the statistic thread and the latter do it in some standard MySQL way as opposed to the current usage of the internal InnoDB SQL interpreter.
[5 Dec 2016 15:45] MySQL Verification Team
Fully verified.

This bug was experienced on many occasions, but more about test case in another comment.
[3 Apr 2017 5:22] Aakanksha Verma
Requested the reporter for the reproducible scenario.
[6 Apr 2017 8:59] Aakanksha Verma
Thanks for the update from the customer. But it would be really helpful if we
could also get the core file. Request the reporter to kindly update the core
file to the bug page.
[6 Apr 2017 9:17] Naga Satyanarayana Bodapati
Posted by developer:
 

from log sent:

Last time read locked in file row0purge.cc line 853
Last time write locked in file /export/home/pb2/build/sb_0-19387208-1466445627.43/rpm/BUILD/mysql-5.7.13-SR12884268831/mysql-5.7.13-SR12884268831/storage/innobase/dict/dict0stats.cc line 2366

Some observations:

dict_stats_save() has taken dict_operation_lock and it is also might be waiting for dict_sys mutex.
2366         rw_lock_x_lock(dict_operation_lock);
2367         mutex_enter(&dict_sys->mutex);
2368 

So the conclusion that stats is holding dict_sys mutex could be wrong.

Is it possible to generate the core with debug build? We have debug structures in rw_lock which tells us
who is holding it.

Will greatly help us in fixing this bug quickly.
[7 Apr 2017 14:58] MySQL Verification Team
Mikhail,

As you can see, our Development is unable to repeat the bug. Simply, there are too many different options of acquiring and releasing the locks that lead to this particular, very long, semaphore wait.

Thank you very much in advance !!!
[24 Apr 2017 4:45] Aakanksha Verma
Closing the bug as it is no more reproducible.
[26 May 2017 16:29] Stephan Lueckl
We can reproduce this bug on our development system.
(we have about 40 databases with 500 tables each in our environment)
Happens all the time when we import a database on a running system.

We tried to reproduce it with a compiled debug-version of mysql as you wanted.
But in this case the database seems to be too slow - the bug isn't reproduceable anymore.

Any suggestions? What else can we do?
[29 May 2017 8:33] Stephan Lueckl
It is quite easy to reproduce it.

You only have to reimport a db structure dump (without data) over a existing database - run it with 2 processes simultaneously and the database will crash

We have uploaded a coredump, our config, and the error log.
[29 May 2017 10:00] MySQL Verification Team
@@[29 May 8:33] Stephan Lueckl
Please attach here using the Tab files (private if you wish) the zip file of the db structure dump file and my.cnf. I couldn't repeat with my own db. Thanks in advance.
[29 May 2017 10:10] Stephan Lueckl
[29 May 10:00] Miguel Solorzano
We added the 2 files as requested.
[16 Oct 2017 19:16] Paul Nylund
This happens constantly when I try to drop a lot of databases.

mysql -uroot -pXXXX -e "show databases" |grep -v sys| grep -v Database | grep -v mysql| grep -v information_schema| gawk '{print "drop database " $1 ";select sleep(0.5);"}' | mysql -uroot -pXXXX

It drops random number of databases until mysql just hangs and reboots.

From mysql.err log:
2017-10-16T11:33:16.238249Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140574724179712 has waited at row0row.cc line 1073 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7fdcd8e5eb20 created in file buf0buf.cc line 1460
a writer (thread id 140574707394304) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: ffffffffdfffffff
Last time read locked in file row0row.cc line 1073
Last time write locked in file /build/mysql-5.7-ESHnMz/mysql-5.7-5.7.19/storage/innobase/btr/btr0cur.cc line 1870
2017-10-16T11:33:16.238378Z 0 [Warning] InnoDB: A long semaphore wait:
[18 Oct 2017 8:44] Stephan Lueckl
Same here -> we can reproduce  the bug too and got it verified in #86552, but since june silentium... :(
[13 Dec 2017 15:38] Stephan Lueckl
Fix for this problem found!

Changing the variable innodb_purge_threads from it’s new default value of 4, to the 5.6 default value of 1 to avoid purge thread contention

innodb_purge_threads=1

Source: https://blog.pythian.com/mysql-crashes-ddl-statement-lesson-purge-threads/
[22 Aug 2018 16:07] Ryan Brothers
I've been running into this issue in 5.7 too and can reproduce it.  I'm trying innodb_purge_threads = 1 now, but since innodb_purge_threads = 4 is the default in 5.7 and 8.0, is there still an issue to be fixed?  It's very easy to reproduce by running concurrent DDL.
[22 Aug 2018 16:20] Stephan Lueckl
I have the info that this bug was fixed in the 5.7.23, 8.0.12 release
[18 Dec 2018 10:42] Anton Ravich
We had faced this bug on 5.7.23 (AWS RDS service) with innodb_purge_threads                     = 1 . Can we reopen this bug ?
[10 Jan 2019 14:59] Przemyslaw Malkowski
Isn't this one related actually to https://bugs.mysql.com/bug.php?id=91977 where AHI is contributing to the problem?