MySQL Bugs: #83079: long semaphore wait often resulting in crash, DICT

Bug #83079	long semaphore wait often resulting in crash, DICT_SYS mutex held by statistics
Submitted:	21 Sep 2016 13:16	Modified:	24 Apr 2017 4:45
Reporter:	Mikhail Izioumtchenko (OCA)	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: InnoDB storage engine	Severity:	S2 (Serious)
Version:	5.7	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
mysqld starts reporting 
2016-09-13T06:00:28.278692Z 0 [Warning] InnoDB: A long semaphore wait:
with a particular wait pattern, see below.

This can go on for a few hours and then may end up with an abort 
on InnoDB: Error: semaphore wait has lasted > 600 seconds
or suddenly dissipate. 
Either way performance is degraded.

Happens for 5.6 and 5.7 alike.
The wait pattern is like this (5.7.13):

--Thread 140220420429568 has waited at ha_innodb.cc line 5510 for 175.00 seconds the semaphore:
Mutex at 0x96eeca28, Mutex DICT_SYS created dict0dict.cc:1173, lock var 1
--Thread 140235482846976 has waited at ha_innodb.cc line 5510 for 55.00 seconds the semaphore:
Mutex at 0x96eeca28, Mutex DICT_SYS created dict0dict.cc:1173, lock var 1
...
  many more like this
...
--Thread 140237898499840 has waited at srv0srv.cc line 1968 for 101.00 seconds the semaphore:
X-lock on RW-latch at 0x96eec988 created in file dict0dict.cc line 1184
a writer (thread id 140237877520128) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: 0
Last time read locked in file row0purge.cc line 853
Last time write locked in file /export/home/pb2/build/sb_0-19387208-1466445627.43/rpm/BUILD/mysql-5.7.13-SR12884268831/mysql-5.7.13-SR12884268831/storage/innobase/dict/dict0stats.cc line 2366

That last item would imply that the problem is in acquiring dictionary mutex and rw latch for the duration of dict_stats_save() so should only happen with persistent statistics enabled.

The workaround (unacceptable) then would be to not use persistent InnoDB statistics.

How to repeat:
no repeatable testcase but happens often, I have some 10 examples in 09/2016 and about 100 in the entire 2016, this only counts crashes.

Suggested fix:
try not to acquire dictionary mutex and x-lock on RW dictionary latch
in dict_stats_save()

DICT_SYS mutex is used for common things such as opening the table.
So if dict_stats_save() runs into problems such as having to wait for something
in the buffer pool everything will stop.
I think there is no reason for dict_stats_save() to hold anything much.
Maybe something like this can be implemented:
* client threads queue statistics update requests to the statistic thread and the latter do it in some standard MySQL way as opposed to the current usage of the internal InnoDB SQL interpreter.

Fully verified.

This bug was experienced on many occasions, but more about test case in another comment.

Requested the reporter for the reproducible scenario.

Thanks for the update from the customer. But it would be really helpful if we
could also get the core file. Request the reporter to kindly update the core
file to the bug page.

Posted by developer:
 

from log sent:

Last time read locked in file row0purge.cc line 853
Last time write locked in file /export/home/pb2/build/sb_0-19387208-1466445627.43/rpm/BUILD/mysql-5.7.13-SR12884268831/mysql-5.7.13-SR12884268831/storage/innobase/dict/dict0stats.cc line 2366

Some observations:

dict_stats_save() has taken dict_operation_lock and it is also might be waiting for dict_sys mutex.
2366         rw_lock_x_lock(dict_operation_lock);
2367         mutex_enter(&dict_sys->mutex);
2368 

So the conclusion that stats is holding dict_sys mutex could be wrong.

Is it possible to generate the core with debug build? We have debug structures in rw_lock which tells us
who is holding it.

Will greatly help us in fixing this bug quickly.

Mikhail,

As you can see, our Development is unable to repeat the bug. Simply, there are too many different options of acquiring and releasing the locks that lead to this particular, very long, semaphore wait.

Thank you very much in advance !!!

Closing the bug as it is no more reproducible.

We can reproduce this bug on our development system.
(we have about 40 databases with 500 tables each in our environment)
Happens all the time when we import a database on a running system.

We tried to reproduce it with a compiled debug-version of mysql as you wanted.
But in this case the database seems to be too slow - the bug isn't reproduceable anymore.

Any suggestions? What else can we do?

It is quite easy to reproduce it.

You only have to reimport a db structure dump (without data) over a existing database - run it with 2 processes simultaneously and the database will crash

We have uploaded a coredump, our config, and the error log.

@@[29 May 8:33] Stephan Lueckl
Please attach here using the Tab files (private if you wish) the zip file of the db structure dump file and my.cnf. I couldn't repeat with my own db. Thanks in advance.

[29 May 10:00] Miguel Solorzano
We added the 2 files as requested.

This happens constantly when I try to drop a lot of databases.

mysql -uroot -pXXXX -e "show databases" |grep -v sys| grep -v Database | grep -v mysql| grep -v information_schema| gawk '{print "drop database " $1 ";select sleep(0.5);"}' | mysql -uroot -pXXXX

It drops random number of databases until mysql just hangs and reboots.

From mysql.err log:
2017-10-16T11:33:16.238249Z 0 [Warning] InnoDB: A long semaphore wait:
--Thread 140574724179712 has waited at row0row.cc line 1073 for 241.00 seconds the semaphore:
S-lock on RW-latch at 0x7fdcd8e5eb20 created in file buf0buf.cc line 1460
a writer (thread id 140574707394304) has reserved it in mode  exclusive
number of readers 0, waiters flag 1, lock_word: ffffffffdfffffff
Last time read locked in file row0row.cc line 1073
Last time write locked in file /build/mysql-5.7-ESHnMz/mysql-5.7-5.7.19/storage/innobase/btr/btr0cur.cc line 1870
2017-10-16T11:33:16.238378Z 0 [Warning] InnoDB: A long semaphore wait:

Same here -> we can reproduce  the bug too and got it verified in #86552, but since june silentium... :(

Fix for this problem found!

Changing the variable innodb_purge_threads from it’s new default value of 4, to the 5.6 default value of 1 to avoid purge thread contention

innodb_purge_threads=1

Source: https://blog.pythian.com/mysql-crashes-ddl-statement-lesson-purge-threads/

I've been running into this issue in 5.7 too and can reproduce it.  I'm trying innodb_purge_threads = 1 now, but since innodb_purge_threads = 4 is the default in 5.7 and 8.0, is there still an issue to be fixed?  It's very easy to reproduce by running concurrent DDL.

I have the info that this bug was fixed in the 5.7.23, 8.0.12 release

We had faced this bug on 5.7.23 (AWS RDS service) with innodb_purge_threads                     = 1 . Can we reopen this bug ?

Isn't this one related actually to https://bugs.mysql.com/bug.php?id=91977 where AHI is contributing to the problem?