Bug #114374 InnoDB semaphore wait bug - a thread is waiting for itself?
Submitted: 15 Mar 2024 18:38  Modified: 4 Mar 19:18
Reporter: Oleg Khozyayinov  Email Updates:
Status: Not a Bug  Impact on me: None
Category: MySQL Server  Severity: S2 (Serious)
Version: 8.0.32  OS: Any
Assigned to:  CPU Architecture: Any

[15 Mar 2024 18:38] Oleg Khozyayinov
Description:
It looks similar to this:
https://bugs.mysql.com/bug.php?id=94930

However, there is a significant difference, IMO.
It looks like the thread locked itself:
the waiting thread ID and the locking thread ID appear to be the same.
That looks like a bug to me. I'll upload the error log too.
Here is the relevant part of the log:

2024-03-15T12:26:55.598938Z 0 [Warning] [MY-012985] [InnoDB] A long semaphore wait:
--Thread 22333747586816 has waited at mtr0mtr.ic line 142 for 240 seconds the semaphore:
X-lock (wait_ex) on RW-latch at 0x148354428fb8 created in file buf0buf.cc line 794
a writer (thread id 22333747586816) has reserved it in mode wait exclusive
number of readers 7, waiters flag 1, lock_word: fffffffffffffff9
Last time read locked in file row0sel.cc line 4857
Last time write locked in file /local/p4clients/pkgbuild-mpJqs/workspace/src/RDSMySQL/storage/innobase/include/mtr0mtr.ic line 142

How to repeat:
Repeats randomly; it looks like a race condition inside MySQL.
[15 Mar 2024 18:46] Oleg Khozyayinov
Also, it would be really great if you could clarify this "semaphore crash" behavior.
My understanding is that semaphores are short-term internal locks intended only to protect the integrity of internal data structures, such as the buffer pool. If so, how exactly is it possible that a semaphore wait depends on how long a query or a transaction runs? I would not expect a semaphore to be held for the entire time a query or transaction is running, but apparently that is the case?
[19 Mar 2024 11:28] MySQL Verification Team
Hi Mr. Khozyayinov,

Thank you for your bug report.

However, let us inform you that this forum is intended only for reports with fully repeatable test cases. We do not see a test case consisting of (even multi-threaded) sets of SQL statements that produce the error.

We cannot proceed without such a test case.

However, we can still do something for you.

Could you post just two pieces of information? First, a stack trace from when the problem occurs.

Second, and much more important, we need the InnoDB status output from when this semaphore wait happens.
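For example, run this in a client session while the waits are happening; the SEMAPHORES section of the output is the part we need:

SHOW ENGINE INNODB STATUS\G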

Let us also inform you that a thread never waits for itself. In 95% of cases, this kind of error is caused by a sub-optimally written application.

Simply put, after you start a transaction you do not just issue one DML statement after another; instead there are waits for other processing, user input, and the like. These types of problems are usually fixed in the application.

Can't repeat.
[19 Mar 2024 16:02] Oleg Khozyayinov
Hello,
OK, I'll work on getting the InnoDB status report.

Meanwhile, could you please clarify this:
1. In the error log I do see that thread 22333747586816 was waiting for itself:
both the waiting thread and the locking thread have exactly the same ID, 22333747586816. If you say this should not happen, yet we see it in the logs, is it a bug or some issue with the logging?

2. A very important question for us: we don't understand how a semaphore wait can be connected to application behavior.
Let's assume that we start a transaction, issue some DML, and then our app hangs.
In this case, we expect that our transaction will time out after 'wait_timeout' expires,
and other transactions (waiting to lock the same objects) will get a timeout error when 'lock_wait_timeout' expires.
So how can semaphore behavior be affected by the application in that case?
Our understanding is that semaphores are short-term locks not under our control, and we never expect them to be held
while a MySQL thread is waiting for the client application to respond.

We need to clarify this because these errors affect our prod environment.
[19 Mar 2024 16:21] MySQL Verification Team
Hi,

Regarding a thread waiting on itself: it cannot happen. What can happen is that trx1 takes a lock of lower priority, then trx2 comes with a different lock on the same record or record gap, and then trx1 wants a lock upgrade. Then you get a lock-wait chain: trx1 -> trx2 -> trx1 ......

This is a VERY well known phenomenon with transactional engines.
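For illustration only, a minimal two-session sketch of that pattern, assuming a hypothetical table t with a primary key column id and an integer column val:

-- session 1
START TRANSACTION;
SELECT * FROM t WHERE id = 1 FOR SHARE;   -- trx1 takes a shared lock on the row

-- session 2
START TRANSACTION;
SELECT * FROM t WHERE id = 1 FOR SHARE;   -- trx2 takes a shared lock on the same row

-- session 1
UPDATE t SET val = val + 1 WHERE id = 1;  -- trx1 wants an exclusive lock, waits on trx2

-- session 2
UPDATE t SET val = val + 1 WHERE id = 1;  -- trx2 wants the same upgrade: trx1 -> trx2 -> trx1

InnoDB's deadlock detector breaks this particular cycle by rolling one of the transactions back; the sketch only illustrates the wait chain.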

Please, do note that this is not a forum for asking questions .... only for bug reports with repeatable test cases ......

Hence, we can either reproduce it or not ......

If a reporter does not supply a repeatable test case, we can not proceed ........
 
Can't repeat.
[19 Mar 2024 16:31] Oleg Khozyayinov
In the attached log there is "INNODB MONITOR OUTPUT", whose "SEMAPHORES" section lists all semaphores. There is also debug info in the same log (see the attached log and the fragment below), and it says "Submit a detailed bug report to http://bugs.mysql.com" and "Most likely, you have hit a bug". So we are following your recommendations exactly.

Also, this is directly related to the issue; it is not a general question:
we expect that our transaction will time out after 'wait_timeout' expires,
and other transactions (waiting to lock the same objects) will get a timeout error when 'lock_wait_timeout' expires.
However, we see different behavior, and it does not match what is described in the documentation.
Our understanding is that semaphores are short-term locks not under our control, and we never expect them to be held
while a MySQL thread is waiting for the client application to respond.

InnoDB: ###### Diagnostic info printed to the standard error stream
2024-03-15T12:38:47.979272Z 0 [ERROR] [MY-012872] [InnoDB] [FATAL] Semaphore wait has lasted > 600 seconds. We intentionally crash the server because it appears to be hung.
2024-03-15T12:38:47.980014Z 0 [ERROR] [MY-013183] [InnoDB] Assertion failure: srv0srv.cc:1871:ib::fatal triggered thread 22334995953408
InnoDB: We intentionally generate a memory trap.
InnoDB: Submit a detailed bug report to http://bugs.mysql.com.
InnoDB: If you get repeated assertion failures or crashes, even
InnoDB: immediately after the mysqld startup, there may be
InnoDB: corruption in the InnoDB tablespace. Please refer to
InnoDB: http://dev.mysql.com/doc/refman/8.0/en/forcing-innodb-recovery.html
InnoDB: about forcing recovery.
2024-03-15T12:38:47Z UTC - mysqld got signal 6 ;
Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware.
Thread pointer: 0x0
Attempting backtrace. You can use the following information to find out
where mysqld died. If you see no messages after this, something went
terribly wrong...
stack_bottom = 0 thread_stack 0x40000
/rdsdbbin/mysql/bin/mysqld(my_print_stacktrace(unsigned char const*, unsigned long)+0x2e) [0x2120dae]
/rdsdbbin/mysql/bin/mysqld(print_fatal_signal(int)+0x343) [0x1067273]
/rdsdbbin/mysql/bin/mysqld(my_server_abort()+0x6e) [0x10673ae]
/rdsdbbin/mysql/bin/mysqld(my_abort()+0xa) [0x211b3ba]
/rdsdbbin/mysql/bin/mysqld(ut_dbg_assertion_failed(char const*, char const*, unsigned long)+0x338) [0x246b4e8]
/rdsdbbin/mysql/bin/mysqld() [0x246dc73]
/rdsdbbin/mysql/bin/mysqld(srv_error_monitor_thread()+0x794) [0x240e514]
/rdsdbbin/mysql/bin/mysqld(void Detached_thread::operator()<void (*)()>(void (*&&)())+0xb4) [0x2342db4]
/rdsdbbin/mysql/bin/mysqld() [0x2c86780]
/lib64/libpthread.so.0(+0x744b) [0x14838684744b]
/lib64/libc.so.6(clone+0x3f) [0x14838602c52f]
[20 Mar 2024 11:01] MySQL Verification Team
Hi Mr. Khozyayinov,

First of all, you have not submitted a repeatable test case. 

However, we shall answer some of your questions.

We would, first of all, recommend that you read our Reference Manual, which is available on our documentation pages at dev.mysql.com.

If you had read the Manual, you would have found out that wait_timeout has nothing to do with transactions.

You would also have found out that lock_wait_timeout has nothing to do with transactions either.
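For reference, wait_timeout controls how long the server waits for activity on an idle connection before closing it, lock_wait_timeout applies to metadata locks, and row-lock waits inside InnoDB are governed by innodb_lock_wait_timeout. The current values can be checked with:

SELECT @@wait_timeout,              -- seconds an idle client connection is kept open
       @@lock_wait_timeout,         -- timeout for metadata locks (DDL, LOCK TABLES, ...)
       @@innodb_lock_wait_timeout;  -- seconds a transaction waits for an InnoDB row lock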

Hence, we would, most politely, recommend that you read our Reference Manual.

We would also, most kindly, ask you in the future to file bug reports with repeatable test cases.

Not a bug.
[20 Mar 2024 20:44] Oleg Khozyayinov
Thank you for your answers, it was very educational.
I understand that we use the Community Edition and you are not required to answer all our questions, especially if we don't provide a reproducible test.
However, for exactly the same reason (MySQL being a community product), I'd like to explain why we believe this is a bug in MySQL; it might help other people who encounter the same problem to read our arguments.

1. MySQL semaphores, latches, and mutexes are low-level 'integrity locks'; our application has zero control over them, as they are internal
implementation details of MySQL. For example, such 'integrity locks' are used when two threads try to modify the same InnoDB data page,
allowing only one thread to proceed at a time. Such InnoDB storage-layer mini-transactions are in no way controlled by our application.
The same 'integrity locks' are used in many other places in MySQL, protecting the buffer pool, the adaptive hash index, and many other structures.
By design, these locks are held for a much shorter time than any user-level transaction may last. A single transaction issued by a user's thread can
acquire and quickly release many (hundreds, thousands, or more) low-level 'integrity locks', and the time these locks are held depends on MySQL implementation
details; our application does not control that. Any time control is returned to the client app, these locks are expected to have been released for the given MySQL thread,
because if a MySQL thread is doing nothing, then there is nothing to protect from race conditions or other multithreading issues.

2. In addition to the low-level 'integrity locks', MySQL uses higher-level data locks which are under the control of our application. These locks are visible in
the Performance Schema data_locks table. For example, when two threads issue an UPDATE statement to change the same row, one of the threads will wait until the other one
commits or rolls back. If a user's thread cannot get this type of lock within the time controlled by innodb_lock_wait_timeout, that thread will get a timeout error,
and after getting such an error our application can either roll back or retry. These locks are released on commit/rollback only.
If the user thread in the client app is doing something else and its MySQL thread is waiting for the next query or for a commit/rollback,
these locks will be held until commit/rollback. We also know that it is OK to hold these types of locks for some time if no other thread tries to acquire them; the transaction just becomes a bit slower, and it will not block others unless they try to change the same row in the same table.
So it is safe to say that these types of locks are controlled by applications:
we know precisely what types of locks will be held in each case, when they will be released, and how our application
will react to not being able to acquire any such lock (see also the sketch after this point).
An example of how such locks behave is provided here:
https://severalnines.com/blog/how-fix-lock-wait-timeout-exceeded-error-mysql
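A minimal two-session sketch of this behavior, using a hypothetical accounts table:

-- session 1
START TRANSACTION;
UPDATE accounts SET balance = balance - 10 WHERE id = 1;  -- holds an X record lock until commit/rollback

-- session 2: blocks, then fails with "Lock wait timeout exceeded" after innodb_lock_wait_timeout seconds
UPDATE accounts SET balance = balance + 10 WHERE id = 1;

-- session 3: the conflicting record locks are visible while session 2 waits
SELECT engine_transaction_id, lock_type, lock_mode, lock_status, lock_data
  FROM performance_schema.data_locks;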

3. The error we got was caused by threads not being able to acquire an internal low-level 'integrity lock' (semaphore).
We don't know what that lock is, and we don't know the rules for how and when
it is acquired and released; these are all internal implementation details of the InnoDB engine. So this type of lock (semaphore) is the one described in my point #1.

4. What we know for sure is that in our case there were zero conflicts related to the high-level data locks described in point #2.
Our application was only inserting different rows into the same table, with no updates and no deletes, so there were no conflicts that might require higher-level data lock requests
(which are shown in the Performance Schema data_locks table) to wait for each other. Our application did nothing exotic; it was just inserting rows into a table concurrently.
Let's imagine that one of our client app threads is frozen and its related MySQL thread is getting no response (no commit/rollback). Even in this case, we don't expect any other MySQL thread to be blocked for any significant period of time, because our high-level data locks do not conflict; we just insert different rows into the same table.

5. What we do know is that some internal low-level semaphore was held long enough (10 minutes) that an InnoDB watchdog thread considered the server frozen and
decided the only way to resolve it was a restart. The error message produced by MySQL includes sentences like:
-"Most likely, you have hit a bug, but this error can also be caused by malfunctioning hardware"
-"Submit a detailed bug report to http://bugs.mysql.com"
Our hardware looks OK so far, and even the MySQL developer who wrote this error message seems to believe that it is a bug in MySQL.

6. There are two tickets which look similar (issues with semaphores held for too long); they were accepted as bugs and do not seem to be resolved:
https://bugs.mysql.com/bug.php?id=113312
https://bugs.mysql.com/bug.php?id=91977

This was a rather lengthy explanation of why we believe this is a bug in MySQL.
Hopefully it will be helpful to other people working with the Community Edition of MySQL.
[21 Mar 2024 10:43] MySQL Verification Team
Hi Mr. Khozyayinov,

Thank you for your comment.

We do recommend that you analyse your applications and see why any of your transactions should hold a mutex or lock for more than 10 minutes.
[24 Mar 2024 5:47] MySQL Verification Team
Hi Oleg,

Your error log contains "Writing a core file".
Did a core file get written? If so, can you load it into gdb on the server and get "thread apply all bt" output, so we can check every thread's stack?

Also, this looks like a custom-built server version; is it modified in any way from the 8.0.32 available in the dev.mysql.com archives?

-- 
Shane Bester, MySQL Senior Principal Technical Support Engineer
Oracle Corporation
http://dev.mysql.com/
[25 Mar 2024 17:51] Oleg Khozyayinov
Hi!

We use AWS RDS for MySQL; I'd expect it to be the regular Community Edition, but I'll find out. Also, I'll contact AWS support to find out if we can get that core file.

If you could help us better understand where the error comes from, that would be great; maybe we can alleviate the problem while searching for a fix. Looking at the log, it shows mtr0mtr.ic to be the source of the problem?
Is 'mtr' related to the 'mini-transactions' used internally by the InnoDB engine?
Is there any chance that slow disk I/O could cause such a long latch wait?
Does it make sense to try to monitor latches in the way described here (e.g. with something like the sketch below)?
https://dev.mysql.com/doc/refman/8.0/en/monitor-innodb-mutex-waits-performance-schema.html
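A sketch based on that manual page, with the instrument names as documented there:

-- enable the InnoDB mutex wait instruments
UPDATE performance_schema.setup_instruments
   SET enabled = 'YES', timed = 'YES'
 WHERE name LIKE 'wait/synch/mutex/innodb/%';

-- after running the workload, list the mutexes with the longest accumulated waits
SELECT event_name, count_star, sum_timer_wait
  FROM performance_schema.events_waits_summary_global_by_instance
 WHERE event_name LIKE 'wait/synch/mutex/innodb/%'
 ORDER BY sum_timer_wait DESC
 LIMIT 10;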

Thanks for the help!
Oleg
[4 Mar 19:18] Oleg Khozyayinov
Update.
We solved this by setting innodb_thread_concurrency = 2 x core_count, and the problem has never appeared again.
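For reference, a sketch of the change, assuming a 16-core host (the core count here is illustrative; the variable can also be set in my.cnf):

SET GLOBAL innodb_thread_concurrency = 32;  -- 2 x 16 cores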