Bug #100058 InnoDB hangs for 500s when renaming table
Submitted: 1 Jul 2020 8:50 Modified: 3 Jul 2020 2:54
Reporter: Xinjing Zhou-from TxSQL Team at Tencent Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: InnoDB storage engine Severity:S1 (Critical)
Version:5.7.29, 5.7.30 OS:Linux
Assigned to: CPU Architecture:x86 (24 cores)
Tags: deadlock

[1 Jul 2020 8:50] Xinjing Zhou-from TxSQL Team at Tencent
Description:
When executing the attached go-program against the server, the server hangs for 500 seconds. The go-program simulates a workload (our business workload) of updating tables with secondary 
indexes mixed with renaming tables.

The server is configured using the attached my.cnf file. The database is populated with data in the attached SQL file.

After analyzing the stack trace(also attached), I think this is caused by the deadlock between renaming thread and io handler thread.

The renaming thread(Thread 30, fil_rename_tablespace) sets the node->stop_ios to TRUE and waits (fil0fil.cc:3407) until node->n_pending to drop to 0 before doing the renaming work. 

In normal cases, the io handler thread will decrement the node->n_pending after completing an IO task. However, shown in the stack trace, the io handler thread happens to read a page that requires insert-buffer merging which causes a read io to be submitted through fil_mutex_enter_and_prepare_for_io. And this fil_mutex_enter_and_prepare_for_io function then need to wait until node->stop_ios to become FALSE before moving on. So the node->n_pending will never be decremented because io handler thread and renaming thread has basically resulted in a deadlock. 

Although there is a limit on the number of waits allowed in fil_rename_tablespace, the total amount of wait time could reach 500s in the worst case (the deadlock case we just encountered) which is not acceptable. 

How to repeat:
I have simplified the settings to help reproduce the problem, here are the steps:

1. Configure the server with the attached my.cnf file.
2. Populate the database "sbtest" with the data in the attached SQL file which is a dump of 10 sysbench tables each with ~500 rows. Each table is equipped with a secondary index.
3. Run the attached go-program against the system. Basically, the go-program initiates concurrent connections to the database. Each of the connection does the following things: 
    1) 999 in 1000 chances, Randomly update a tuple on the secondary-indexed column of a random sysbench table. 
    2) Otherwise, randomly rename a sysbench table.

Assuming the go program is compiled and named as "rename-update", run the following command:
"./rename-update -P 36777 -d sbtest -f 10 -g 502 -p <your db password> -u <your db username>";

This should be sufficient to reproduce the problem.

Suggested fix:
I attached a simple patch for fixing this issue: in fil_rename_tablespace, after finding out that n_pending >0 and before waiting 20ms, set the node->stop_ios to be FALSE to allow the rest of the threads to have a chance to run. 
This is also how it is done in MySQL 8.0 which does not have this issue.
[1 Jul 2020 8:52] Xinjing Zhou-from TxSQL Team at Tencent
stack trace

Attachment: rename-hang.stack (application/octet-stream, text), 303.18 KiB.

[1 Jul 2020 8:53] Xinjing Zhou-from TxSQL Team at Tencent
dump of the database "sbtest"

Attachment: sbtest.sql (application/octet-stream, text), 958.72 KiB.

[1 Jul 2020 8:57] Xinjing Zhou-from TxSQL Team at Tencent
MySQL configuration file

Attachment: my.cnf (application/octet-stream, text), 2.36 KiB.

[1 Jul 2020 8:58] Xinjing Zhou-from TxSQL Team at Tencent
go program for reproducing the hang

Attachment: rename-update.go (application/octet-stream, text), 7.26 KiB.

[1 Jul 2020 9:08] Xinjing Zhou-from TxSQL Team at Tencent
Patch for fixing the issue

Attachment: rename-update-hang-fix.patch (application/octet-stream, text), 396 bytes.

[1 Jul 2020 12:38] MySQL Verification Team
Hi Mr. Zhou,

Thank you for your bug report.

However, this does not seem to be  bug at all.

First of all, if you read our Reference Manual, you will find out that RENAME operation waits for ALL transactions that involve this table, in any possible manner, to finish, before RENAME can be performed in the internal cache and dictionary. This is exactly your case.

Needless to mention, you are not using latest 5.7 release.

Not a bug.
[2 Jul 2020 7:00] Xinjing Zhou-from TxSQL Team at Tencent
Sorry, I am a bit confused. Why is hanging for 500s is not a bug? The io handler thread is basically doing nothing for 500s. This also means that the io thread won't handle any reads/writes assigned to it, blocking even transactions unrelated to the table being renamed for 500s. You can verify this by running "show processlist;" in MySQL client cmdline when the deadlock occurs. 

A sample result of the "show processlist;" is attached below. The table is being renamed from "sbtest6tmp" to "sbtest6". There are many blocked updates on unrelated tables.

Just verified the same issue on the latest release(using source codes of 5.7.30 from Github).
[2 Jul 2020 7:01] Xinjing Zhou-from TxSQL Team at Tencent
Sample result of the "show processlist " command when the deadlock occurs.

Attachment: show-processlist.txt (text/plain), 9.12 KiB.

[2 Jul 2020 7:04] Xinjing Zhou-from TxSQL Team at Tencent
For production servers, hanging for 500s is pretty serious, if not a disaster.
[2 Jul 2020 12:17] MySQL Verification Team
Hi,

As our documentation clearly states, there are several different things that RENAME has to wait for in order to execute. Beside this problem, your installation might not be optimally configured.

Not a bug.
[3 Jul 2020 2:54] Xinjing Zhou-from TxSQL Team at Tencent
"you will find out that RENAME operation waits for ALL transactions that involve this table, in any possible manner, to finish, before RENAME can be performed in the internal cache and dictionary.".

I get this as RENAME is a blocking operation that requires exclusive access to the table being renamed. However, in the show-processlist.txt shown above, it is actually blocking other transactions on other tables as well which is not an expected behavior according to the manual. 
Actually, these blocked transactions logically don't have anything to do with the table being renamed. Unfortunately, the io operations of these unrelated transactions happen to be submitted to the io handler which is being blocked by the deadlock, causing these transactions to hang as well.

Maybe the manual could be improved to reflect this.

"Beside this problem, your installation might not be optimally configured.".

As for the sample installation described above, it is designed for quickly reproducing the bug. Our real workload configuration has the following traits that the sample installation mimics:
1. Change Buffering = All
2. Double Write = On
3. Secondary indexes updates.
4. The working set size > capacity the buffer pool so that change buffering and merging happens.
5. Renaming tables that have secondary indexes updates on them.

So long as there are change buffering, secondary index updates, and renaming tables in the workload, it is possible to trigger the hang.
[3 Jul 2020 12:27] MySQL Verification Team
Yes, RENAME is blocking other transactions. 

Simply, it has to access two different table caches, which both have been locked.

This is partially resolved in 8.0.