Bug #94610 | Server stalls because ALTER TABLE on partitioned table holds dict mutex | ||
---|---|---|---|
Submitted: | 8 Mar 2019 19:09 | Modified: | 5 Aug 2020 0:00 |
Reporter: | Justin Swanhart | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: DDL | Severity: | S5 (Performance) |
Version: | 5.6.41 | OS: | CentOS (7.5) |
Assigned to: | CPU Architecture: | x86 | |
Tags: | ALTER TABLE, DDL, partition, stall |
[8 Mar 2019 19:09]
Justin Swanhart
[8 Mar 2019 20:57]
Justin Swanhart
During the stall: Main thread process no. 3185, id 140434206619392, state: enforcing dict cache limit Number of rows inserted 390436659, updated 870464553, deleted 8315026, read 1796535553852 12.49 inserts/s, 10.99 updates/s, 1.00 deletes/s, 385152.42 reads/s table_definition_cache is set to 100000 and there are <20000 opened_definitions
[8 Mar 2019 21:33]
MySQL Verification Team
Hi Justin, The issue with DDL on many partitions is verified, waiting on possible fix. Not sure what is additional to that in this report? MySQL pre 8.x we don't have data dictionary, ddl's are not transactional, it does lock everything, that's a design choice... I'm missing something in your report obviously?! all best Bogdan
[10 Mar 2019 13:02]
Jean-François Gagné
IMHO, the difference is that this bug is about ALTER blocking everything and Bug#83435 is about ALTER being slow. In this sense, those two bugs are different, but they might have the same resolution.
[10 Mar 2019 13:08]
Jean-François Gagné
I also have similar behavior, but with TRUNCATE PARTITION and with 5.7, I think it might be related.
[11 Mar 2019 13:16]
Sveta Smirnova
Bogdan, > ddl's are not transactional, it does lock everything, This report not about transactional DDLs, but about the fact that DDLs on one table lock absolutely unrelated operations (DML on *other* tables).
[11 Mar 2019 13:38]
MySQL Verification Team
Hey Sveta, > but about the fact that DDLs on one table lock absolutely unrelated operations (DML on *other* tables) Yes, and it was always doing so (till 8.x)? Not only "dml in other tables", even dml in other database/schema on the same server. Searching the documentation for confirmation if it's "documented" or not but was always like that. Only few online ddl operations tend not to do it and only on 8.x with data dictionary in database and transactional ddl's this behaves differently. I can agree that we can have a documentation issue that this behavior is not properly documented, but I'd say it's very well known (I know I'm having issues with it privately for 20+ years)?! So yes, not a transactional issue, it's a locking issue, but is not a bug, especially not an S1 in my opinion. That's if I understand the report properly. all best Bogdan
[11 Mar 2019 13:52]
MySQL Verification Team
@Jean-François Gagné Yes, truncate in this case does drop/create so it's a DDL and not DML (trying to be faster :D ) all best Bogdan
[11 Mar 2019 17:41]
Justin Swanhart
DDL on one table is not supposed to stall the server. DROP TABLE used to stall the server and a lot of work was put into it to keep that from happening. Now DROP/ALTER on partitioned servers are holding the dictionary mutex for long periods and stalling the server. This is a ALTER on an EMPTY table. It blocks operations in DML threads that create temporary tables or that need to open a new table in the dictionary. DDL on one table is not supposed to lock up other tables for the duration of the DDL. That makes MySQL into sqlite as far as concurrency goes.
[11 Mar 2019 17:51]
Justin Swanhart
Note that a DROP or ALTER on a NON-PARTITIONED table of any size works just fine and doesn't stall the server. Only operations on PARTITIONED tables stall the server.
[11 Mar 2019 18:12]
MySQL Verification Team
Justin, I think you are missing my point, DROP on NON-PARTITIONED table iirc (I might be wrong we did change that at one point but I think this is still the case) takes the same mutex, it's just "done quickly" so it's not a problem, so a problem here is that: - it is slow operation ( Bug #83435 ) - it is blocking operation (IMO Not a Bug) Both are problems. First one is 100% a bug, as we all agreed, verified, waiting on dev team to figure out how to solve. Second one on the other hand is 100% intentional. We can argue if it's documented properly or not but this is something we have since forever. We can discuss rationale behind that decision and way to fix it, but since this is non blocking operation in 8.x (or should be non blocking in 8.x) I have my doubts of that ever being redesigned into 5.x. So, yes, I can set the bug to "verified" as, yes, this is easily verified behavior, that exist since at least 3.0 (and I'm pretty sure it was there since day 1) but all that will happen is some documentation visibility as I doubt there a chance a patch that changes this can be allowed in 5.x. hence I'm not doing it and I'm setting it to "not a bug". On the other hand, the "speed" is something that can probbly be fixed so one should follow Bug #83435 (easily verified, I hope soon to be fixed :) ). all best Bogdan p.s. I'm not 100% final on this as I still want to do a code review here to check this out but ..
[11 Mar 2019 19:53]
Justin Swanhart
This is supposed to be an ONLINE DDL operation. This DDL is not ONLINE. It blocks not only the table being modify, but the whole server for the duration of the DDL, which is tens of second for an EMPTY TABLE. This is a serious bug. Marking is as not a bug is totally wrong.
[14 Mar 2019 10:01]
Justin Swanhart
To resolve this we had to switch to MyISAM temporary tables. If you would have bothered to investigate a little bit you might have made that suggestion. The response to this bug is one of the worst I've ever seen from MySQL.
[23 Apr 2019 19:03]
Sveta Smirnova
This is still the issue in 8.0 How to Repeat. 1. Modify source code: $ git diff diff --git a/storage/innobase/row/row0mysql.cc b/storage/innobase/row/row0mysql.cc index 22d29b77f47..6e838bdae90 100644 --- a/storage/innobase/row/row0mysql.cc +++ b/storage/innobase/row/row0mysql.cc @@ -2613,6 +2613,7 @@ void row_mysql_lock_data_dictionary_func(trx_t *trx, /*!< in/out: transaction */ trx->dict_operation_lock_mode = RW_X_LATCH; mutex_enter(&dict_sys->mutex); + DBUG_EXECUTE_IF("bug94610", {my_sleep(60000000); fprintf(stderr, "bug94610\n");}); } /** Unlocks the data dictionary exclusive lock. */ 2. Run the attached test. Expected result: INSERT statement is immediate. Actual result: it takes 55 seconds: set debug="d,bug94610"; alter table t2 truncate partition p3; select now(); now() 2019-04-23 02:20:41 insert into t1 values(3); select now(); now() 2019-04-23 02:21:36 set debug=''; drop table t1, t2;
[23 Apr 2019 19:03]
Sveta Smirnova
test case for MTR
Attachment: bug94610.test (application/octet-stream, text), 1.52 KiB.
[3 May 2019 11:05]
MySQL Verification Team
Hi Sveta, Thanks for the test case but as I already wrote it's easily reproducible, problem is that I don't believe it's a bug as is and that Bug #83435 is what really needs to be fixed but since Bug #83435 is still not there I'll verify this one too (as behavior is verifiable for sure) so hopefully we have the fix for both sooner, but I'm dropping it to S5 as this is in no way S1 problem. all best Bogdan
[3 May 2019 11:31]
Jean-François Gagné
Thanks for verifying this Bogdan. But I think it is more than S5. In my case, I experience stalls in 5.7 of more than 60 seconds on TRUNCATE PARTITION of very large partitions (1.7TB). IMHO, it deserves a S1, but is you want to lower the priority, I could settle on S2. But because I am down for more than 1 minute while doing this TRUNCATE, it is more than an S3 for me.
[5 Aug 2020 0:00]
Justin Swanhart
This should not be an S5 bug...