Bug #76488 | stall when dropping table | ||
---|---|---|---|
Submitted: | 25 Mar 2015 22:09 | Modified: | 27 Dec 2015 3:09 |
Reporter: | Shawn Hogan | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Cluster: Cluster (NDB) storage engine | Severity: | S2 (Serious) |
Version: | 7.4.5 | OS: | Linux (suse enterprise 11) |
Assigned to: | CPU Architecture: | Any |
[25 Mar 2015 22:09]
Shawn Hogan
[25 Mar 2015 22:13]
Shawn Hogan
mysql-bug-data-76488-ndb_error_report_20150325143502.tar.bz2 was uploaded to sftp.oracle.com/support/incoming
[26 Mar 2015 20:05]
Shawn Hogan
Forgot to mention that it also happens with TRUNCATE TABLE (since internally ndbcluster drops and recreates the table). This is the biggest issue I think... you can plan around creating tables or altering tables, but it's not uncommon for some sort of cron job to empty a table after it's done processing things.
[3 Jun 2015 15:41]
MySQL Verification Team
Hi, Thanks for the report. I have troubles reproducing this problem. This is a drop of ndb table, it does take most of the time in checking permissions but this is nowhere near what you show. mysql> SHOW PROFILE FOR QUERY 1; +--------------------------------+----------+ | Status | Duration | +--------------------------------+----------+ | starting | 0.000059 | | checking permissions | 0.000013 | | Waiting for ndbcluster global | 0.000112 | | checking permissions | 0.330233 | | query end | 0.000011 | | closing tables | 0.000010 | | freeing items | 0.000015 | | cleaning up | 0.000022 | +--------------------------------+----------+ 8 rows in set, 1 warning (0.00 sec) Also the content of your logs - I can't reproduce that (the api node disconnections). What I do see is that you have MaxBufferedEpochs problems as your subscribers are lagging behind and are being dropped for that reason. It's mostly because your subscribers are not fast enough to handle the events so they lag behind or that your api node is not properly handling events it subscribed to and is lagging behind, in any way I don't see a bug here. For properly configuring the cluster our MySQL Support team can help you a great deal kind regards Bogdan Kecman
[3 Jun 2015 15:56]
Shawn Hogan
Strange... not sure why they would lag behind. They are fairly beefy nodes and aren't doing much processing. Connected via 54Gbit InfiniBand, disk array on each node able to handle about ~1.2GB/sec, 256GB RAM, 32 Xeon cores, and they run at about 1.00 load average. We shut down all traffic to the cluster for a final test before rolling back to 7.3.x. Under zero traffic/queries we still had the same issue, reproducible 100% of the time. For us, it was 100% reproducible under 7.4 and never once happened under 7.3. If it's not a bug with MySQL Cluster, maybe it's something funky with SuSE Linux Enterprise 11 somehow with 7.4. In our situation, the MaxBufferedEpochs were lagging *because* of the system seizing up during check permissions, not the other way around. Either way, I guess we are stuck on 7.3.
[3 Jun 2015 16:13]
MySQL Verification Team
Hi Shawn, I am using 7.4.6 to test this (latest attm) but there are no changes involving this between 7.4.5 and 7.4.6. As for API's you could try to increase lock-timeout to try to handle this but I can't say it will help. What are those API nodes doing? Your ndb-api application? What are they subscribed to? All changes or only some? Even if there was a lock they are lagging 100 epoch's, there are no epoch created during log so this don't make much sense to me. You could try to increase the value MaxBufferedEpochs but I doubt that will solve your problem. I tested without replication (binlogging turned off) and with replication (binlogging on), same results, with or without traffic the drop is <1sec. You say this started with 7.4, major difference is that 7.4 comes with 5.6 mysql. I see that your SQL node was not disconnected from the cluster, do you have any other SQL nodes connected to the cluster? IMHO the problem is somewhere in the API nodes, just not sure why you don't see the problem with 7.3 all best Bogdan Kecman
[3 Jun 2015 16:30]
Shawn Hogan
We also upgraded to 7.4.6 when it came out hoping it would fix the issue (on April 16), but had the same result. April 28 is when we finally rolled back to 7.3 because we had some development work to do and well... trying to do that with 7.4 running was a non-option. We aren't using NDB API for anything other than normal SQL nodes. We have other SQL nodes connected, but they are unused (hot spares basically). We tried using the other SQL nodes and the same issue, always happening under 7.4. We even tried routing ALTER TABLE queries through other SQL nodes that were doing nothing else. Still ended up with all SQL nodes being disconnected. Either way, it wasn't an isolated event... Like I said, it literally was something that would happen 100% of the time on 7.4 when truncating tables, altering tables (when it couldn't be done online), etc. And never once in the years we have been using MySQL Cluster starting with 7.1 If you think it's related to the SQL nodes somehow, I'll put them on 7.4.6 without upgrading data nodes just to see if whatever it is in indeed specific to the API nodes.
[3 Jun 2015 16:39]
MySQL Verification Team
Hi, Sorry I assumed you are running API nodes with some custom app not that that's a regular bunch of SQL nodes. Let me retest this with more complex setup before I discard this problem. Setting this back to "analyze" status for further testing. kind regards Bogdan Kecman
[7 Jul 2015 12:29]
Mikael Ronström
This issue has been fixed in 7.4 as part of BUG#21297407. Fix will appear in 7.4.8. Problem was that drop table was waiting for up to 64 fragments before drop table could proceed. After fix drop table should be even faster than in 7.3.
[7 Jul 2015 12:36]
Mikael Ronström
After reading the description I am not 100% sure it is the same issue, the fixed bug could will block drop table during a local checkpoint waiting for up to 64 partitions of various tables to complete. So with partitions that are large, this could take quite some time. The fix removes this wait for other LCPs to complete and also ensures that even if the table itself was executing a local checkpoint it will complete ASAP. So drop table before this fix should only be blocked if a LCP is ongoing while dropping table (which it does most of the time).
[7 Jul 2015 12:40]
MySQL Verification Team
@mikael, the BUG#21297407 is not visible in the bugs.mysql.com but, yes, that one I can reproduce, anything other then BUG#21297407 I can't reproduce so I would assume this one is linked with BUG#21297407. The only way to be sure is for reporter to retest with 7.4.8
[7 Jul 2015 12:44]
MySQL Verification Team
@shawn, BUG#21297407 states: Description: Drop table can be significantly slower in 7.4 How To Repeat: Fill a table with lots of data and ensure an LCP happens at the same time Suggested Fix: Ensure that Queued LCP fragments that have their table dropped are immediately reported as done. Also ensure that executing LCP scans of tables that are being dropped are quitted early on since there is no chances that this could cause inconsistency. The fix will be available with 7.4.8 all best Bogdan
[26 Oct 2015 14:32]
Carl Krumins
The comments mention that this (or a similar bug) will be fixed in 7.4.8 however the changelog notes for 7.4.8 aren't specifically clear and don't mention that this particular external and/or internal bug #21297407. Was this fixed/improved in 7.4.8 release?
[27 Dec 2015 3:09]
Shawn Hogan
Just as an FYI, we recently upgraded our operating system to suse 12, and decided to give MySQL Cluster 7.4 a try at the same time. The bug does not appear to be happening in suse 12 with 7.4.8. So either it was indeed fixed in 7.4.8, or it was something going on with suse 11 somehow.