Bug #96378 Subquery with parameter is exponentially slower than hard-coded value
Submitted: 31 Jul 2019 3:00 Modified: 2 Jan 2020 9:47
Reporter: Jeff Johnson Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S4 (Feature request)
Version:8.0.17 OS:Windows (Windows 10 x64)
Assigned to: CPU Architecture:x86 (AMD Threadripper 1950X)
Tags: performance

[31 Jul 2019 3:00] Jeff Johnson
Description:
I have the following query which takes almost 5 minutes to run. If I replace @__cutOff_1 with '2019-07-24' in the query, it takes 2 seconds to run. Both return identical results.

Submissions table has 6 million rows, ip addresses table has 360K rows. Total db size is 2 GB. All necessary fields have indexes. Query explainer looks the same for both queries with the parameter and with the hard-coded value.

QUERY:

SET @__cutOff_0 = '2019-07-29';
SET @__minBanCount_1 = 2;
SET @__cutOff_1 = '2019-07-24';
SET @__minBanCount_2 = 5;

SELECT `i`.`IPAddress`, `i`.`BanCount`, MAX(`s`.`Timestamp`) AS `c`
FROM `Submissions` AS `s`
CROSS JOIN `IPAddresses` AS `i`
WHERE (((`s`.`Timestamp` >= @__cutOff_0) AND (`i`.`BanCount` >= @__minBanCount_1)) AND (`s`.`IPAddressId` = `i`.`IPAddress`)) AND `i`.`IPAddress` NOT IN (
    SELECT DISTINCT `i2`.`IPAddress`
    FROM `Submissions` AS `s2`
    CROSS JOIN `IPAddresses` AS `i2`
    WHERE (`s2`.`Timestamp` >= @__cutOff_1) AND (`i2`.`BanCount` >= @__minBanCount_2) AND (`s2`.`IPAddressId` = `i2`.`IPAddress`)
)
GROUP BY `i`.`IPAddress`, `i`.`BanCount`
ORDER BY `c` DESC
LIMIT 50000

How to repeat:
Here is my table creation script. You could honesty just create 360K random ip addresses entries, and then 8 million random submission entries pointing to the ip addresses and a few random found text entries.

CREATE TABLE `foundtext` (
  `Id` int(11) NOT NULL,
  `Text` varchar(255) NOT NULL,
  PRIMARY KEY (`Id`),
  UNIQUE KEY `IX_FoundText_Text` (`Text`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `submissions` (
  `Timestamp` datetime(6) NOT NULL,
  `IPAddressId` varbinary(16) NOT NULL,
  `SourceIPAddressId` varbinary(16) NOT NULL,
  `Source` longtext NOT NULL,
  `UserName` longtext NOT NULL,
  `OSNameId` int(11) NOT NULL,
  `OSVersionId` int(11) NOT NULL,
  `VersionId` int(11) NOT NULL,
  `ApiKeyId` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`Timestamp`,`IPAddressId`,`SourceIPAddressId`),
  KEY `IX_Submissions_ApiKeyId` (`ApiKeyId`),
  KEY `IX_Submissions_IPAddressId` (`IPAddressId`),
  KEY `IX_Submissions_OSNameId` (`OSNameId`),
  KEY `IX_Submissions_OSVersionId` (`OSVersionId`),
  KEY `IX_Submissions_SourceIPAddressId` (`SourceIPAddressId`),
  KEY `IX_Submissions_VersionId` (`VersionId`),
  CONSTRAINT `FK_Submissions_ApiKeys_ApiKeyId` FOREIGN KEY (`ApiKeyId`) REFERENCES `apikeys` (`Id`) ON DELETE RESTRICT,
  CONSTRAINT `FK_Submissions_FoundText_OSNameId` FOREIGN KEY (`OSNameId`) REFERENCES `foundtext` (`Id`) ON DELETE RESTRICT,
  CONSTRAINT `FK_Submissions_FoundText_OSVersionId` FOREIGN KEY (`OSVersionId`) REFERENCES `foundtext` (`Id`) ON DELETE RESTRICT,
  CONSTRAINT `FK_Submissions_FoundText_VersionId` FOREIGN KEY (`VersionId`) REFERENCES `foundtext` (`Id`) ON DELETE RESTRICT,
  CONSTRAINT `FK_Submissions_IPAddresses_IPAddressId` FOREIGN KEY (`IPAddressId`) REFERENCES `ipaddresses` (`IPAddress`) ON DELETE RESTRICT,
  CONSTRAINT `FK_Submissions_IPAddresses_SourceIPAddressId` FOREIGN KEY (`SourceIPAddressId`) REFERENCES `ipaddresses` (`IPAddress`) ON DELETE RESTRICT
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

CREATE TABLE `ipaddresses` (
  `IPAddress` varbinary(16) NOT NULL,
  `BanCount` bigint(20) NOT NULL,
  `AttackedCount` bigint(20) NOT NULL,
  `Details` varchar(1024) NOT NULL,
  `Flags` int(11) NOT NULL,
  PRIMARY KEY (`IPAddress`),
  KEY `IX_IPAddresses_AttackedCount` (`AttackedCount`),
  KEY `IX_IPAddresses_BanCount` (`BanCount`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

Suggested fix:
This seems like a serious performance bug in the internal query parser / handler of MySQL.

Even if full table scans were being done, I would not expect this query to take 5 minutes, it is not very complex and I have added all the necessary indexes...
[31 Jul 2019 12:35] MySQL Verification Team
Hi Mr. Johnson,

Thank you for your bug report.

What you describe is a known behaviour of our optimiser. Simply, in order for index to be used, you can not use user variables. You can circumvent this in your application.

However, I find that this is a very realistic feature request, so I am suggesting to you to convert this report into a feature request.

Your feedback will be appreciated.
[31 Jul 2019 14:56] Jeff Johnson
Converted to feature request. Just FYI, MS SQL Server has no problem with the parameters and runs the query quickly either way in just under 2 seconds, same data set.
[31 Jul 2019 15:01] Jeff Johnson
When parameters are used not in a sub-query they do not suffer the performance issue as well, very odd.
[31 Jul 2019 15:10] MySQL Verification Team
Hello Mr. Johnson,

We do know that a problem is restricted to the nested queries. Simply, our optimiser currently concludes that if user variable is present that the nested query is a dependent one, which can not be converted to sem-join.

Henceforth, this is very welcome feature request.

Verified.
[2 Dec 2019 9:47] Roy Lyseng
Hi Jeff,

if you still have problems with this case, we'd like to reopen it.

In version 8.0.18, we got a new feature EXPLAIN ANALYZE that will better provide information about what parts of a query contribute to the total cost.

Please, can you run EXPLAIN ANALYZE on two queries:
1. The original query
2. The original query with user variables substituted with constant values.

In version 8.0.17, we started optimizing NOT IN and NOT EXISTS in a hopefully better way, like a semi-join strategy. The new optimizer strategy can be de-selected with this setting:
set optimizer_switch='semijoin=off';

If possible, can you try to run the query with both setting (semijoin on and off)?

It is a bit tedious for us to try to recreate your data, and when you run the queries, we know it is the explained plan that causes you problems.
[3 Jan 2020 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[7 Jan 2020 14:48] MySQL Verification Team
Hi Mr. Johnson,

Is there a chance that we get a feedback from you on our questions ???