MySQL Bugs: #87998: MySQL Group By slow even when indexed

Bug #87998	MySQL Group By slow even when indexed
Submitted:	5 Oct 2017 16:42	Modified:	10 Oct 2017 15:07
Reporter:	Bram Kivenko	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.7	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	query performance

Description:
For larger datasets a simple group-by query is very inefficient even though separating the query into 2 steps, that in totality should create more workload, ends up being up to 10x faster in practice (on my hardware).

This is true in a Windows environment on a laptop as well as on a VM running Linux.  It doesn't seem to be infrastructure specific.  There seems to be a fundamental deficiency in the aggregation logic that does not apply when querying from a table that contains exactly the necessary information and no more.

In practice the number of resultant records output from the group-by is about 1000 to 2000.  The decline in performance is not related to the number of resultant records, but rather is related to the number of input records, and the decline is non-linear.

For example, each day yields roughly the same number of colA x colB combinations, and separating the query into 10 or 20 steps instead of querying a whole month at once will yield the performance of the second query construction form.  Yet, for some reason, putting a month's worth of data into a temporary table and then grouping on that will not cause a decline in performance.

How to repeat:
The intuitive query construction is slow:

SELECT colA, colB, count(*)
FROM someTable use index (idx_someTable_Date_colA_colB)
WHERE Date >= ? and Date < ?
GROUP BY colA, colB;

(I provide the "use index" hint to indicate here that there is an exact index available for the query)

--------

The 2-step query construction is up to 10x faster for larger datasets:

DROP TEMPORARY TABLE IF EXISTS TEMPT;
CREATE TEMPORARY TABLE TEMPT (
  colA bigint(20),
  colB varchar(50)
) 
AS (
  SELECT colA, colB
  FROM someTable use index (idx_someTable_Date_colA_colB)
  WHERE Date >= ? and Date < ?
);
SELECT colA, colB, count(*)
FROM TEMPT
GROUP BY colA, colB;

-----

Date is of type "date".

As there seems to be no obvious reason why the second form should be materially faster than the first form, there is obviously something wrong with the execution of the first query.

I will note that the average record size for someTable in practice is ~400 bytes, but since necessary information is available from the index, I don't believe it is relevant.

Hi!

Thank you for your bug report. What we see is actually the expected behaviour. It is , of course, much faster to aggregate over the table that is 70 bytes wide, in comparison with the table that is 400 bytes wide. This is a simple law of physics.

We would need a full and always repeatable test case to see if there is anything at odds, but before you do that, please try the following two / three steps:

* Set latin1 on all CHAR or VARCHAR columns
* If that is not possible, use some other fixed-length character set
* Apply fixed length character set, both for table and for all other charset settings ...
* Use FORCE INDEX instead of the USE INDEX
* See the length of your VARCHAR columns and see whether you can convert it to CHAR. For example, if all your colB tuples contain between 30 and 40 chars, set it to CHAR(40)

Please, let us know how it worked and if it does not we will have to try and repeat your test case. I must point out, one more time, that aggregating over 70 bytes is MUCH faster then aggregating over 400 bytes. Regarding index, it might not be wise to use if filtering conditions are done by index search, which would result in much smaller table ...

I hope that I was clear enough.

I will try to create a repeatable example, however, while I do this, I want to make sure the problem is described correctly, because there seems to be a misunderstanding:

There are two constructions offered, one seems to be 10x faster than the other.

The slow construction does this: Queries a table for information of interest, which is indexed and the index is definitely being used, the results of that query are aggregated into a small result set of < 2000 records.

The fast construction does this: Queries a table for information of interest, inserts that information into a new temporary table, the new table is then table scanned to produce the same result set above.

To be clear, I am counting time from the start of querying the original table in both instances, therefore the workload cannot be less for the second construction.

I will attempt to prove that: this doesn't have anything to do with row length, and that even if I additionally group by a column that is not indexed (and therefore requires the original records be read in full) that the second construction will still be faster.

We are waiting on your feedback.

Please, follow all other instructions as well.

Upon further investigation...

It turns out, it is caused by the underlying table actually having colB defined as type text, not a varchar.  The performance difference is caused exclusively because of this.

However, this leaves me with 2 issues:

(a) The index can be used without referencing the table rows at all, but this optimization does not seem to apply with group-by.

(b) temporary tables seem to take a huge hit for text columns even though the text content itself is small.  The DYNAMIC row format, for example is supposed to store small text data inline, but this does not translate to temporary tables, or at least the intended performance does not translate.

So, this is not a bug, but there is some obvious performance enhancement available here that is not being used.

Hi!

Thank you for your investigations. I agree that this is not a bug, but I do not see a way how this situation could be optimised any further. You see, CLOB columns, which include both TEXT and BLOB, are stored in separate pages. That is how InnoDB SE is designed and has many advantages for fetching parts of the tuples, when CLOB is not required. 

The only possibility is to use VARCHAR, if possible.

Before closing the bug report, I am curious to know what happens with the following statement:

SELECT colA, colB, count(*)
FROM someTable use index (idx_someTable_Date_colA_colB)
WHERE Date >= ? and Date < ?
GROUP BY colA, colB || '';

(1)  The above executes with the same performance as a varchar even if colB is a text column.  Does it truncate long content for colB?  Does it have other potential drawbacks?

(2) Can't short text content be embedded inline in the internal temp table?  It appears this optimization is available with DYNAMIC row formats when selecting from the original table, but the temp table seemingly does not make use of this optimization.

Adding that operation at the end changes last column type. Speed will depend most of all of the lengths of the contents in colB in most of the tuples. So, your expression might slow down everything or accelerate , depending on the contents.

When the optimiser chooses the algorithms, he uses only global info plus indices, does not scan a table.

DYNAMIC row formats is an old format, not used any more.