MySQL Bugs: #97349: Sort-merge optimization missing

Bug #97349	Sort-merge optimization missing
Submitted:	23 Oct 2019 17:41	Modified:	24 Oct 2019 13:03
Reporter:	Alessio Bogon	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S4 (Feature request)
Version:	8.0.18	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
It seems that when using ranged filtering conditions like "indexed_col in (1, 2, 3)" together with an ORDER BY clause on the primary key, MySQL always does a filesort to reorder the result, while it could more efficiently use a sort-merge algorithm.

In the "How to repeat" section I will show that Query A will be executed by using filesort.
The problem I am currently experiencing is actually happening when having multiple range conditions on 2 indexes, but the source of the problem is the same; MySQL will not take advantage of the fact that it has to do a merge-sort over many subsets of already sorted rows.
Note that merge-sort can easily remove duplicates "on the fly", by remembering the last id.

Query B is my actual issue, but Query A shows the problem more immediately.
In Query B I use the "force index" directive because on the large dataset I am working on, it is going to be the best way to do it (I have around 100.000.000 rows in this table).

How to repeat:
CREATE TABLE `test` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `a` int(11) NOT NULL,
  `b` int(11) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `test_a_index` (`a`),
  KEY `test_b_index` (`b`)
) ENGINE=InnoDB AUTO_INCREMENT=16 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci;

INSERT INTO test.test (id, a, b) VALUES (1, 1, null);
INSERT INTO test.test (id, a, b) VALUES (2, 2, 1);
INSERT INTO test.test (id, a, b) VALUES (3, 3, null);
INSERT INTO test.test (id, a, b) VALUES (4, 4, 1);
INSERT INTO test.test (id, a, b) VALUES (5, 5, 3);
INSERT INTO test.test (id, a, b) VALUES (6, 3, null);
INSERT INTO test.test (id, a, b) VALUES (7, 2, 3);
INSERT INTO test.test (id, a, b) VALUES (8, 2, null);
INSERT INTO test.test (id, a, b) VALUES (9, 3, 2);
INSERT INTO test.test (id, a, b) VALUES (10, 1, 1);
INSERT INTO test.test (id, a, b) VALUES (11, 5, 1);
INSERT INTO test.test (id, a, b) VALUES (12, 2, null);
INSERT INTO test.test (id, a, b) VALUES (13, 1, 2);
INSERT INTO test.test (id, a, b) VALUES (14, 5, 2);
INSERT INTO test.test (id, a, b) VALUES (15, 1, null);

-- Query A:
explain select id
from test
where a in (1, 5)
order by id desc;

-- Result:
-- 1,SIMPLE,test,,range,test_a_index,test_a_index,4,,7,100,Using where; Using index; Using filesort

-- Query B
explain select id
from test force index(test_a_index, test_b_index)
where a in (1, 5) or b in (1, 3)
order by id asc;
-- Result
-- 1,SIMPLE,test,,index_merge,"test_a_index,test_b_index","test_a_index,test_b_index","4,5",,13,100,"Using sort_union(test_a_index,test_b_index); Using where; Using filesort"

Suggested fix:
MySQL could take advantage from the fact that the subset of rows from each individual range on the column "a" is already sorted, and thus performing the merge-sort algorithm over these ordered subsets.

Hello Mr. Bogon,

Thank you for your feature request.

I am one of the co-authors of the filesort code in MySQL, that was written back in the second half of 1990s.

I assure you that our algorithm is a variant of the sort-merge. It was based on the couple of scientific articles, that were written back then. It was also adjusted to enable usage of the permanent storage for the large result sets.

Hence, not a bug.

Thank you for your answer.

I see that I made I typo for Query A. The query I meant is the following ("asc" instead of "desc"):
explain select id
from test
where a in (1, 5)
order by id asc;

I am under the impression that "filesort" will still create a temporary table, insert the data collected, and then resort it, am I correct?
My point is that it could be completely avoided.

My real world problem is that a query like this:

select id
from test
where a in (1, 5)
order by id asc
limit 1000

takes a very long time to execute (around 10 seconds), while it could be completed in milliseconds if the dataset was not being exported to a separate location and then reordered.

For more insights, my actual table contains 40000000 records, and the cardinality of test_a_index is 263679.

Another insight that may be helpful: the number of records of that query without limit is around 3600000, which are spread all across the table

select count(*)
from test
where a in (1, 5)

Result (circa): 3600000

Hi,

By default, the file that is created is of 0 bytes length. If you wish that entire sort-merge is done in the memory, there are configuration variables for that purpose, which are described in our Reference Manual.

Do note that these are local buffers, so when you set that variable, multiply it by 100, in case that 100 connections do sort-merge in the same time. Check that you have that much RAM available.

This feature request is closed.