Description:
Hi, MySQL developers, thanks for reading my report. I find a missed optimization in MySQL.
When a WHERE clause uses (t0.c0, t1.c0) IN ((1,1), (2,2)), the MySQL optimizer fails to push the individual column-value conditions down to the respective tables. Instead, it generates a cross join (inner hash join with no join condition) between a small filtered set from one table and the entire other table, then applies the row-value filter on the huge intermediate result.
This causes a full index scan of the second table, even when the constants would allow an efficient index range scan on that table.
The logically equivalent condition (t0.c0 = 1 AND t1.c0 = 1) OR (t0.c0 = 2 AND t1.c0 = 2) produces the expected plan: both tables are accessed via range scans, the join produces only 4 rows, and the filter is trivially applied.
Such predicates are a widespread, natural way to filter joins by paired values (product‑category IDs, user‑role pairs, etc.) and should be optimized natively by the engine.
How to repeat:
1. Create table and insert data
CREATE TABLE t0(c0 INT PRIMARY KEY);
CREATE TABLE t1(c0 INT PRIMARY KEY);
DELIMITER //
CREATE PROCEDURE batch_insert_numbers()
BEGIN
DECLARE i INT DEFAULT 1;
DECLARE batch_size INT DEFAULT 1000;
WHILE i <= 1000000 DO
START TRANSACTION;
WHILE i <= 1000000 AND batch_size > 0 DO
INSERT INTO t0 VALUES (i);
INSERT INTO t1 VALUES (i);
SET i = i + 1;
SET batch_size = batch_size - 1;
END WHILE;
COMMIT;
SET batch_size = 1000;
END WHILE;
END //
DELIMITER ;
-- Insert into t0 and t1 with 1,000,000 rows via procedure
CALL batch_insert_numbers();
2. Execute query
-- negative case
EXPLAIN ANALYZE SELECT * FROM t0, t1 WHERE (t0.c0, t1.c0) IN ((1, 1), (2, 2)) \G
*************************** 1. row ***************************
EXPLAIN: -> Filter: ((t0.c0,t1.c0) in ((1,1),(2,2))) (cost=200178 rows=399600) (actual time=2.28..307 rows=2 loops=1)
-> Inner hash join (no condition) (cost=200178 rows=399600) (actual time=2.28..201 rows=2e+6 loops=1)
-> Covering index scan on t1 using PRIMARY (cost=10178 rows=999000) (actual time=2.23..111 rows=1e+6 loops=1)
-> Hash
-> Covering index range scan on t0 using PRIMARY over (c0 = 1) OR (c0 = 2) (cost=0.91 rows=2) (actual time=0.0333..0.0361 rows=2 loops=1)
1 row in set (0.308 sec)
-- positiva
EXPLAIN ANALYZE SELECT * FROM t0, t1 WHERE (t0.c0 = 1 AND t1.c0 = 1) OR (t0.c0 = 2 AND t1.c0 = 2) \G
*************************** 1. row ***************************
EXPLAIN: -> Filter: (((t1.c0 = 1) and (t0.c0 = 1)) or ((t1.c0 = 2) and (t0.c0 = 2))) (cost=2.73 rows=4) (actual time=0.0551..0.058 rows=2 loops=1)
-> Inner hash join (no condition) (cost=2.73 rows=4) (actual time=0.0541..0.0566 rows=4 loops=1)
-> Covering index range scan on t1 using PRIMARY over (c0 = 1) OR (c0 = 2) (cost=0.81 rows=2) (actual time=0.00776..0.00937 rows=2 loops=1)
-> Hash
-> Filter: ((t0.c0 = 1) or (t0.c0 = 2)) (cost=0.91 rows=2) (actual time=0.0349..0.0384 rows=2 loops=1)
-> Covering index range scan on t0 using PRIMARY over (c0 = 1) OR (c0 = 2) (cost=0.91 rows=2) (actual time=0.0333..0.0363 rows=2 loops=1)
1 row in set (0.001 sec)
Suggested fix:
Decompose multi-table row-value IN conditions into an OR-of-AND form early enough in optimization to allow index range scans on all involved tables, avoiding the current cross-join and filter approach.
Description: Hi, MySQL developers, thanks for reading my report. I find a missed optimization in MySQL. When a WHERE clause uses (t0.c0, t1.c0) IN ((1,1), (2,2)), the MySQL optimizer fails to push the individual column-value conditions down to the respective tables. Instead, it generates a cross join (inner hash join with no join condition) between a small filtered set from one table and the entire other table, then applies the row-value filter on the huge intermediate result. This causes a full index scan of the second table, even when the constants would allow an efficient index range scan on that table. The logically equivalent condition (t0.c0 = 1 AND t1.c0 = 1) OR (t0.c0 = 2 AND t1.c0 = 2) produces the expected plan: both tables are accessed via range scans, the join produces only 4 rows, and the filter is trivially applied. Such predicates are a widespread, natural way to filter joins by paired values (product‑category IDs, user‑role pairs, etc.) and should be optimized natively by the engine. How to repeat: 1. Create table and insert data CREATE TABLE t0(c0 INT PRIMARY KEY); CREATE TABLE t1(c0 INT PRIMARY KEY); DELIMITER // CREATE PROCEDURE batch_insert_numbers() BEGIN DECLARE i INT DEFAULT 1; DECLARE batch_size INT DEFAULT 1000; WHILE i <= 1000000 DO START TRANSACTION; WHILE i <= 1000000 AND batch_size > 0 DO INSERT INTO t0 VALUES (i); INSERT INTO t1 VALUES (i); SET i = i + 1; SET batch_size = batch_size - 1; END WHILE; COMMIT; SET batch_size = 1000; END WHILE; END // DELIMITER ; -- Insert into t0 and t1 with 1,000,000 rows via procedure CALL batch_insert_numbers(); 2. Execute query -- negative case EXPLAIN ANALYZE SELECT * FROM t0, t1 WHERE (t0.c0, t1.c0) IN ((1, 1), (2, 2)) \G *************************** 1. row *************************** EXPLAIN: -> Filter: ((t0.c0,t1.c0) in ((1,1),(2,2))) (cost=200178 rows=399600) (actual time=2.28..307 rows=2 loops=1) -> Inner hash join (no condition) (cost=200178 rows=399600) (actual time=2.28..201 rows=2e+6 loops=1) -> Covering index scan on t1 using PRIMARY (cost=10178 rows=999000) (actual time=2.23..111 rows=1e+6 loops=1) -> Hash -> Covering index range scan on t0 using PRIMARY over (c0 = 1) OR (c0 = 2) (cost=0.91 rows=2) (actual time=0.0333..0.0361 rows=2 loops=1) 1 row in set (0.308 sec) -- positiva EXPLAIN ANALYZE SELECT * FROM t0, t1 WHERE (t0.c0 = 1 AND t1.c0 = 1) OR (t0.c0 = 2 AND t1.c0 = 2) \G *************************** 1. row *************************** EXPLAIN: -> Filter: (((t1.c0 = 1) and (t0.c0 = 1)) or ((t1.c0 = 2) and (t0.c0 = 2))) (cost=2.73 rows=4) (actual time=0.0551..0.058 rows=2 loops=1) -> Inner hash join (no condition) (cost=2.73 rows=4) (actual time=0.0541..0.0566 rows=4 loops=1) -> Covering index range scan on t1 using PRIMARY over (c0 = 1) OR (c0 = 2) (cost=0.81 rows=2) (actual time=0.00776..0.00937 rows=2 loops=1) -> Hash -> Filter: ((t0.c0 = 1) or (t0.c0 = 2)) (cost=0.91 rows=2) (actual time=0.0349..0.0384 rows=2 loops=1) -> Covering index range scan on t0 using PRIMARY over (c0 = 1) OR (c0 = 2) (cost=0.91 rows=2) (actual time=0.0333..0.0363 rows=2 loops=1) 1 row in set (0.001 sec) Suggested fix: Decompose multi-table row-value IN conditions into an OR-of-AND form early enough in optimization to allow index range scans on all involved tables, avoiding the current cross-join and filter approach.