Description:
Hi,
8.0.20 release notes states the followings (https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-20.html#mysqld-8-0-20-optimizer):
Hash joins are now used any time a nested block loop would be employed. This means that hash joins can be used for the following types of queries:
Inner non-equi-joins
Semijoins
Antijoins
Left outer joins
Right outer joins
In the worklog we can also see this with examples (https://dev.mysql.com/worklog/task/?id=13377),example:
Semijoin
EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.i) IN (SELECT t2.i FROM t2);
-> Hash semijoin (t2.i = t1.i) (cost=0.83 rows=2)
-> Table scan on t1 (cost=0.45 rows=2)
-> Hash
-> Table scan on t2 (cost=0.18 rows=2)
Antijoin
EXPLAIN FORMAT=tree
SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.col1 = t2.col1);
-> Hash anti-join (t1.col1 = t2.col1) (cost=1.15 rows=6)
-> Table scan on t2 (cost=0.55 rows=3)
-> Hash
-> Table scan on t1 (cost=0.30 rows=2)
So I assume in 8.0.20 Semijon and Antijoins should use Hash Join. But in my tests they are still using Nested Loop.
Also if we check the documentation:
https://dev.mysql.com/doc/refman/8.0/en/hash-joins.html
It lists the queries where Hash Join can be used. Semijoin and AntiJoin is in the list but in the example it shows nested loop instead of Hash Join in the documentation as well.
How to repeat:
create database test;
use test;
CREATE TABLE `t1` (
`id` int NOT NULL AUTO_INCREMENT,
`k` int NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
CREATE TABLE `t2` (
`id` int NOT NULL AUTO_INCREMENT,
`k` int NOT NULL DEFAULT '0',
PRIMARY KEY (`id`)
) ENGINE=InnoDB;
insert into t1 (k) values (5);
insert into t1 (k) values (6);
insert into t1 (k) values (7);
insert into t1 (k) values (8);
insert into t1 (k) values (9);
insert into t1 (k) values (10);
insert into t2 (k) values (5);
insert into t2 (k) values (6);
insert into t2 (k) values (7);
insert into t2 (k) values (8);
insert into t2 (k) values (9);
insert into t2 (k) values (10);
mysql> show global variables like '%vers%';
+-------------------------+------------------------------+
| Variable_name | Value |
+-------------------------+------------------------------+
| innodb_version | 8.0.20 |
| protocol_version | 10 |
| slave_type_conversions | |
| tls_version | TLSv1,TLSv1.1,TLSv1.2 |
| version | 8.0.20 |
| version_comment | MySQL Community Server - GPL |
| version_compile_machine | x86_64 |
| version_compile_os | Linux |
| version_compile_zlib | 1.2.11 |
+-------------------------+------------------------------+
9 rows in set (0.01 sec)
Optimizer settings:
| optimizer_switch | index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,use_invisible_indexes=off,skip_scan=on,hash_join=on |
Semijoin:
EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.k) IN (SELECT t2.k FROM t2);
| -> Nested loop inner join
-> Filter: (t1.k is not null) (cost=0.85 rows=6)
-> Table scan on t1 (cost=0.85 rows=6)
-> Single-row index lookup on <subquery2> using <auto_distinct_key> (k=t1.k)
-> Materialize with deduplication
-> Table scan on t2 (cost=0.85 rows=6)
|
Antijoin:
EXPLAIN FORMAT=tree
SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.k = t2.k);
| -> Nested loop antijoin
-> Table scan on t2 (cost=0.85 rows=6)
-> Single-row index lookup on <subquery2> using <auto_distinct_key> (k=t2.k)
-> Materialize with deduplication
-> Table scan on t1 (cost=0.85 rows=6)
|
As we can see it is still using Nested Loop for Semijoin and Antijoin.
However Left outer Join and Inner non equi-joins are using Hash Join as expected.
Left outer join:
EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.k = t2.k;
| -> Left hash join (t2.k = t1.k) (cost=3.99 rows=36)
-> Table scan on t1 (cost=0.85 rows=6)
-> Hash
-> Table scan on t2 (cost=0.14 rows=6)
|
Inner non equi-join:
EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.k < t2.k;
| -> Filter: (t1.k < t2.k) (cost=4.70 rows=12)
-> Inner hash join (no condition) (cost=4.70 rows=12)
-> Table scan on t2 (cost=0.07 rows=6)
-> Hash
-> Table scan on t1 (cost=0.85 rows=6)
|
Suggested fix:
So the point here the release notes,worklog and the documentations are contradicting each other.
Option 1: SemiJoin and AntiJoin is not supported in Hash Join, in that case Release Notes should be updated and also documentation should be more clear.
Option 2: SemiJoin and Antijoin should be working with Hash Join, in that case documentation should be updated and more importantly this is a bug and should be fixed.