MySQL Bugs: #99943: Hash join does not work for Semijoin and Antijoin

Bug #99943	Hash join does not work for Semijoin and Antijoin
Submitted:	20 Jun 2020 20:34	Modified:	17 Feb 2021 14:23
Reporter:	Tibor Korocz	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	8.0.20	OS:	Linux
Assigned to:		CPU Architecture:	Any
Tags:	hash join

Description:
Hi, 

8.0.20 release notes states the followings (https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-20.html#mysqld-8-0-20-optimizer):

Hash joins are now used any time a nested block loop would be employed. This means that hash joins can be used for the following types of queries:
    Inner non-equi-joins
    Semijoins
    Antijoins
    Left outer joins
    Right outer joins

In the worklog we can also see this with examples (https://dev.mysql.com/worklog/task/?id=13377),example: 

Semijoin

EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.i) IN (SELECT t2.i FROM t2);
-> Hash semijoin (t2.i = t1.i)  (cost=0.83 rows=2)
    -> Table scan on t1  (cost=0.45 rows=2)
    -> Hash
        -> Table scan on t2  (cost=0.18 rows=2)

Antijoin

EXPLAIN FORMAT=tree
  SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.col1 = t2.col1);
-> Hash anti-join (t1.col1 = t2.col1)  (cost=1.15 rows=6)
    -> Table scan on t2  (cost=0.55 rows=3)
    -> Hash
        -> Table scan on t1  (cost=0.30 rows=2)

So I assume in 8.0.20 Semijon and Antijoins should use Hash Join. But in my tests they are still using Nested Loop.

Also if we check the documentation: 
https://dev.mysql.com/doc/refman/8.0/en/hash-joins.html

It lists the queries where Hash Join can be used. Semijoin and AntiJoin is in the list but in the example it shows nested loop instead of Hash Join in the documentation as well.

How to repeat:
create database test;
use test;

CREATE TABLE `t1` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

CREATE TABLE `t2` (
  `id` int NOT NULL AUTO_INCREMENT,
  `k` int NOT NULL DEFAULT '0',
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

insert into t1 (k) values (5);
insert into t1 (k) values (6);
insert into t1 (k) values (7);
insert into t1 (k) values (8);
insert into t1 (k) values (9);
insert into t1 (k) values (10);

insert into t2 (k) values (5);
insert into t2 (k) values (6);
insert into t2 (k) values (7);
insert into t2 (k) values (8);
insert into t2 (k) values (9);
insert into t2 (k) values (10);

mysql> show global variables like '%vers%';
+-------------------------+------------------------------+
| Variable_name           | Value                        |
+-------------------------+------------------------------+
| innodb_version          | 8.0.20                       |
| protocol_version        | 10                           |
| slave_type_conversions  |                              |
| tls_version             | TLSv1,TLSv1.1,TLSv1.2        |
| version                 | 8.0.20                       |
| version_comment         | MySQL Community Server - GPL |
| version_compile_machine | x86_64                       |
| version_compile_os      | Linux                        |
| version_compile_zlib    | 1.2.11                       |
+-------------------------+------------------------------+
9 rows in set (0.01 sec)

Optimizer settings:
| optimizer_switch | index_merge=on,index_merge_union=on,index_merge_sort_union=on,index_merge_intersection=on,engine_condition_pushdown=on,index_condition_pushdown=on,mrr=on,mrr_cost_based=on,block_nested_loop=on,batched_key_access=off,materialization=on,semijoin=on,loosescan=on,firstmatch=on,duplicateweedout=on,subquery_materialization_cost_based=on,use_index_extensions=on,condition_fanout_filter=on,derived_merge=on,use_invisible_indexes=off,skip_scan=on,hash_join=on |

Semijoin:

EXPLAIN FORMAT=tree SELECT * FROM t1 WHERE (t1.k) IN (SELECT t2.k FROM t2);
| -> Nested loop inner join
    -> Filter: (t1.k is not null)  (cost=0.85 rows=6)
        -> Table scan on t1  (cost=0.85 rows=6)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (k=t1.k)
        -> Materialize with deduplication
            -> Table scan on t2  (cost=0.85 rows=6)
 |

Antijoin:

EXPLAIN FORMAT=tree
  SELECT * FROM t2 WHERE NOT EXISTS (SELECT * FROM t1 WHERE t1.k = t2.k);

| -> Nested loop antijoin
    -> Table scan on t2  (cost=0.85 rows=6)
    -> Single-row index lookup on <subquery2> using <auto_distinct_key> (k=t2.k)
        -> Materialize with deduplication
            -> Table scan on t1  (cost=0.85 rows=6)
 |

As we can see it is still using Nested Loop for Semijoin and Antijoin.
However Left outer Join and Inner non equi-joins are using Hash Join as expected.

Left outer join:

EXPLAIN FORMAT=tree SELECT * FROM t1 LEFT JOIN t2 ON t1.k = t2.k;

| -> Left hash join (t2.k = t1.k)  (cost=3.99 rows=36)
    -> Table scan on t1  (cost=0.85 rows=6)
    -> Hash
        -> Table scan on t2  (cost=0.14 rows=6)
 |

Inner non equi-join:

EXPLAIN FORMAT=tree SELECT * FROM t1 JOIN t2 ON t1.k < t2.k;

| -> Filter: (t1.k < t2.k)  (cost=4.70 rows=12)
    -> Inner hash join (no condition)  (cost=4.70 rows=12)
        -> Table scan on t2  (cost=0.07 rows=6)
        -> Hash
            -> Table scan on t1  (cost=0.85 rows=6)
 |

Suggested fix:
So the point here the release notes,worklog and the documentations are contradicting each other.

Option 1: SemiJoin and AntiJoin is not supported in Hash Join, in that case Release Notes should be updated and also documentation should be more clear.

Option 2: SemiJoin and Antijoin should be working with Hash Join, in that case documentation should be updated and more importantly this is a bug and should be fixed.

Hello Tibor Korocz,

Thank you for the report and test case.

regards,
Umesh

As this bug report cites: "Hash joins are now used any time a nested block loop would be employed".  For the examples here, Block Nested Loop (BNL) was not used before the introduction of hash join, so one should not expect hash join to be used.

For the two examples, semijoin materialization and subquery materialization are still used.  These strategies are basically equivalent to hash join since it will build a hash index over a temporary table.

You can see hash join in the explain output if you set optimizer_switch='materialization=off'

[7 Jan 2021 5:38] Steinar Gunderson

As Øystein says: We do support hash join for these joins, but the optimizer
will not always pick it. That is not a bug; where we picked BNL before, we'll
pick hash join now, no more, no less.