MySQL Bugs: #99934: Hash join adds columns to the hash table that is not needed.

Bug #99934	Hash join adds columns to the hash table that is not needed.
Submitted:	19 Jun 2020 2:32	Modified:	19 Jun 2020 12:20
Reporter:	Øystein Grøvlen	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S5 (Performance)
Version:	8.0.20	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Columns that are only needed for operations preceding a hash join does not need to be inserted into the hash table.  Take the following query:

SELECT /*+ NO_INDEX(orders) NO_INDEX(lineitem) */ COUNT(*)
FROM orders
JOIN lineitem ON l_orderkey = o_orderkey
WHERE o_custkey <= 1000000;

If the hash table is to contain rows from the orders table, the condition on o_custkey will be evaluated before inserting rows into the hash table.  Since this column is not needed for rest of the query execution, there is no need to insert it into the join buffer.

Running this query in the debugger, one will observe that the row_size computed in HashJoinRowBuffer::StoreRow(), see sql/hash_join_buffer.cc:412, is 9 bytes.  This is 4 bytes for o_orderkey and 5 bytes for o_custkey (since it is nullable). Since MEM_ROOT allocations are aligned to 8 bytes, this means that for this query, 16 bytes instead of 8 bytes are used for the payload of each row.

How to repeat:
CREATE TABLE lineitem (
	l_orderkey INTEGER,
	l_linenumber INTEGER,
        PRIMARY KEY (l_orderkey, l_linenumber)
);

INSERT INTO lineitem VALUES
(          1 ,            1 ),
(          1 ,            2 ),
(          1 ,            3 ),
(          1 ,            4 ),
(          1 ,            5 ),
(          1 ,            6 ),
(          2 ,            1 ),
(          3 ,            1 ),
(          3 ,            2 ),
(          3 ,            3 ),
(          3 ,            4 ),
(          3 ,            5 ),
(          3 ,            6 ),
(          4 ,            1 ),
(          5 ,            1 ),
(          5 ,            2 ),
(          5 ,            3 );

CREATE TABLE orders (
	o_orderkey INTEGER PRIMARY KEY,
	o_custkey INTEGER
);

INSERT INTO orders VALUES
(          1 ,    369001 ),
(          2 ,    780017 ),
(          3 ,   1233140 ),
(          4 ,   1367761 ),
(          5 ,    444848 );

# Set a break point at sql/hash_join_buffer.cc:415 and run this query:

SELECT /*+ NO_INDEX(orders) NO_INDEX(lineitem) */ COUNT(*)
FROM orders
JOIN lineitem ON l_orderkey = o_orderkey
WHERE o_custkey <= 1000000;

# Observe the value for row_size

Suggested fix:
Exclude columns that are not needed after the hash join

Hi Øystein Grøvlen,

Thank you for your performance improvement report.

I have analysed carefully the code related to the hash joins. Your analysis is, simply, correct. There is a significant improvement that can be achieved.

Verified as the performance improvement bug.