Bug #104186 Too few chunk files are created for hybrid hash join
Submitted: 1 Jul 2021 15:04 Modified: 21 Sep 2021 23:01
Reporter: Øystein Grøvlen
Status: Closed
Category: MySQL Server: Optimizer    Severity: S5 (Performance)
Version: 8.0    OS: Any
Assigned to:    CPU Architecture: Any

[1 Jul 2021 15:04] Øystein Grøvlen
Description:
For certain join buffer sizes, the number of chunk files created for hash join is too small.  This means that each file will contain more rows than can fit in the join buffer.  When this happens, the probe chunks will have to be read multiple times.
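
To make the effect concrete, here is a minimal sketch with hypothetical numbers (not taken from the server source): when a build chunk holds more rows than the join buffer can hold, the matching probe chunk must be scanned once per refill of the buffer, so more bytes are read from the chunk files than were written to them.

// Minimal sketch, hypothetical numbers: I/O amplification when a build
// chunk does not fit in the join buffer.
#include <cstdio>

int main() {
  const long rows_per_build_chunk = 3334;   // rows written to one build chunk
  const long rows_fitting_in_buffer = 3000; // rows the join buffer can hold
  const long probe_chunk_bytes = 600000;    // bytes written to the probe chunk

  // Each refill of the join buffer from the build chunk requires one full
  // scan of the corresponding probe chunk.
  const long refills = (rows_per_build_chunk + rows_fitting_in_buffer - 1) /
                       rows_fitting_in_buffer;

  std::printf("probe chunk: written once (%ld bytes), read %ld times (%ld bytes)\n",
              probe_chunk_bytes, refills, refills * probe_chunk_bytes);
  return 0;
}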

One example is with the following query:

select count(*) from sbtest1 t1 join sbtest1 t2 on t1.c = t2.c;

When sbtest1 has 10000 rows, and we set the join buffer size to 768 kB, we see from performance schema that more data is read from chunk files than is written to them:

mysql [localhost:8025] {msandbox} (sbtest) > SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| event_name                 | count_write | count_read | Avg Write Time | Bytes Written | Avg Read Time | Bytes Read |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| wait/io/file/sql/hash_join |          31 |         51 | 52.33 us       | 1.86 MiB      | 8.21 us       | 3.09 MiB   |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
1 row in set (0.00 sec)

By decreasing (!!!) the join buffer to 512 kB, we avoid this:

mysql [localhost:8025] {msandbox} (sbtest) > SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| event_name                 | count_write | count_read | Avg Write Time | Bytes Written | Avg Read Time | Bytes Read |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| wait/io/file/sql/hash_join |          34 |         34 | 50.09 us       | 2.04 MiB      | 9.50 us       | 2.04 MiB   |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
1 row in set (0.00 sec)

How to repeat:
First create a sysbench table with 10000 rows:

sysbench --mysql-user=msandbox --mysql-password=msandbox --table-size=10000 --mysql-socket=/tmp/mysql_sandbox8025.sock oltp_read_only.lua  prepare

set join_buffer_size= 768*1024;
truncate performance_schema.file_summary_by_event_name;
select count(*) from sbtest1 t1 join sbtest1 t2 on t1.c = t2.c;

SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";

set join_buffer_size= 512*1024;
truncate performance_schema.file_summary_by_event_name;
select count(*) from sbtest1 t1 join sbtest1 t2 on t1.c = t2.c;

SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";

Suggested fix:
The cause of the issue is that integer division is mistakenly used when computing how many chunk files are needed.  A patch will be provided.
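
For illustration, here is a minimal sketch of the two ways of computing the chunk count (variable names are hypothetical, not the actual server code): integer division truncates before the result is rounded up, so it can yield one chunk too few, while dividing in floating point and rounding up ensures every chunk fits in the join buffer.

// Hypothetical sketch; not the MySQL source.  With 10000 build rows and
// room for 3000 rows in the join buffer, integer division yields 3 chunks
// of ~3334 rows (too large), while floating-point division rounded up
// yields 4 chunks that all fit.
#include <cmath>
#include <cstddef>
#include <cstdio>

int main() {
  const std::size_t rows_in_build_input = 10000;
  const std::size_t rows_fitting_in_buffer = 3000;

  // Buggy variant: the division is done on integers, so it truncates to 3
  // before std::ceil ever sees a fraction.
  const std::size_t chunks_truncated = static_cast<std::size_t>(
      std::ceil(rows_in_build_input / rows_fitting_in_buffer));

  // Fixed variant: divide in floating point, then round up to 4.
  const std::size_t chunks_rounded_up = static_cast<std::size_t>(
      std::ceil(static_cast<double>(rows_in_build_input) /
                static_cast<double>(rows_fitting_in_buffer)));

  std::printf("integer division: %zu chunks, floating-point division: %zu chunks\n",
              chunks_truncated, chunks_rounded_up);
  return 0;
}
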
[1 Jul 2021 15:08] Øystein Grøvlen
Use floating point division when deciding the number of chunk files for hash join

(*) I confirm the code being submitted is offered under the terms of the OCA, and that I am authorized to contribute it.

Contribution: hj.diff (application/octet-stream, text), 778 bytes.

[2 Jul 2021 11:21] MySQL Verification Team
Hi Mr. Øystein Grøvlen,

Thank you, very much, for your bug report.

We have managed to repeat the behaviour that you reported, by also using the sysbench tool.

Verified as reported.

Thank you, so much, for your contribution.
[21 Sep 2021 23:01] Jon Stephens
Documented fix as follows in the MySQL 8.0.28 changelog:

    When the join buffer was set to certain arbitrary sizes, the
    number of chunk files created for hash joins was too small. This
    meant that each file contained more rows than could fit in the
    join buffer, so that the probe chunks needed to be read multiple
    times. This was caused by using integer division when computing
    how many chunk files are needed; we fix this by using
    floating-point division instead.

    Our thanks to Øystein Grøvlen for the contribution.

Closed.
[22 Sep 2021 12:29] MySQL Verification Team
Thank you, Jon.