Description:
For some join buffer sizes, the number of chunk files created for hash join is too small. As a result, each chunk file contains more rows than can fit in the join buffer, and the probe chunks have to be read multiple times.
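To illustrate the effect, here is a minimal sketch (not the server source; the row counts and buffer capacity are assumed for illustration only): when the chunk count is too low, the build chunk no longer fits in the join buffer, so it is processed in several passes, and the matching probe chunk has to be re-read on every pass.

// Hypothetical numbers, not the actual server code.
#include <cinttypes>
#include <cstdint>
#include <cstdio>

int main() {
  const uint64_t build_rows = 10000;            // rows in the build input (assumed)
  const uint64_t rows_fitting_in_buffer = 3000; // join buffer capacity in rows (assumed)
  const uint64_t num_chunks = 3;                // chunk files created (one too few)

  const uint64_t rows_per_chunk = (build_rows + num_chunks - 1) / num_chunks;
  const uint64_t passes =
      (rows_per_chunk + rows_fitting_in_buffer - 1) / rows_fitting_in_buffer;

  // With 3 chunks: ~3334 rows per chunk -> 2 passes, so every probe chunk is
  // read twice. With 4 chunks: 2500 rows per chunk -> 1 pass, read once.
  std::printf("rows per build chunk: %" PRIu64 ", probe passes per chunk: %" PRIu64 "\n",
              rows_per_chunk, passes);
  return 0;
}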
One example is with the following query:
select count(*) from sbtest1 t1 join sbtest1 t2 on t1.c = t2.c;
When sbtest1 has 10000 rows and we set the join buffer size to 768 kB, Performance Schema shows that more data is read from the chunk files than is written to them:
mysql [localhost:8025] {msandbox} (sbtest) > SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| event_name                 | count_write | count_read | Avg Write Time | Bytes Written | Avg Read Time | Bytes Read |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| wait/io/file/sql/hash_join |          31 |         51 | 52.33 us       | 1.86 MiB      | 8.21 us       | 3.09 MiB   |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
1 row in set (0.00 sec)
By decreasing (!!!) the join buffer size to 512 kB, we avoid this:
mysql [localhost:8025] {msandbox} (sbtest) > SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| event_name                 | count_write | count_read | Avg Write Time | Bytes Written | Avg Read Time | Bytes Read |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
| wait/io/file/sql/hash_join |          34 |         34 | 50.09 us       | 2.04 MiB      | 9.50 us       | 2.04 MiB   |
+----------------------------+-------------+------------+----------------+---------------+---------------+------------+
1 row in set (0.00 sec)
How to repeat:
First create a sysbench table with 10000 rows:
sysbench --mysql-user=msandbox --mysql-password=msandbox --table-size=10000 --mysql-socket=/tmp/mysql_sandbox8025.sock oltp_read_only.lua prepare
set join_buffer_size= 768*1024;
truncate performance_schema.file_summary_by_event_name;
select count(*) from sbtest1 t1 join sbtest1 t2 on t1.c = t2.c;
SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";
set join_buffer_size= 512*1024;
truncate performance_schema.file_summary_by_event_name;
select count(*) from sbtest1 t1 join sbtest1 t2 on t1.c = t2.c;
SELECT event_name, count_write, count_read, FORMAT_PICO_TIME(avg_timer_write) "Avg Write Time", FORMAT_BYTES(sum_number_of_bytes_write) "Bytes Written", FORMAT_PICO_TIME(avg_timer_read) "Avg Read Time", FORMAT_BYTES(sum_number_of_bytes_read) "Bytes Read" FROM performance_schema.file_summary_by_event_name WHERE event_name = "wait/io/file/sql/hash_join";
Suggested fix:
The cause of the issue is that integer division is mistakenly used when computing how many chunk files are needed. A patch will be provided.
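As an illustration only (the function names and numbers below are hypothetical and not taken from the server source), the difference between truncating and rounding up when computing the chunk count could look like this:

#include <cstdint>

// Buggy variant: plain integer division rounds the chunk count down.
// E.g. 10000 build rows with room for 3000 rows per buffer -> 3 chunks,
// i.e. ~3334 rows per chunk, which does not fit in the join buffer.
uint64_t num_chunks_truncating(uint64_t build_rows, uint64_t rows_per_buffer) {
  return build_rows / rows_per_buffer;
}

// Fixed variant: round up -> 4 chunks of ~2500 rows each, so every chunk
// fits in the join buffer and every probe chunk is read exactly once.
uint64_t num_chunks_rounded_up(uint64_t build_rows, uint64_t rows_per_buffer) {
  return (build_rows + rows_per_buffer - 1) / rows_per_buffer;
}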