MySQL Bugs: #71189: Manual does not provide enough details on how loose index scan really works

Bug #71189	Manual does not provide enough details on how loose index scan really works
Submitted:	20 Dec 2013 14:56	Modified:	20 Dec 2013 20:02
Reporter:	Valeriy Kravchuk	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S3 (Non-critical)
Version:	5.5, any	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any
Tags:	GROUP BY, loose index scan, manual, Optimizer

Description:
MySQL Manual (http://dev.mysql.com/doc/refman/5.5/en/group-by-optimization.html or any other pages) does not present enough details on how loose index scan really works. It says something like this:

"When there is no WHERE clause, a loose index scan reads as many keys as the number of groups, which may be a much smaller number than that of all keys. If the WHERE clause contains range predicates (see the discussion of the range join type in Section 8.8.1, “Optimizing Queries with EXPLAIN”), a loose index scan looks up the first key of each group that satisfies the range conditions, and again reads the least possible number of keys."

But it does not explain even for the simplest examples what exact key look ups are performed and why. It also does not explain how decision is made by optimizer to use loose index scan vs tight index scan.

How to repeat:
Try to explain the following results based on the manual:

C:\Program Files\MySQL\MySQL Server 5.6\bin>mysql -uroot -proot -P3312 test
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 13
Server version: 5.5.33 MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> show create table tgb\G
*************************** 1. row ***************************
       Table: tgb
Create Table: CREATE TABLE `tgb` (
  `id1` int(11) NOT NULL DEFAULT '0',
  `id2` int(11) NOT NULL DEFAULT '0',
  `c1` int(11) DEFAULT NULL,
  PRIMARY KEY (`id1`,`id2`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.03 sec)

mysql> select count(*) from tgb;
+----------+
| count(*) |
+----------+
|     8192 |
+----------+
1 row in set (0.04 sec)

mysql> analyze table tgb;
+----------+---------+----------+----------+
| Table    | Op      | Msg_type | Msg_text |
+----------+---------+----------+----------+
| test.tgb | analyze | status   | OK       |
+----------+---------+----------+----------+
1 row in set (0.02 sec)

mysql> explain select id1, max(id2) from tgb where id2<10000 group by id1 having
 id1=6783;
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+---------------------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  |
rows | Extra                                 |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+---------------------------------------+
|  1 | SIMPLE      | tgb   | range | NULL          | PRIMARY | 8       | NULL |
8111 | Using where; Using index for group-by |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+---------------------------------------+
1 row in set (0.03 sec)

mysql> flush status;
Query OK, 0 rows affected (0.00 sec)

mysql> select id1, max(id2) from tgb where id2<10000 group by id1 having id1=678
3;
+------+----------+
| id1  | max(id2) |
+------+----------+
| 6783 |     9971 |
+------+----------+
1 row in set (0.28 sec)

mysql> show session status like 'Handler_read%';
+-----------------------+-------+
| Variable_name         | Value |
+-----------------------+-------+
| Handler_read_first    | 1     |
| Handler_read_key      | 10976 |
| Handler_read_last     | 1     |
| Handler_read_next     | 0     |
| Handler_read_prev     | 0     |
| Handler_read_rnd      | 0     |
| Handler_read_rnd_next | 0     |
+-----------------------+-------+
7 rows in set (0.05 sec)

So, why in the table of only 8192 rows we had 10976 key lookups? What that first and last key reads were used for?

Suggested fix:
Describe all the details of loose index scan procedure based on complete, simple example. 

Describe cost estimations (and/or heuristics) used by optimizer to decide on loose vs tight index scan when both are possible.

tgb.sql dump to load

Attachment: tgb.sql (application/octet-stream, text), 112.05 KiB.

More things to explain. Load attached file:

C:\Program Files\MySQL\MySQL Server 5.6\bin>mysql -uroot -proot -P3312 test <p:\
percona\tgb.sql
Warning: Using a password on the command line interface can be insecure.

and then do this:

C:\Program Files\MySQL\MySQL Server 5.6\bin>mysql -uroot -proot -P3312 test
Warning: Using a password on the command line interface can be insecure.
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 19
Server version: 5.5.33 MySQL Community Server (GPL)

Copyright (c) 2000, 2013, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> delete from tgb where id1 > 1000 and id1 <>6783;
Query OK, 7118 rows affected (0.39 sec)

mysql> select count(*) from tgb where id1=6783;
+----------+
| count(*) |
+----------+
|      204 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from tgb;
+----------+
| count(*) |
+----------+
|     1074 |
+----------+
1 row in set (0.02 sec)

mysql> select count(distinct id1) from tgb;
+---------------------+
| count(distinct id1) |
+---------------------+
|                 583 |
+---------------------+
1 row in set (0.07 sec)

mysql> analyze table tgb;
+----------+---------+----------+----------+
| Table    | Op      | Msg_type | Msg_text |
+----------+---------+----------+----------+
| test.tgb | analyze | status   | OK       |
+----------+---------+----------+----------+
1 row in set (0.00 sec)

mysql> explain select sql_no_cache id1, max(id2) from tgb where id1=6783 and id2
<7000 group by id1;
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  |
rows | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+--------------------------+
|  1 | SIMPLE      | tgb   | range | PRIMARY       | PRIMARY | 8       | NULL |
 140 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+--------------------------+
1 row in set (0.04 sec)

So, here tight index scan is used, and it's expensive enough (check). Now:

mysql> delete from tgb where id1 > 100 and id1 <>6783;
Query OK, 784 rows affected (0.10 sec)

mysql> select count(*) from tgb where id1=6783;
+----------+
| count(*) |
+----------+
|      204 |
+----------+
1 row in set (0.04 sec)

mysql> select count(*) from tgb;
+----------+
| count(*) |
+----------+
|      290 |
+----------+
1 row in set (0.03 sec)

mysql> select count(distinct id1) from tgb;
+---------------------+
| count(distinct id1) |
+---------------------+
|                  63 |
+---------------------+
1 row in set (0.06 sec)

The data we are selecting are the same, just table got smaller, with less different id1 values. Now:

mysql> explain select sql_no_cache id1, max(id2) from tgb where id1=6783 and id2
<7000 group by id1;
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+--------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  |
rows | Extra                    |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+--------------------------+
|  1 | SIMPLE      | tgb   | range | PRIMARY       | PRIMARY | 8       | NULL |
 140 | Using where; Using index |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+--------------------------+
1 row in set (0.00 sec)

Still loose index scan is NOT used

mysql> alter table tgb engine=InnoDB;
Query OK, 290 rows affected (0.68 sec)
Records: 290  Duplicates: 0  Warnings: 0

(or just ANALYZE TABLE to get some new statistics), and:

mysql> explain select sql_no_cache id1, max(id2) from tgb where id1=6783 and id2
<7000 group by id1;
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+---------------------------------------+
| id | select_type | table | type  | possible_keys | key     | key_len | ref  |
rows | Extra                                 |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+---------------------------------------+
|  1 | SIMPLE      | tgb   | range | PRIMARY       | PRIMARY | 8       | NULL |
  70 | Using where; Using index for group-by |
+----+-------------+-------+-------+---------------+---------+---------+------+-
-----+---------------------------------------+
1 row in set (0.04 sec)

mysql> flush status;
Query OK, 0 rows affected (0.00 sec)

mysql> select sql_no_cache id1, max(id2) from tgb where id1=6783 and id2<7000 gr
oup by id1;
+------+----------+
| id1  | max(id2) |
+------+----------+
| 6783 |     6918 |
+------+----------+
1 row in set (0.00 sec)

mysql> show session status like 'Handler_%';
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 1     |
| Handler_delete             | 0     |
| Handler_discover           | 0     |
| Handler_prepare            | 0     |
| Handler_read_first         | 0     |
| Handler_read_key           | 4     |
| Handler_read_last          | 1     |
| Handler_read_next          | 0     |
| Handler_read_prev          | 0     |
| Handler_read_rnd           | 0     |
| Handler_read_rnd_next      | 0     |
| Handler_rollback           | 0     |
| Handler_savepoint          | 0     |
| Handler_savepoint_rollback | 0     |
| Handler_update             | 0     |
| Handler_write              | 0     |
+----------------------------+-------+
16 rows in set (0.04 sec)

Now loose index scan is used and it's super efficient. Ideally manual should explain both Handler_% values above and reasons why the same loose index scan path was NOT used initially.

Thank you for the reasonable documentation request.