Description:
Based on http://dev.mysql.com/doc/refman/5.6/en/innodb-parameters.html#sysvar_innodb_stats_auto_rec... and other manual pages explaining how persistent statistics works one may assume that if the table is created with persistent statistics and innodb_stats_auto_recalc is ON, like it happens with default settings in MySQL 5.6:
mysql> show variables like 'innodb_stats%';
+--------------------------------------+-------------+
| Variable_name | Value |
+--------------------------------------+-------------+
| innodb_stats_auto_recalc | ON |
| innodb_stats_method | nulls_equal |
| innodb_stats_on_metadata | OFF |
| innodb_stats_persistent | ON |
| innodb_stats_persistent_sample_pages | 20 |
| innodb_stats_sample_pages | 8 |
| innodb_stats_transient_sample_pages | 8 |
+--------------------------------------+-------------+
7 rows in set (0.00 sec)
there is no need to run ANALYZE TABLE after loading a lot of new data into the newly created table - if more than 10% of rows in the table are changed (insert is a change, right?) statistics will be updated properly.
This is not always the case it seems (even though it's not obvious why) and plan based on persistent statistics that is updated may be wrong/non-optimal until explicit ANALYZE TABLE is executed after data load. See "How to repeat" section below.
How to repeat:
mysql> create table ti1(id int auto_increment primary key, c2 char(200)) engine=InnoDB;
Query OK, 0 rows affected (0.34 sec)
mysql> insert into ti1(c2) values ('a');
Query OK, 1 row affected (0.03 sec)
mysql> insert into ti1(c2) select repeat('b',200) from ti1; Query OK, 1 row affected (0.03 sec)
Records: 1 Duplicates: 0 Warnings: 0
...
mysql> insert into ti1(c2) select repeat('b',200) from ti1;
Query OK, 512 rows affected (0.72 sec)
Records: 512 Duplicates: 0 Warnings: 0
mysql> create table ti2(id int auto_increment primary key, c2 char(200)) engine=InnoDB;
Query OK, 0 rows affected (0.38 sec)
mysql> alter table ti2 add key(c2);
Query OK, 0 rows affected (0.44 sec)
Records: 0 Duplicates: 0 Warnings: 0
mysql> insert into ti2 select * from ti1;
Query OK, 1024 rows affected (1.35 sec)
Records: 1024 Duplicates: 0 Warnings: 0
mysql> select * from mysql.innodb_table_stats where table_name like 'ti%'\G
*************************** 1. row ***************************
database_name: test
table_name: ti1
last_update: 2013-10-16 11:03:02
n_rows: 1024
clustered_index_size: 17
sum_of_other_index_sizes: 0
*************************** 2. row ***************************
database_name: test
table_name: ti2
last_update: 2013-10-16 11:06:35
n_rows: 1024
clustered_index_size: 17
sum_of_other_index_sizes: 15
2 rows in set (0.00 sec)
So, we see in the related tables (I checked that index one also, had to remove the output to fit this stupid 8K limit) that persistent statistics for the ti2 table is there and it was updated after INSERT ... SELECT as we expected. Now, let's check the plan for one simple query:
mysql> explain select * from ti1 inner join ti2 on ti1.c2 = ti2.c2 where ti2.id < 10;
+----+-------------+-------+------+---------------+------+---------+-------------+------+--------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+------+---------------+------+---------+-------------+------+--------------------------+
| 1 | SIMPLE | ti1 | ALL | NULL | NULL | NULL | NULL | 1024 | Using where |
| 1 | SIMPLE | ti2 | ref | PRIMARY,c2 | c2 | 201 | test.ti1.c2 | 1 | Using where; Using index |
+----+-------------+-------+------+---------------+------+---------+-------------+------+--------------------------+
2 rows in set (0.00 sec)
For some smart reason optimizer decides, based on existing persistent (I assume) statistics that it makes sense to scan the entire ti1 table and check corresponding rows in ti2 table using index on c2 column (even though join condition is true for for every row in ti1, it's ti2.id that is really a limit).
One can easily check what it means for the runtime and number of access operations:
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from ti1 inner join ti2 on ti1.c2 = ti2.c2 where ti2.id < 10;
...
7162 rows in set (0.80 sec)
mysql> show status like 'Handler_read%';
+-----------------------+---------+
| Variable_name | Value |
+-----------------------+---------+
| Handler_read_first | 1 |
| Handler_read_key | 1025 |
| Handler_read_last | 0 |
| Handler_read_next | 1046530 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 1025 |
+-----------------------+---------+
7 rows in set (0.00 sec)
Awful, as expected. Now, what if we run ANALYZE on ti2?
mysql> select now();
+---------------------+
| now() |
+---------------------+
| 2013-10-16 11:14:24 |
+---------------------+
1 row in set (0.00 sec)
mysql> analyze table ti2;
+----------+---------+----------+----------+
| Table | Op | Msg_type | Msg_text |
+----------+---------+----------+----------+
| test.ti2 | analyze | status | OK |
+----------+---------+----------+----------+
1 row in set (0.34 sec)
mysql> select * from mysql.innodb_table_stats where table_name like 'ti%'\G *************************** 1. row ***************************
database_name: test
table_name: ti1
last_update: 2013-10-16 11:03:02
n_rows: 1024
clustered_index_size: 17
sum_of_other_index_sizes: 0
*************************** 2. row ***************************
database_name: test
table_name: ti2
last_update: 2013-10-16 11:14:30
n_rows: 1024
clustered_index_size: 17
sum_of_other_index_sizes: 15
2 rows in set (0.00 sec)
...
Statistics was updated (for me it looks the same though). Now let's check the plan for the same query:
mysql> explain select * from ti1 inner join ti2 on ti1.c2 = ti2.c2 where ti2.id < 10;
+----+-------------+-------+-------+---------------+---------+---------+------+------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra |
+----+-------------+-------+-------+---------------+---------+---------+------+------+----------------------------------------------------+
| 1 | SIMPLE | ti2 | range | PRIMARY,c2 | PRIMARY | 4 | NULL | 8 | Using where |
| 1 | SIMPLE | ti1 | ALL | NULL | NULL | NULL | NULL | 1024 | Using where; Using join buffer (Block Nested Loop) |
+----+-------------+-------+-------+---------------+---------+---------+------+------+----------------------------------------------------+
2 rows in set (0.00 sec)
Looks notably more reasonable! Quick check:
mysql> flush status;
Query OK, 0 rows affected (0.00 sec)
mysql> select * from ti1 inner join ti2 on ti1.c2 = ti2.c2 where ti2.id < 10;
...
7162 rows in set (0.01 sec)
mysql> show status like 'Handler_read%';
+-----------------------+-------+
| Variable_name | Value |
+-----------------------+-------+
| Handler_read_first | 2 |
| Handler_read_key | 2 |
| Handler_read_last | 0 |
| Handler_read_next | 8 |
| Handler_read_prev | 0 |
| Handler_read_rnd | 0 |
| Handler_read_rnd_next | 1025 |
+-----------------------+-------+
7 rows in set (0.00 sec)
That Handler_read% values above do show the difference.
Suggested fix:
Describe in the manual that one should still run ANALYZE TABLE after loading data, no mater what value of innodb_stats_auto_recalc he has.
Expose more details about persistent statistics to help troubleshooting reasons of bad query plans.
If there is actually an optimizer bug/known deficiency exposed by the above, report and fix it separately.