Bug #89051 EXPLAIN output is different for same content depending when index was added
Submitted: 25 Dec 2017 19:06    Modified: 26 Dec 2017 11:17
Reporter: Sveta Smirnova (OCA)
Status: Verified
Category: MySQL Server: Optimizer    Severity: S3 (Non-critical)
Version: 5.7.20    OS: Any
Assigned to:    CPU Architecture: Any

[25 Dec 2017 19:06] Sveta Smirnova
Description:
EXPLAIN FORMAT=JSON reports different query_cost, rows_examined_per_scan, rows_produced_per_join, and cost_info values for the same table definition and data, depending on when the index was added. The output of SHOW STATUS LIKE 'Handler%' is the same in both cases.

How to repeat:
CREATE TABLE `t1` (
`f1` varchar(20) NOT NULL,
`f2` varchar(100) NOT NULL DEFAULT '',
`f3` bigint(20) NOT NULL,
PRIMARY KEY (`f1`,`f3`,`f2`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

insert into t1 values(encrypt(uuid()), uuid(),  (FLOOR( 1 + RAND( ) * 60000 )));
insert into t1 select encrypt(uuid()), uuid(),  (FLOOR( 1 + RAND( ) * 60000 )) from t1;
-- repeat the previous INSERT ... SELECT 16 times
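Each INSERT ... SELECT above copies every existing row, doubling the table. Starting from the one seeded row, 16 repetitions therefore leave 2^16 = 65536 rows in t1. A minimal arithmetic sketch of this (the loop count and seeding are taken from the repro above):

```python
# Sketch: row count produced by the repro's self-inserts.
# One seeded row, then each "insert into t1 select ... from t1"
# doubles the table; the repro repeats that statement 16 times.
rows = 1
for _ in range(16):
    rows *= 2  # INSERT ... SELECT copies every existing row once

print(rows)  # 65536 rows in t1 before the UPDATE
```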

update t1 set f1 = f3%100;

explain format=json  select aa.f2 , max(aa.f3) as f3 from t1 aa  where f1 =  '42' and f3 <= 30000 group by aa.f2;

EXPLAIN	{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "180.83"
    },
    "grouping_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "table": {
        "table_name": "aa",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "f1",
          "f3"
        ],
        "key_length": "70",
        "rows_examined_per_scan": 635,
        "rows_produced_per_join": 211,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "53.83",
          "eval_cost": "42.33",
          "prefix_cost": "180.83",
          "data_read_per_join": "77K"
        },
        "used_columns": [
          "f1",
          "f2",
          "f3"
        ],
        "attached_condition": "((`test`.`aa`.`f1` = '42') and (`test`.`aa`.`f3` <= 30000))"
      }
    }
  }
}
Warnings:
Level	Note
Code	1003
Message	/* select#1 */ select `test`.`aa`.`f2` AS `f2`,max(`test`.`aa`.`f3`) AS `f3` from `test`.`t1` `aa` where ((`test`.`aa`.`f1` = '42') and (`test`.`aa`.`f3` <= 30000)) group by `test`.`aa`.`f2`

-- Check the reported values. The plan looks quite efficient, but in fact it is not:

flush status;
select aa.f2 , max(aa.f3) as f3 from t1 aa  where f1 =  '42' and f3 <= 30000 group by aa.f2;

show status like 'Handler%';
Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	2
Handler_mrr_init	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	636
Handler_read_last	0
Handler_read_next	635
Handler_read_prev	0
Handler_read_rnd	635
Handler_read_rnd_next	636
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_update	0
Handler_write	635

Now let's create a dummy table and copy the rows into it, then drop the primary key from the original table, truncate it, insert the rows back, and re-add the primary key.

create table t2 like t1;
insert into t2 select * from t1;
alter table t1 drop primary key;
truncate table t1;
insert into t1 select * from t2;
alter table t1 add primary key(`f1`,`f3`,`f2`);

Now let's check how the output of EXPLAIN FORMAT=JSON has changed:

explain format=json  select aa.f2 , max(aa.f3) as f3 from t1 aa  where f1 =  '42' and f3 <= 30000 group by aa.f2;
EXPLAIN	{
  "query_block": {
    "select_id": 1,
    "cost_info": {
      "query_cost": "890.67"
    },
    "grouping_operation": {
      "using_temporary_table": true,
      "using_filesort": true,
      "cost_info": {
        "sort_cost": "635.00"
      },
      "table": {
        "table_name": "aa",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "f1",
          "f3"
        ],
        "key_length": "70",
        "rows_examined_per_scan": 635,
        "rows_produced_per_join": 635,
        "filtered": "100.00",
        "using_index": true,
        "cost_info": {
          "read_cost": "128.67",
          "eval_cost": "127.00",
          "prefix_cost": "255.67",
          "data_read_per_join": "233K"
        },
        "used_columns": [
          "f1",
          "f2",
          "f3"
        ],
        "attached_condition": "((`test`.`aa`.`f1` = '42') and (`test`.`aa`.`f3` <= 30000))"
      }
    }
  }
}
Warnings:
Level	Note
Code	1003
Message	/* select#1 */ select `test`.`aa`.`f2` AS `f2`,max(`test`.`aa`.`f3`) AS `f3` from `test`.`t1` `aa` where ((`test`.`aa`.`f1` = '42') and (`test`.`aa`.`f3` <= 30000)) group by `test`.`aa`.`f2`

-- And test the real execution path again.

flush status;
select aa.f2 , max(aa.f3) as f3 from t1 aa  where f1 =  '42' and f3 <= 30000 group by aa.f2;
show status like 'Handler%';
Variable_name	Value
Handler_commit	1
Handler_delete	0
Handler_discover	0
Handler_external_lock	2
Handler_mrr_init	0
Handler_prepare	0
Handler_read_first	0
Handler_read_key	636
Handler_read_last	0
Handler_read_next	635
Handler_read_prev	0
Handler_read_rnd	635
Handler_read_rnd_next	636
Handler_rollback	0
Handler_savepoint	0
Handler_savepoint_rollback	0
Handler_update	0
Handler_write	635
[25 Dec 2017 19:07] Sveta Smirnova
test case for MTR

Attachment: explain_json_bug_20171225.test (application/octet-stream, text), 2.68 KiB.

[26 Dec 2017 11:17] MySQL Verification Team
Hi Sveta,

Thanks for the report.

all best
Bogdan
[2 Jan 2018 9:54] Øystein Grøvlen
Hi Sveta,

Always remember to do ANALYZE TABLE after bulk inserts!  If ANALYZE TABLE is done, I get the same query plan and cost in both cases.

That said, your first variant reveals some issues with the displayed cost when statistics are way off.  What happens here is that ref access is first selected based on the dubious statistics.  Later it is discovered that more key parts can be used with range access, so the optimizer switches to range access.  It seems the cost is not updated to reflect that switch.  It also seems that the sort cost is not added for ref access.
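As a rough consistency check on the JSON numbers earlier in this report: in the second (post-rebuild) plan the cost fields appear to compose as read_cost + eval_cost = prefix_cost and prefix_cost + sort_cost = query_cost, while the first plan prints no sort_cost at all and its read_cost + eval_cost falls well short of its prefix_cost. A small sketch using only the values reported above (the composition rule is an assumption about how the fields normally relate, not documented behavior):

```python
# Costs copied from the two EXPLAIN FORMAT=JSON outputs in this report.
first  = {"read": 53.83,  "eval": 42.33,  "prefix": 180.83, "query": 180.83}
second = {"read": 128.67, "eval": 127.00, "prefix": 255.67,
          "sort": 635.00, "query": 890.67}

# Second plan: the numbers add up cleanly for a single-table plan.
assert round(second["read"] + second["eval"], 2) == second["prefix"]
assert round(second["prefix"] + second["sort"], 2) == second["query"]

# First plan: read + eval = 96.16, far from the printed prefix_cost of
# 180.83, and no sort_cost is shown -- consistent with the cost not
# being refreshed after the switch from ref access to range access.
print(round(first["read"] + first["eval"], 2))  # 96.16
```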

Note also that, in general, you cannot expect the same cost regardless of when an index was added.  B-tree indexes are often more compact when built after the data is inserted than when maintained continuously during inserts. Hence, the synopsis as written is not a bug.  However, the cost values in your first scenario show the following bug:  "Cost values are not correct when the optimizer switches from ref access to range access in order to use more key parts".