MySQL Bugs: #74744: EXPLAIN FORMAT=JSON produces duplicates

Bug #74744	EXPLAIN FORMAT=JSON produces duplicates_removal where there is none
Submitted:	7 Nov 2014 22:45	Modified:	16 Jun 2016 20:45
Reporter:	Sergey Petrunya	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Optimizer	Severity:	S3 (Non-critical)
Version:	5.6.21, 5.6.22, 5.7.6	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	explain json

Description:
There are cases where EXPLAIN FORMAT=JSON produces "duplicates_removal", while no duplicate removal is taking place. 

This issue is similar (but not the same as) BUG#74462, which is about ordering_operation when no ordering takes place.

How to repeat:
Consider this example:

create table ten(a int);
insert into ten values (0),(1),(2),(3),(4),(5),(6),(7),(8),(9);

create table one_k(a int);
insert into one_k select A.a + B.a* 10 + C.a * 100 from ten A, ten B, ten C;

CREATE TABLE tpk (
  pk int(11) NOT NULL,
  a int(11) DEFAULT NULL,
  b int(11) DEFAULT NULL,
  PRIMARY KEY (pk),
  KEY a (a)
);

insert into tpk select a,a,a from one_k;

explain select distinct a  from tpk where a<10;
explain select distinct pk from tpk where pk<10;

explain format=json select distinct a  from tpk where a<10\G
explain format=json select distinct pk from tpk where pk<10\G

MySQL [j1]> explain format=json select distinct a  from tpk where a<10\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "duplicates_removal": {
      "using_filesort": false,
      "table": {
        "table_name": "tpk",
        "access_type": "range",
        "possible_keys": [
          "a"
        ],
        "key": "a",
        "used_key_parts": [
          "a"
        ],
        "key_length": "5",
        "rows": 9,
        "filtered": 100,
        "using_index": true,
        "attached_condition": "(`j1`.`tpk`.`a` < 10)"
      }
    }
  }
}
1 row in set, 1 warning (0.00 sec)

MySQL [j1]> explain format=json select distinct pk from tpk where pk<10\G
*************************** 1. row ***************************
EXPLAIN: {
  "query_block": {
    "select_id": 1,
    "duplicates_removal": {
      "using_filesort": false,
      "table": {
        "table_name": "tpk",
        "access_type": "range",
        "possible_keys": [
          "PRIMARY",
          "a"
        ],
        "key": "PRIMARY",
        "used_key_parts": [
          "pk"
        ],
        "key_length": "4",
        "rows": 10,
        "filtered": 100,
        "using_index": true,
        "attached_condition": "(`j1`.`tpk`.`pk` < 10)"
      }
    }
  }
}
1 row in set, 1 warning (0.01 sec)

Both queries show "duplicates_removal", they look like they are using the same execution plans (except for the name of used indexes).  

In reality, their execution is different
* when using PK, the optimizer figures that primary key values are already distinct, so no duplicate_removal takes place.
* when using non-unique index "a", index scan may produce duplicates which are removed "on the fly" by comparing each row with the previous one. If it's equal, it's removed. 

Suggested fix:
Make EXPLAIN JSON show whether/how duplicates are removed.

Hello Sergey,

Thank you for the report and test case.

Thanks,
Umesh

test results

Attachment: 74744.txt (text/plain), 9.87 KiB.

Still repeatable on 5.7.12