Description:
When there is a tinyint column with 0/1 values and only handful of rows contain one of them, optimizer computes the rows extimate for using index on that column as 50% of the rows instead of 95%+ and moreover it actually uses the index instead of a table scan even when there are additional conditions on a non-indexed column (so PK ref access will be needed anyway).
Selecting the "rare" value gives much better estimate (and using index is the right choice in that case).
I was able to repeat it with 5.5 http://sqlfiddle.com/#!2/31d54/1 and 5.6 http://sqlfiddle.com/#!9/31d549/1
I tested it between 15k - 1M rows and did observe it in production on 3.5M table. It seemed to behave correctly (optimizer prefering tablescan) somewhere under 15k rows.
I do not have access to 5.7 currently so cannot verify it there.
How to repeat:
DROP TABLE IF EXISTS skewed_bool;
CREATE TABLE skewed_bool (id INT AUTO_INCREMENT PRIMARY KEY, val INT NOT NULL, flag bool NOT NULL, KEY(flag)) engine=innodb;
INSERT INTO skewed_bool(val, flag) VALUES (5,0),(5,0),(5,1),(5,0),(5,0),(5,0),(5,0),(5,0);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
INSERT INTO skewed_bool(val, flag) SELECT 5, 0 FROM skewed_bool;
INSERT INTO skewed_bool(val, flag) VALUES (5,1);
EXPLAIN SELECT COUNT(1) FROM skewed_bool WHERE val > 1 AND flag = 1;
EXPLAIN SELECT COUNT(1) FROM skewed_bool WHERE val > 1 AND flag = 0;
SHOW INDEX FROM skewed_bool;
analyze TABLE skewed_bool;
SHOW INDEX FROM skewed_bool;
EXPLAIN SELECT COUNT(1) FROM skewed_bool WHERE val > 1 AND flag = 1;
EXPLAIN SELECT COUNT(1) FROM skewed_bool WHERE val > 1 AND flag = 0;
SELECT COUNT(1), flag FROM skewed_bool GROUP BY flag;
Suggested fix:
Observe the "heuristics" 20% rows -> full table scan as done in other cases.