Bug #45314 Index is not used for joining on a nullable columns with no additional condition
Submitted: 3 Jun 2009 22:46 Modified: 4 Jun 2009 6:25
Reporter: Alex Bolenok Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.1.28-rc-community, 4.1, 5.0, 5.1, 6.0 bzr OS:Any (MS Windows, Linux)
Assigned to: CPU Architecture:Any
Triage: Triaged: D3 (Medium)

[3 Jun 2009 22:46] Alex Bolenok
Description:
When joining two tables on a column that is nullable on one of the tables, index range scan cannot be used (and cannot even be forced) on that table unless you add a condition:

WHERE column IS NOT NULL

How to repeat:
CREATE TABLE t_leading (id INT NOT NULL PRIMARY KEY, selector INT, value TEXT NOT NULL, KEY ix_leading_selector (selector)) ENGINE=InnoDB;
CREATE TABLE t_driven (id INT NOT NULL PRIMARY KEY, value TEXT NOT NULL) Engine=InnoDB;
INSERT
INTO    t_leading (id, selector, value)
VALUES
        (1, 1, LPAD('Value 1', 3000, ' ')),
        (2, NULL, LPAD('Value 2', 3000, ' ')),
        (3, NULL, LPAD('Value 3', 3000, ' ')),
        (4, NULL, LPAD('Value 4', 3000, ' ')),
        (5, NULL, LPAD('Value 5', 3000, ' ')),
        (6, NULL, LPAD('Value 6', 3000, ' ')),
        (7, NULL, LPAD('Value 7', 3000, ' ')),
        (8, NULL, LPAD('Value 8', 3000, ' ')),
        (9, NULL, LPAD('Value 9', 3000, ' ')),
        (10, NULL, LPAD('Value 10', 3000, ' '));
INSERT
INTO    t_driven (id, value)
VALUES
        (1, LPAD('Value 1', 3000, ' ')),
        (2, LPAD('Value 2', 3000, ' ')),
        (3, LPAD('Value 3', 3000, ' ')),
        (4, LPAD('Value 4', 3000, ' ')),
        (5, LPAD('Value 5', 3000, ' ')),
        (6, LPAD('Value 6', 3000, ' ')),
        (7, LPAD('Value 7', 3000, ' ')),
        (8, LPAD('Value 8', 3000, ' ')),
        (9, LPAD('Value 9', 3000, ' ')),
        (10, LPAD('Value 10', 3000, ' '));
EXPLAIN
SELECT  *
FROM    t_leading l FORCE INDEX (ix_leading_selector)
JOIN
        t_driven d FORCE INDEX (PRIMARY)
ON      d.id = l.selector;

1, 'SIMPLE', 'l', 'ALL', 'ix_leading_selector', '', '', '', 10, ''
1, 'SIMPLE', 'd', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.l.selector', 1, ''

EXPLAIN
SELECT  *
FROM    t_leading l FORCE INDEX (ix_leading_selector)
JOIN
        t_driven d FORCE INDEX (PRIMARY)
ON      d.id = l.selector
WHERE   l.selector IS NOT NULL;

1, 'SIMPLE', 'l', 'range', 'ix_leading_selector', 'ix_leading_selector', '5', '', 1, 'Using where'
1, 'SIMPLE', 'd', 'eq_ref', 'PRIMARY', 'PRIMARY', '4', 'test.l.selector', 1, ''

Since these queries are in fact identical (a satisfied equality condition implies NOT NULL on both columns), I expected same plan to be used for both queries, which would use ix_leading_selector, as FORCE INDEX says.

However, the first query uses full table scan over t_leading, even despite FORCE INDEX being used.

This is not efficient, since there is only 1 non-NULL column and 9 NULL columns, and the table rows very large.

Besides, this behavior contradicts documentation, which says:

> You can also use FORCE INDEX, which acts like USE INDEX (index_list) but with the addition that a table scan is assumed to be very expensive. In other words, a table scan is used only if there is no way to use one of the given indexes to find rows in the table. 

There certainly is a way to use this index, since this query, which differs only in JOIN order, does use it:

EXPLAIN
SELECT  *
FROM    t_driven d FORCE INDEX (PRIMARY)
STRAIGHT_JOIN
        t_leading l FORCE INDEX (ix_leading_selector)
ON      d.id = l.selector;

1, 'SIMPLE', 'd', 'ALL', 'PRIMARY', '', '', '', 10, ''
1, 'SIMPLE', 'l', 'ref', 'ix_leading_selector', 'ix_leading_selector', '5', 'test.d.id', 5, 'Using where'

Suggested fix:
Fix the optimizer algorithm so that the index can be FORCED for the first query
[4 Jun 2009 6:25] Sveta Smirnova
Thank you for the report.

Verified as described.