MySQL Bugs: #2615: Performance problem when joining NULLs

Bug #2615	Performance problem when joining NULLs
Submitted:	2 Feb 2004 11:03	Modified:	18 Apr 2007 22:49
Reporter:	Nathan Goodman	Email Updates:
Status:	Duplicate	Impact on me:	None
Category:	MySQL Server: MyISAM storage engine	Severity:	S2 (Serious)
Version:	4.1	OS:	Linux (Red Hat 9.0)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When there are lots of NULLs in the columns on both sides of a join, performance drastically slows down. 

With inner joins, the obvious workaround is to check for NULL in the WHERE clause.  With outer joins -- the case I care about -- this changes the answer.  Putting the check in the ON clause doesn't fix the performance problem.

How to repeat:
I have a complete test program which I can send, but it's pretty easy to reproduce the problem from scratch as follows.

CREATE TABLE AB (a INT NULL, b INT NULL);
CREATE TABLE BC (b INT NULL, c INT NULL, INDEX(b));

Insert into AB one row (1,1), then lots of rows (i,NULL) for i=2,...
Insert into BC one row (1,1), then lots of rows (NULL,i) for i=2,...

The test query is

SELECT AB.a, AB.b AS b0, BC.b AS b1, BC.c FROM AB, BC WHERE AB.b=BC.b;

The answer, of course, is 

+------+------+------+------+
| a    | b0   | b1   | c    |
+------+------+------+------+
|    1 |    1 |    1 |    1 |
+------+------+------+------+

With 1000 rows it takes 2.6 secs to compute this answer on my machine. 
With 2000 rows, it takes 10.5 secs.
With 4000 rows, it takes 41.5 secs.

My real application uses an outer join similar to

SELECT AB.a, AB.b AS b0, BC.b AS b1, BC.c FROM AB LEFT JOIN BC ON AB.b=BC.b;

My real database has many tens of thousands of NULLs on both sides of the join.  As you can guess, my real query on my real database takes a VERY long time to complete -- so long that my users quit before it finishes and tell me that the system is hung!

Suggested fix:
It feels like the join processor is treating the NULLs as real values and constructing a large product (which it then discards since NULLs don't really match), rather than short-circuiting the computation as soon as it sees a NULL.

This is a good optimizing idea but not trivial to implement.  I did
look into adding this to 4.0, but concluded that the risk was too big
for accidently crashing something else.

What makes this a bit complex to implement is that the key-copy code
doesn't know if we are using = or <=> as a index comparison.  To fix
this we have to change some structures and function calls to store
what comparison is used and also add 'fast' detection if a null was used
in the index.

We will look into adding this to 4.1 in the near future.

Regards,
Monty

This problem has been addressed in fix for BUG#8877.

Now we have short-cutting done as soon as possible:

SELECT AB.a, AB.b AS b0, BC.b AS b1, BC.c FROM AB, BC WHERE AB.b=BC.b;
show status like 'Handler_%'; 
+----------------------------+-------+
| Variable_name              | Value |
+----------------------------+-------+
| Handler_commit             | 0     | 
| Handler_delete             | 0     | 
| Handler_discover           | 0     | 
| Handler_prepare            | 0     | 
| Handler_read_first         | 0     | 
| Handler_read_key           | 1     | 
| Handler_read_next          | 2     | 
| Handler_read_prev          | 0     | 
| Handler_read_rnd           | 0     | 
| Handler_read_rnd_next      | 1002  | 
| Handler_rollback           | 0     | 
| Handler_savepoint          | 0     | 
| Handler_savepoint_rollback | 0     | 
| Handler_update             | 0     | 
| Handler_write              | 14    | 
+----------------------------+-------+
15 rows in set (0.01 sec)

The documentation is here:

http://dev.mysql.com/doc/internals/en/optimizer-nulls-filtering.html

(You might want a little more till BUG#27939 is fixed and NULLs filtering works for eq_ref too).

Marking as duplicate of BUG#8877