Bug #3765 Optimizer error using both Left and Right joins between the same tables
Submitted: 14 May 2004 16:17 Modified: 17 May 2004 12:39
Reporter: Shawn Green Email Updates:
Status: Duplicate Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.1a-alpha-nt-log OS:Windows (Win2K)
Assigned to: CPU Architecture:Any

[14 May 2004 16:17] Shawn Green
Description:
In response to a posting on the MySQL lists I was trying to help someone answer the question of how many unique data items exist in similar columns in two separate tables. Since the person needing help was particularly iterested in counting IP addresses I set up a pair of tables with lists of IP addresses. I created multiple copies of the same values in both tables but only shared 1 value between the two.

I attempted to create a non-UNION query that recreates the function of the not-yet-implemented FULL OUTER JOIN. I took the first table and LEFT JOINed it to the second AND I RIGHT JOINed the second table in the same query. What I ended up with was the cartesian product (cross product) of the two tables. After thinking about it, it didn't seem like the right results (see analysis after problem setup).

How to repeat:
Create two tables, tablea and tableb. Here are the defs I used:

mysql> show create table tablea \G
*************************** 1. row ***************************
       Table: tablea
Create Table: CREATE TABLE `tablea` (
  `IP` varchar(16) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

mysql> show create table tableb \G
*************************** 1. row ***************************
       Table: tableb
Create Table: CREATE TABLE `tableb` (
  `IP` varchar(16) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)

Populate tablea and tableb with test data. Use dupliate entries for each value. Match data between tables on only one value. Here is my sample data:
mysql> select * from tablea;
+-------------+
| IP          |
+-------------+
| 192.168.1.1 |
| 192.168.1.1 |
| 192.168.1.1 |
| 192.168.1.3 |
| 192.168.1.3 |
| 192.168.1.3 |
| 192.168.1.5 |
| 192.168.1.5 |
| 192.168.1.5 |
| 192.168.1.3 |
+-------------+
10 rows in set (0.00 sec)

mysql> select * from tableb;
+-------------+
| IP          |
+-------------+
| 192.168.1.2 |
| 192.168.1.2 |
| 192.168.1.2 |
| 192.168.1.4 |
| 192.168.1.4 |
| 192.168.1.4 |
| 192.168.1.6 |
| 192.168.1.6 |
| 192.168.1.6 |
| 192.168.1.8 |
| 192.168.1.8 |
| 192.168.1.8 |
| 192.168.1.3 |
| 192.168.1.3 |
+-------------+
14 rows in set (0.00 sec)

Note that 192.168.1.3 appears in both tablea and tableb.

This was my test query:
SELECT COUNT(DISTINCT IFNULL(a.ip, c.ip)) as IP_Count
FROM tablea a
LEFT JOIN tableb b
      ON b.ip = a.ip
RIGHT JOIN tableb c
      ON c.ip = a.ip;

which resulted in 3 (a reasonable number, if not exactly what I wanted). I removed the functions and opened up the select to view the raw data that the COUNT(DISTINCT) was working on.

SELECT A.ip as a, B.ip as b, C.ip as c
FROM tablea A
LEFT JOIN tableb B
      ON b.ip = a.ip
RIGHT JOIN tableb C
      ON c.ip = a.ip;

which produced the following output (shortened for space):
+-------------+-------------+-------------+
| a           | b           | c           |
+-------------+-------------+-------------+
| 192.168.1.1 | NULL        | 192.168.1.2 |
       ----- clipped for space -----
| 192.168.1.5 | NULL        | 192.168.1.8 |
| 192.168.1.5 | NULL        | 192.168.1.8 |
| 192.168.1.5 | NULL        | 192.168.1.8 |
| 192.168.1.3 | NULL        | 192.168.1.8 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.1 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.5 | NULL        | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+-------------+
148 rows in set (0.00 sec)

I recognize this as a cartesian product but I don't think that was what I asked for. I tried to put myself in the shoes of the query engine and thought about how would I solve that query. I decided that I would have to create an intermediate result set based on one of the JOIN conditions then use that subset of records as one side of the second JOIN.

Case 1:
Create an intermediate result set (T) from:
FROM tablea A
LEFT JOIN tableb B
      ON B.ip = A.ip

(T)
+-------------+-------------+
| A.ip        | B.ip        |
+-------------+-------------+
| 192.168.1.1 | NULL        |
| 192.168.1.1 | NULL        |
| 192.168.1.1 | NULL        |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL        |
| 192.168.1.5 | NULL        |
| 192.168.1.5 | NULL        |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+
then RIGHT JOIN T to C (FROM T RIGHT JOIN tableb C on C.IP = A.IP). In this case I expect to produce 28 rows. 12 of them from unique C values and 16 (8x2) more for the matches on 192.168.1.3.

Case 2:
Create an intermediate result set T2:
FROM tablea A
RIGHT JOIN tableb C
      ON C.ip = B.ip;

(T2)
+-------------+-------------+
| A.ip        | C.ip        |
+-------------+-------------+
| NULL        | 192.168.1.2 |
| NULL        | 192.168.1.2 |
| NULL        | 192.168.1.2 |
| NULL        | 192.168.1.4 |
| NULL        | 192.168.1.4 |
| NULL        | 192.168.1.4 |
| NULL        | 192.168.1.6 |
| NULL        | 192.168.1.6 |
| NULL        | 192.168.1.6 |
| NULL        | 192.168.1.8 |
| NULL        | 192.168.1.8 |
| NULL        | 192.168.1.8 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+
20 rows
And LEFT JOIN B to T2. (FROM T2 LEFT JOIN tableb B on B.IP = A.IP) to get 28 rows as a result. (12 of them wher A.IP is null and 16 (the 8 from T2 times the 2 from B) where they match on 192.168.1.3)

In both cases I only get 28 rows of data and not the full 148 of the cross product. If this is a flaw in my logic, I humbly apologize for taking up so much of your time.
[17 May 2004 12:39] Sergei Golubchik
Please do not submit the same bug more than once. An existing
bug report already describes this very problem. Even if you feel
that your issue is somewhat different, the resolution is likely
to be the same. Because of this, we hope you add your comments
to the original bug instead.

Thank you for your interest in MySQL.

Additional info:

see bugs http://bugs.mysql.com/1677 and http://bugs.mysql.com/1591
[17 May 2004 15:32] Shawn Green
Please forgive me. I did not think I was creating a duplicate. I Searched the bug list and didn't see an active bug with this basic issue. I *DID* however post nearly this exact text to the group/list "mysql@lists.mysql.com" and thought it needed to be listed as a bug as well. Can you please reference me to the original bug so that I can keep up with that one instead of this one? 
Thanks and apologies,
Shawn
[17 May 2004 15:38] Shawn Green
Nevermind, I seem to be blind today. Again, apologies!!!!
Shawn