Description:
In response to a posting on the MySQL lists I was trying to help someone answer the question of how many unique data items exist in similar columns in two separate tables. Since the person needing help was particularly iterested in counting IP addresses I set up a pair of tables with lists of IP addresses. I created multiple copies of the same values in both tables but only shared 1 value between the two.
I attempted to create a non-UNION query that recreates the function of the not-yet-implemented FULL OUTER JOIN. I took the first table and LEFT JOINed it to the second AND I RIGHT JOINed the second table in the same query. What I ended up with was the cartesian product (cross product) of the two tables. After thinking about it, it didn't seem like the right results (see analysis after problem setup).
How to repeat:
Create two tables, tablea and tableb. Here are the defs I used:
mysql> show create table tablea \G
*************************** 1. row ***************************
Table: tablea
Create Table: CREATE TABLE `tablea` (
`IP` varchar(16) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
mysql> show create table tableb \G
*************************** 1. row ***************************
Table: tableb
Create Table: CREATE TABLE `tableb` (
`IP` varchar(16) default NULL
) ENGINE=MyISAM DEFAULT CHARSET=latin1
1 row in set (0.00 sec)
Populate tablea and tableb with test data. Use dupliate entries for each value. Match data between tables on only one value. Here is my sample data:
mysql> select * from tablea;
+-------------+
| IP |
+-------------+
| 192.168.1.1 |
| 192.168.1.1 |
| 192.168.1.1 |
| 192.168.1.3 |
| 192.168.1.3 |
| 192.168.1.3 |
| 192.168.1.5 |
| 192.168.1.5 |
| 192.168.1.5 |
| 192.168.1.3 |
+-------------+
10 rows in set (0.00 sec)
mysql> select * from tableb;
+-------------+
| IP |
+-------------+
| 192.168.1.2 |
| 192.168.1.2 |
| 192.168.1.2 |
| 192.168.1.4 |
| 192.168.1.4 |
| 192.168.1.4 |
| 192.168.1.6 |
| 192.168.1.6 |
| 192.168.1.6 |
| 192.168.1.8 |
| 192.168.1.8 |
| 192.168.1.8 |
| 192.168.1.3 |
| 192.168.1.3 |
+-------------+
14 rows in set (0.00 sec)
Note that 192.168.1.3 appears in both tablea and tableb.
This was my test query:
SELECT COUNT(DISTINCT IFNULL(a.ip, c.ip)) as IP_Count
FROM tablea a
LEFT JOIN tableb b
ON b.ip = a.ip
RIGHT JOIN tableb c
ON c.ip = a.ip;
which resulted in 3 (a reasonable number, if not exactly what I wanted). I removed the functions and opened up the select to view the raw data that the COUNT(DISTINCT) was working on.
SELECT A.ip as a, B.ip as b, C.ip as c
FROM tablea A
LEFT JOIN tableb B
ON b.ip = a.ip
RIGHT JOIN tableb C
ON c.ip = a.ip;
which produced the following output (shortened for space):
+-------------+-------------+-------------+
| a | b | c |
+-------------+-------------+-------------+
| 192.168.1.1 | NULL | 192.168.1.2 |
----- clipped for space -----
| 192.168.1.5 | NULL | 192.168.1.8 |
| 192.168.1.5 | NULL | 192.168.1.8 |
| 192.168.1.5 | NULL | 192.168.1.8 |
| 192.168.1.3 | NULL | 192.168.1.8 |
| 192.168.1.1 | NULL | 192.168.1.3 |
| 192.168.1.1 | NULL | 192.168.1.3 |
| 192.168.1.1 | NULL | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL | 192.168.1.3 |
| 192.168.1.5 | NULL | 192.168.1.3 |
| 192.168.1.5 | NULL | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.1 | NULL | 192.168.1.3 |
| 192.168.1.1 | NULL | 192.168.1.3 |
| 192.168.1.1 | NULL | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL | 192.168.1.3 |
| 192.168.1.5 | NULL | 192.168.1.3 |
| 192.168.1.5 | NULL | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+-------------+
148 rows in set (0.00 sec)
I recognize this as a cartesian product but I don't think that was what I asked for. I tried to put myself in the shoes of the query engine and thought about how would I solve that query. I decided that I would have to create an intermediate result set based on one of the JOIN conditions then use that subset of records as one side of the second JOIN.
Case 1:
Create an intermediate result set (T) from:
FROM tablea A
LEFT JOIN tableb B
ON B.ip = A.ip
(T)
+-------------+-------------+
| A.ip | B.ip |
+-------------+-------------+
| 192.168.1.1 | NULL |
| 192.168.1.1 | NULL |
| 192.168.1.1 | NULL |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.5 | NULL |
| 192.168.1.5 | NULL |
| 192.168.1.5 | NULL |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+
then RIGHT JOIN T to C (FROM T RIGHT JOIN tableb C on C.IP = A.IP). In this case I expect to produce 28 rows. 12 of them from unique C values and 16 (8x2) more for the matches on 192.168.1.3.
Case 2:
Create an intermediate result set T2:
FROM tablea A
RIGHT JOIN tableb C
ON C.ip = B.ip;
(T2)
+-------------+-------------+
| A.ip | C.ip |
+-------------+-------------+
| NULL | 192.168.1.2 |
| NULL | 192.168.1.2 |
| NULL | 192.168.1.2 |
| NULL | 192.168.1.4 |
| NULL | 192.168.1.4 |
| NULL | 192.168.1.4 |
| NULL | 192.168.1.6 |
| NULL | 192.168.1.6 |
| NULL | 192.168.1.6 |
| NULL | 192.168.1.8 |
| NULL | 192.168.1.8 |
| NULL | 192.168.1.8 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
| 192.168.1.3 | 192.168.1.3 |
+-------------+-------------+
20 rows
And LEFT JOIN B to T2. (FROM T2 LEFT JOIN tableb B on B.IP = A.IP) to get 28 rows as a result. (12 of them wher A.IP is null and 16 (the 8 from T2 times the 2 from B) where they match on 192.168.1.3)
In both cases I only get 28 rows of data and not the full 148 of the cross product. If this is a flaw in my logic, I humbly apologize for taking up so much of your time.