Bug #110245 Issue with NOT IN
Submitted: 1 Mar 2023 3:56 Modified: 2 Mar 2023 9:59
Reporter: Pinhan Zhao Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:8.0.32 OS:Any
Assigned to: CPU Architecture:Any

[1 Mar 2023 3:56] Pinhan Zhao
Description:
With the following database:
```
CREATE TABLE PLAYBACK (
  SESSION_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  START_TIME INTEGER,
  END_TIME INTEGER
);
INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1);
INSERT INTO PLAYBACK VALUES (0, 0, 1, 1);
CREATE TABLE ADS (
  AD_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  TIMESTAMP INTEGER
);
INSERT INTO ADS VALUES (0, 0, 1);
INSERT INTO ADS VALUES (1, 0, 1);
```

After running Q1,
```
-- Q1
SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;
```

the output is a table with one column and one row, and the only value is 0:
+-----------------------+
| ANY_VALUE(SESSION_ID) |
+-----------------------+
|                     0 |
+-----------------------+

And we have Q2:
```
-- Q2
SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);
```
Q2 is basically a query of `SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN Q1`

Therefore, we would expect it outputs one row with SESSION_ID of -1 .  However, the output is an empty set.  We can further investigate this by running Q3:
```
-- Q3
SELECT SESSION_ID, (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK;
```

The output is:
+------------+------+----------------------+
| SESSION_ID | Q1   | SESSION_ID NOT IN Q1 |
+------------+------+----------------------+
|         -1 |    0 |                    0 |
|          0 |    0 |                    0 |
+------------+------+----------------------+

We can see the result doesn’t make sense, because -1 NOT IN (0) shouldn’t be false.

How to repeat:
CREATE TABLE PLAYBACK (
  SESSION_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  START_TIME INTEGER,
  END_TIME INTEGER
);
INSERT INTO PLAYBACK VALUES (-1, 0, 1, 1);
INSERT INTO PLAYBACK VALUES (0, 0, 1, 1);
CREATE TABLE ADS (
  AD_ID INTEGER primary key,
  CUSTOMER_ID INTEGER,
  TIMESTAMP INTEGER
);
INSERT INTO ADS VALUES (0, 0, 1);
INSERT INTO ADS VALUES (1, 0, 1);

-- Q1
SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID;
-- Q2
SELECT SESSION_ID FROM PLAYBACK WHERE SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID);
-- Q3
SELECT SESSION_ID, (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS Q1, SESSION_ID NOT IN (SELECT ANY_VALUE(SESSION_ID) FROM PLAYBACK JOIN ADS ON PLAYBACK.CUSTOMER_ID=ADS.CUSTOMER_ID WHERE TIMESTAMP >= START_TIME AND TIMESTAMP <= END_TIME GROUP BY PLAYBACK.CUSTOMER_ID) AS 'SESSION_ID NOT IN Q1' FROM PLAYBACK;
[1 Mar 2023 6:33] MySQL Verification Team
Hello Luca Zhao,

Thank you for the report and test case.

regards,
Umesh
[2 Mar 2023 9:59] Norvald Ryeng
Posted by developer:
 
I'm closing this as not a bug.

ANY_VALUE means just that -- any value. There is no guarantee that this value is the same if the query is repeated, or that it will be the same value for all references to ANY_VALUE of the same column in a single query.

All three query results are correct. The value used for ANY_VALUE(SESSION_ID) is one of the possible values for that column in all three cases.

Q1: Possible values are -1 and 0. 0 is chosen.

Q2: Loop over all rows of PLAYBACK, and evaluate the WHERE condition, including the subquery, for each row of PLAYBACK. Both SESSION_ID values in PLAYBACK are valid results of ANY_VALUE(SESSION_ID) in the inner query. The query is evaluated individually for each row, and there is no guarantee that the result will be the same in both evaluations. The observed result is an evaluation where SESSION_ID is indeed the value chosen for ANY_VALUE(SESSION_ID) for each row.

Q3: Again, the subqueries are evaluated several times, and there is no guarantee that ANY_VALUE(SESSION_ID) returns the same value in all cases. The column named "SESSION_ID NOT IN Q1" is not actually checking against the column named "Q1", but against a re-evaluation of the query, which may return either -1 or 0.

The observed results may not be what was intended by the person who wrote the queries, but they are indeed valid interpretations of the queries. Therefore, I'm closing this as not a bug.