Bug #104824 | Non-deterministic left join is filtering records from the result set | ||
---|---|---|---|
Submitted: | 3 Sep 2021 17:56 | Modified: | 7 Sep 2021 13:16 |
Reporter: | Bob Dusek | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: DML | Severity: | S4 (Feature request) |
Version: | 5.7.34 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[3 Sep 2021 17:56]
Bob Dusek
[6 Sep 2021 13:30]
MySQL Verification Team
Hi Mr. Dusek, Thank you for your bug report. However, we are not able to repeat your behaviour. We ran your test case hundred of times. Filtering of the records depended very much of the random element that you have introduced. Simply if random number was smaller or equal to 0.5, it can not show up in the result set !!!! However, when we inspected the output, all the rows were correct. Each pair of values has always showed up in the output. That is what is relevant in the stochastic query that you have devised. Here is the exceprt from our output : ================================================ t2Id t1Id 1 1 2 2 -1 3 t2Id t1Id 2 2 -1 3 3 4 t2Id t1Id 2 2 3 4 t2Id t1Id 1 1 t2Id t1Id -1 3 t2Id t1Id 1 1 2 2 -1 3 3 4 t2Id t1Id 1 1 2 2 t2Id t1Id 1 1 2 2 -1 3 t2Id t1Id 1 1 2 2 -1 3 3 4 t2Id t1Id 1 1 -1 3 3 4 t2Id t1Id 1 1 t2Id t1Id 2 2 -1 3 3 4 t2Id t1Id -1 3 t2Id t1Id 1 1 2 2 -1 3 t2Id t1Id 1 1 -1 3 t2Id t1Id 2 2 3 4 t2Id t1Id 1 1 2 2 3 4 t2Id t1Id -1 3 t2Id t1Id 2 2 t2Id t1Id 1 1 2 2 3 4 t2Id t1Id -1 3 t2Id t1Id 1 1 2 2 3 4 t2Id t1Id 1 1 3 4 t2Id t1Id 2 2 t2Id t1Id 3 4 t2Id t1Id 1 1 2 2 -1 3 t2Id t1Id 2 2 -1 3 ================================================ We can not repeat anything wrong in the output.
[6 Sep 2021 17:14]
Bob Dusek
The excerpt you posted from your output demonstrates the problem. Why doesn't every result set have all four records from table 1? I you run this query 100 times: SELECT tableA.id as tableAId, COALESCE(tableB.id, -1) as tableBId FROM tableA LEFT JOIN tableB ON (tableB.tableAId=tableA.id AND 1=0); You will get this result set 100 times: 1,-1 2,-1 3,-1 4,-1 And, if you run this query 100 times: SELECT tableA.id as tableAId, COALESCE(tableB.id, -1) as tableBId FROM tableA LEFT JOIN tableB ON (tableB.tableAId=tableA.id AND 0=0); You will get this result 100 times: 1,1 2,2 3,-1 4,3 Why should this query 100 times: SELECT tableA.id as tableAId, COALESCE(tableB.id, -1) as tableBId FROM tableA LEFT JOIN tableB ON (tableB.tableAId=tableA.id AND 0.5 > RAND()); Why wouldn't id's 1, 2, 3 and 4 show up in all 100 results? If it doesn't then the concept of a left join is completely broken.
[7 Sep 2021 11:51]
MySQL Verification Team
Hi Mr. Dusek, The answer is in the basic definitions of the SQL language. For this particular case, it is described in the standard in the chapters 4 and 7. ON clause can take only expressions that define full or semi-join. If there is an expression that does not qualify as join definition expression, then it is treated as a filtering condition. That is to say, it is the same as if you put it in the WHERE clause ........ Not a bug.
[7 Sep 2021 12:27]
Bob Dusek
So, why does this query not return an empty set? SELECT tableA.id as tableAId, COALESCE(tableB.id, -1) as tableBId FROM tableA LEFT JOIN tableB ON (tableB.tableAId=tableA.id AND 1=0); That query returns this EVERY time... 1,-1 2,-1 3,-1 4,-1
[7 Sep 2021 12:42]
MySQL Verification Team
Hi Mr. Dusek, Regarding your last example, you are correct. That is a bug, but a very small one, since you can easily correct it by setting 1=0 in the WHERE clause. Verified as a low priority bug.
[7 Sep 2021 13:07]
Bob Dusek
The bug seems to be that MySQL is arbitrarily determining how to evaluate join expressions, making it difficult to predict how they will behave in practice. Is there documentation available regarding how MySQL determines whether part of a join expression is evaluated with the join or moved to the WHERE clause?
[7 Sep 2021 13:16]
Bob Dusek
And, unpredictable determination of join logic is NOT a minor bug. It can impact lots and lots of queries for lots and lots of people. The only time such a condition seems to get moved to a WHERE clause (from my experimentation) is when there is non-determinism. We have a sub query that became non-deterministic when we converted to 5.7 even though it was stable in 5.6. This subquery started filtering our results from our result set and caused some pretty serious trouble. I have not been able to reproduce the issue on a new data set, but we have 5 different servers that exhibit the behavior with a large data set. We have resolved it by stabilizing the subquery, but if you "fix" something and suddenly move all such join logic to the WHERE clause the implications could be significant for many people.