Bug #73368 NOT IN doesn't return expected result
Submitted: 23 Jul 2014 20:49 Modified: 27 Oct 2014 17:36
Reporter: Nikola Glavinceski Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S3 (Non-critical)
Version:5.6.19 OS:Linux (Ubuntu 12.04)
Assigned to: CPU Architecture:Any
Tags: not exists, NOT IN

[23 Jul 2014 20:49] Nikola Glavinceski
Description:
NOT IN with subquery doesn't return expected results on MySql 5.6.19 Ubuntu, while it returns correct results on 5.6.13 Mac OS X.

How to repeat:
Table: customers
+------------------------+--------------+------+-----+---------+----------------+
| Field                  | Type         | Null | Key | Default | Extra          |
+------------------------+--------------+------+-----+---------+----------------+
| id                     | int(11)      | NO   | PRI | NULL    | auto_increment |
| email                  | varchar(255) | NO   | MUL | NULL    |                |

Table: emails_queue
+--------------+-------------------------------+------+-----+---------------+----------------+
| Field        | Type                          | Null | Key | Default       | Extra          |
+--------------+-------------------------------+------+-----+---------------+----------------+
| id           | bigint(20) unsigned           | NO   | PRI | NULL          | auto_increment |
| template_id  | bigint(20)                    | YES  |     | NULL          |                |
| to_address   | varchar(500)                  | YES  | MUL | NULL          |                |
+--------------+-------------------------------+------+-----+---------------+----------------+

mysql> SELECT COUNT(*) FROM `customers`;
+----------+
| COUNT(*) |
+----------+
|    96703 |
+----------+

mysql> SELECT COUNT(*) FROM `customers` WHERE email NOT IN(SELECT `to_address` FROM `emails_queue`);
+----------+
| COUNT(*) |
+----------+
|    96703 |
+----------+

mysql> SELECT COUNT(*) FROM `customers` WHERE NOT EXISTS(SELECT `to_address` FROM `emails_queue` WHERE `to_address` = email);
+----------+
| COUNT(*) |
+----------+
|    72367 |
+----------+

I think that "SELECT COUNT(*) FROM `customers` WHERE email NOT IN(SELECT `to_address` FROM `emails_queue`)" should return  "72367" instead "96703" 
Want to mention that the "email" field from customers table doesn't contains any null or empty field.
[24 Jul 2014 13:58] MySQL Verification Team
Hello,

Your bug report looks very interesting. However, we need to have to get from you some additional data. We need to repeat the bug itself, so that we can verify it. So, ....

First of all, please upload a dump of both tables. Upload them in .tar.gz format so that a file would be smaller.

Next, please send us the results that differ between 5.6.13 and 5.6.19. Send only results of the query that differs between the two versions. Platform should not be the one causing this.
[24 Jul 2014 14:01] MySQL Verification Team
Hello,

Your bug report looks very interesting. However, we need to have to get from you some additional data. We need to repeat the bug itself, so that we can verify it. So, ....

First of all, please upload a dump of both tables. Upload them in .tar.gz format so that a file would be smaller.

Next, please send us the results that differ between 5.6.13 and 5.6.19. Send only results of the query that differs between the two versions. Platform should not be the one causing this. I guess that problematic query should be the one with NOT IN nested query.

Please, also send the results, on both versions, for this query:

SELECT COUNT(*) FROM `customers`, `emails_queue` WHERE `to_address` != email;

Many thanks in advance.
[24 Jul 2014 14:30] MySQL Verification Team
We also need dumps of the tables in order to check for the engine used and character sets.
[24 Jul 2014 14:41] MySQL Verification Team
Please, can you try, with version 5.6.19, to run all of your queries with these two commands, each preceding a set of queries.

set optimizer_switch='semijoin=off'; 

and 

set optimizer_switch='semijoin=on';

Run queries after each of those two SET commands. Let us know if you see any differences.

Thank you in advance.
[24 Jul 2014 20:11] Nikola Glavinceski
Hello,

I am sorry but I can't provide dump of tables because they contains emails of our customers.

But I was able to fix the issue by replacing "varchar(500)" with "varchar(255)" on `to_address` from the `emails_queue` table.

Thanks,
Nikola
[24 Jul 2014 20:25] Nikola Glavinceski
Maybe is important to say that the charset of both tables is utf8. And the data of emails_queue.to_address are imported from customers.email.

Thanks,
Nikola
[24 Jul 2014 21:02] Sveta Smirnova
Thank you for the feedback.

Please provide output SHOW CREATE TABLE customers and SHOW CREATE TABLE emails_queue
[24 Jul 2014 21:35] Nikola Glavinceski
db dump with dummy data

Attachment: test.tar.gz (application/x-gzip, text), 805.53 KiB.

[24 Jul 2014 21:38] Nikola Glavinceski
I was able to reproduce the issue with dummy data.
Please use the queries from above on MySql 5.6.19 and I believe you will see the issue.
[25 Jul 2014 18:28] Sveta Smirnova
Thank you for the feedback.

Verified as described. Bug stops showing after I delete following rows:

delete from customers where id > 40000;

Workaround:

SELECT COUNT(*) FROM `customers` WHERE email NOT IN(SELECT `to_address` FROM `emails_queue` WHERE `to_address` = email);
[17 Sep 2014 10:06] Guilhem Bichot
Posted by developer:
 
other workaround:
set optimizer_switch='materialization=off';
[27 Oct 2014 17:36] Paul DuBois
Noted in 5.6.22, 5.7.6 changelogs.

Certain queries for which subquery materialization or UNION DISTINCT
was used together with a hash index on a temporary table could
produce incorrect results or cause a server exit.
[10 Dec 2014 14:48] Laurynas Biveinis
$ bzr log -r 6209 -n0
------------------------------------------------------------
revno: 6209 [merge]
committer: Guilhem Bichot <guilhem.bichot@oracle.com>
branch nick: 5.6
timestamp: Fri 2014-10-17 11:43:53 +0200
message:
  merge with latest 5.6
    ------------------------------------------------------------
    revno: 6207.1.1
    committer: Guilhem Bichot <guilhem.bichot@oracle.com>
    branch nick: 5.6
    timestamp: Wed 2014-09-24 15:30:51 +0200
    message:
      Fix for Bug#19297190 NOT IN DOESN'T RETURN EXPECTED RESULT
      We have a query with WHERE ... NOT IN (SELECT ...).
      The inner field is 1503 bytes (500 chars * 3, because it has utf8 charset).
      subquery_allows_materialization() considers subq-mat:
      per the fix for Bug 17566396: MATERIALIZATION IS NOT CHOSEN FOR LONG
      UTF8 VARCHAR FIELD, this field is small enough to _not_ be changed to a BLOB
      when stored into the tmp table (Item::is_blob_field() sees that 500 <
      CONVERT_IF_BIGGER_TO_BLOB), so subq-mat is chosen. Before that fix,
      the left operand of the inequality comparison would be the length in
      _bytes_ and subq-mat would not be chosen.
      In hash_sj_engine::setup():
      create_tmp_table() uses MEMORY table with a regular unique index,
      without "unique constraint"; right after creation, setup() verifies
      that the table has no "unique constraint", otherwise subq-mat would
      fail - see comment in setup() and see how
      indexsubquery_engine::setup() does a regular index lookup of a key
      made with copy_ref_key(): to look up into "unique constraint" it would
      need to compute a checksum in copy_ref_key().
      So far so good.
      At some point during materialization, MEMORY tmp table becomes full, we go
      into create_myisam_from_heap() then create_myisam_tmp_table():
      
          if (keyinfo->key_length >= table->file->max_key_length() ||
      	keyinfo->user_defined_key_parts > table->file->max_key_parts() ||
      	share->uniques)
          {
            /* Can't create a key; Make a unique constraint instead of a key */
            share->keys=    0;
            share->uniques= 1;
            using_unique_constraint=1;
      
      alas the first condition is 1503 > 1000 (engine is MyISAM) so we
      create "unique constraint" on the MyISAM table. Subq-mat is not ready
      for this, as we know, so we get a bad result: in fact, all look ups
      then return "not found", so "NOT IN" always returns TRUE.
      
      In 5.7, subq-mat has been made able to use "unique constraint" (see
      Bug 18227171 which was part of wl 6711); however this same testcase
      still crashes but for different reasons, a separate 5.7 patch will be
      made.
      
      For fixing 5.6:
      - we cannot simply undo the fix for Bug 17566396 as there was an
      associated customer
      - on the other hand we cannot modify the logic in
      create_myisam_tmp_table() to _not_ create a "unique constraint" in
      this case (it's 5.6, we must be conservative)
      - proposal: take a middle path: make subquery_allows_materialization()
      test the length (by extending is_blob_field()): if length is >=1000
      bytes, then don't use subq-mat (because there is risk that MyISAM
      table is used and creates "unique constraint"). 1000 bytes is 333 utf8
      characters. So we would have, for the limit of # of chars:
      - before the fix for Bug 17566396: 512/3 = 170 chars,
      - after it: 512 chars
      - after the proposed fix: 332 chars.
      The testcase in Bug 17566396 used 250 chars so still uses
      subq-mat, which is good.