Description:
When you want to scan for a value that may match any of a number of allowed values, there are currently two options:
1) Build a condition with one OR branch for each allowed value
2) Filter in the application on the client side
Option 1) can result in a lot of conditions being sent to the data nodes (see the NdbScanFilter sketch at the end of this section), and execution on the data nodes may be slow if there are too many branch conditions.
Option 2) can result in a lot of data being sent to the client.
A third option is to combine 1) and 2).
A nice feature would be the ability to compare against a set of values.
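For illustration, option 1) with the existing NdbScanFilter API looks roughly like this (a sketch only; scanOp and the use of attribute id 0 for column x1 are assumptions, and error handling is omitted):
// Option 1) today: one OR'ed equality branch per allowed value.
// Every branch becomes a separate condition in the interpreted
// program sent to the data nodes.
const Uint32 allowed[] = {3, 5, 7, 9, 11, 13, 15, 17, 19, 21};
NdbScanFilter filter(scanOp);      // scanOp: an NdbScanOperation
filter.begin(NdbScanFilter::OR);
for (unsigned i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
  filter.eq(0 /* x1 */, allowed[i]);
filter.end();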
How to repeat:
Implement something like the following using the NDB API, but with more rows and more values in the WHERE xn IN (...) part. The EXPLAIN output shows that the conditions have been pushed down to the storage engine, but the storage engine will then still fall back to option 1) or 2) above.
mysql> CREATE TABLE scanMe (x1 int, x2 int, x3 int, key (x1, x2, x3)) ENGINE=ndbcluster;
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO scanMe VALUES (1,1,1),(1,2,3),(3,2,1);
Query OK, 3 rows affected (0.00 sec)
mysql> SELECT * FROM scanMe WHERE x1 IN (3,5,7,9,11,13,15,17,19,21) OR x2 IN (3,5,7,9,11,13,15,17,19,21) OR x3 IN (3,5,7,9,11,13,15,17,19,21);
+------+------+------+
| x1   | x2   | x3   |
+------+------+------+
|    3 |    2 |    1 |
|    1 |    2 |    3 |
+------+------+------+
2 rows in set (0.00 sec)
mysql> EXPLAIN SELECT * FROM scanMe WHERE x1 IN (3,5,7,9,11,13,15,17,19,21) OR x2 IN (3,5,7,9,11,13,15,17,19,21) OR x3 IN (3,5,7,9,11,13,15,17,19,21);
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra                             |
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------------+
|  1 | SIMPLE      | scanMe | ALL  | x1            | NULL | NULL    | NULL |    3 | Using where with pushed condition |
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------------+
1 row in set (0.00 sec)
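For reference, a minimal NDB API scan over scanMe with the OR'ed filter from the sketch in the description could look roughly like this (assumes an already connected Ndb_cluster_connection named cluster_connection and attribute ids 0, 1, 2 for x1, x2, x3; error handling omitted):
// Sketch of the repeat in the NDB API: scan scanMe with a pushed
// filter of OR'ed equalities over x1, x2 and x3.
Ndb ndb(&cluster_connection, "test");
ndb.init();
NdbTransaction* trans = ndb.startTransaction();
NdbScanOperation* scanOp = trans->getNdbScanOperation("scanMe");
scanOp->readTuples(NdbOperation::LM_CommittedRead);

const Uint32 allowed[] = {3, 5, 7, 9, 11, 13, 15, 17, 19, 21};
NdbScanFilter filter(scanOp);
filter.begin(NdbScanFilter::OR);
for (int col = 0; col < 3; col++)
  for (unsigned i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
    filter.eq(col, allowed[i]);
filter.end();

NdbRecAttr* x1 = scanOp->getValue("x1");
NdbRecAttr* x2 = scanOp->getValue("x2");
NdbRecAttr* x3 = scanOp->getValue("x3");
trans->execute(NdbTransaction::NoCommit);
while (scanOp->nextResult(true) == 0)
  printf("%d %d %d\n", x1->int32_value(), x2->int32_value(), x3->int32_value());
ndb.closeTransaction(trans);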
Suggested fix:
// Draft API idea:
class NdbCmpValueSet {
public:
  NdbCmpValueSet(Uint32 length = 0);
  // Insert a value into the set; returns an error/success code.
  int insertValue(const void* value, Uint32 length = 0);
  // Check whether a value is present in the set.
  bool haveValue(const void* value, Uint32 length = 0);
};
class NdbScanFilter {
  // ...
  // Match a column's value against the set (analogous to cmp()).
  int cmpSet(int ColId, const NdbCmpValueSet&);
  // ...
};
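A usage sketch of the proposed API (all names hypothetical, following the draft above; scanOp and attribute id 0 for x1 are assumptions):
// Build the set once, then push a single set-membership condition
// instead of one branch per allowed value.
NdbCmpValueSet set(sizeof(Uint32));
const Uint32 allowed[] = {3, 5, 7, 9, 11, 13, 15, 17, 19, 21};
for (unsigned i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
  set.insertValue(&allowed[i], sizeof(allowed[i]));

NdbScanFilter filter(scanOp);
filter.begin(NdbScanFilter::AND);
filter.cmpSet(0 /* x1 */, set);  // matches when x1 is in the set
filter.end();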
// The value set would use a hash table for fast lookup and would be
// sent to the data nodes. One idea would be to send only the hash
// indexes where there is a value, and not the values themselves. This
// could result in false positives, i.e. rows being returned to the
// API node that should not be included. It might be OK to filter
// those out on the API node (transparently to the user), but when
// scanning for values NOT in the set this would instead result in
// false negatives, and that is not OK. This kind of optimization
// could be done internally and should not have much impact on the
// suggested API above.
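To make the hash-index idea concrete, here is a standalone sketch of such a bucket signature (essentially a one-hash Bloom filter; the hash function, bucket count and helper names are arbitrary choices for illustration, not part of any existing API):
#include <bitset>
typedef unsigned int Uint32;        // as in ndb_types.h

// Only this bitmap of occupied hash buckets would be shipped to the
// data nodes, not the values themselves.
static const unsigned NBUCKETS = 1024;
static unsigned bucket(Uint32 v) { return (v * 2654435761u) % NBUCKETS; }

static std::bitset<NBUCKETS> signature;

// API side: mark the bucket of each inserted value.
void addToSignature(Uint32 v) { signature.set(bucket(v)); }

// Data node side: can report false positives (a different value that
// hashes to an occupied bucket) but never false negatives, which is
// why the trick works for IN-set scans but not for NOT-IN-set scans.
bool maybeInSet(Uint32 v) { return signature.test(bucket(v)); }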