Description:
When you want to scan for a value that may match any of a number of allowed values, there are currently two options:
1) Build a condition with one OR branch for each allowed value
2) Filter in the application on the client side
Option 1) can result in a lot of conditions being sent to the data nodes (see the NdbScanFilter sketch at the end of this section), and execution on the data nodes may be slow if there are too many branch conditions.
Option 2) can result in a lot of data being sent to the client.
A third option is to combine 1) and 2).
A nice feature would be the ability to compare against a set of values.
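For illustration, option 1) with the existing NdbScanFilter API looks roughly like this (a sketch only; scanOp and the use of attribute id 0 for column x1 are assumptions, and error handling is omitted):
// Option 1) today: one OR'ed equality branch per allowed value.
// Every branch becomes a separate condition in the interpreted
// program sent to the data nodes.
const Uint32 allowed[] = {3, 5, 7, 9, 11, 13, 15, 17, 19, 21};
NdbScanFilter filter(scanOp);      // scanOp: an NdbScanOperation
filter.begin(NdbScanFilter::OR);
for (unsigned i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
  filter.eq(0 /* x1 */, allowed[i]);
filter.end();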
How to repeat:
Implement something like the following using the NDB API, but with more rows and more values in the WHERE xn IN (...) part. The EXPLAIN output shows that the conditions have been pushed down to the storage engine, but the storage engine will then still fall back to option 1) or 2) above.
mysql> CREATE TABLE scanMe (x1 int, x2 int, x3 int, key (x1, x2, x3)) ENGINE=ndbcluster;
Query OK, 0 rows affected (0.04 sec)
mysql> INSERT INTO scanMe VALUES (1,1,1),(1,2,3),(3,2,1);
Query OK, 3 rows affected (0.00 sec)
mysql> SELECT * FROM scanMe WHERE x1 IN (3,5,7,9,11,13,15,17,19,21) OR x2 IN (3,5,7,9,11,13,15,17,19,21) OR x3 IN (3,5,7,9,11,13,15,17,19,21);
+------+------+------+
| x1   | x2   | x3   |
+------+------+------+
|    3 |    2 |    1 |
|    1 |    2 |    3 |
+------+------+------+
2 rows in set (0.00 sec)
mysql> EXPLAIN SELECT * FROM scanMe WHERE x1 IN (3,5,7,9,11,13,15,17,19,21) OR x2 IN (3,5,7,9,11,13,15,17,19,21) OR x3 IN (3,5,7,9,11,13,15,17,19,21);
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------------+
| id | select_type | table  | type | possible_keys | key  | key_len | ref  | rows | Extra                             |
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------------+
|  1 | SIMPLE      | scanMe | ALL  | x1            | NULL | NULL    | NULL |    3 | Using where with pushed condition |
+----+-------------+--------+------+---------------+------+---------+------+------+-----------------------------------+
1 row in set (0.00 sec)
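For reference, a minimal NDB API scan over scanMe with the OR'ed filter from the sketch in the description could look roughly like this (assumes an already connected Ndb_cluster_connection named cluster_connection and attribute ids 0, 1, 2 for x1, x2, x3; error handling omitted):
// Sketch of the repeat in the NDB API: scan scanMe with a pushed
// filter of OR'ed equalities over x1, x2 and x3.
Ndb ndb(&cluster_connection, "test");
ndb.init();
NdbTransaction* trans = ndb.startTransaction();
NdbScanOperation* scanOp = trans->getNdbScanOperation("scanMe");
scanOp->readTuples(NdbOperation::LM_CommittedRead);

const Uint32 allowed[] = {3, 5, 7, 9, 11, 13, 15, 17, 19, 21};
NdbScanFilter filter(scanOp);
filter.begin(NdbScanFilter::OR);
for (int col = 0; col < 3; col++)
  for (unsigned i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
    filter.eq(col, allowed[i]);
filter.end();

NdbRecAttr* x1 = scanOp->getValue("x1");
NdbRecAttr* x2 = scanOp->getValue("x2");
NdbRecAttr* x3 = scanOp->getValue("x3");
trans->execute(NdbTransaction::NoCommit);
while (scanOp->nextResult(true) == 0)
  printf("%d %d %d\n", x1->int32_value(), x2->int32_value(), x3->int32_value());
ndb.closeTransaction(trans);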
Suggested fix:
// Draft API idea:
class NdbCmpValueSet {
public:
  NdbCmpValueSet(Uint32 length = 0);
  // Insert a value into the set; returns an error/success code.
  int insertValue(const void* value, Uint32 length = 0);
  // Check whether a value is present in the set.
  bool haveValue(const void* value, Uint32 length = 0);
};
class NdbScanFilter {
  // ...
  // Match a column's value against the set (analogous to cmp()).
  int cmpSet(int ColId, const NdbCmpValueSet&);
  // ...
};
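A usage sketch of the proposed API (all names hypothetical, following the draft above; scanOp and attribute id 0 for x1 are assumptions):
// Build the set once, then push a single set-membership condition
// instead of one branch per allowed value.
NdbCmpValueSet set(sizeof(Uint32));
const Uint32 allowed[] = {3, 5, 7, 9, 11, 13, 15, 17, 19, 21};
for (unsigned i = 0; i < sizeof(allowed)/sizeof(allowed[0]); i++)
  set.insertValue(&allowed[i], sizeof(allowed[i]));

NdbScanFilter filter(scanOp);
filter.begin(NdbScanFilter::AND);
filter.cmpSet(0 /* x1 */, set);  // matches when x1 is in the set
filter.end();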
// The value set would use a hash table for fast lookup and would be
// sent to the data nodes. One idea would be to send only the hash
// indexes where there is a value, and not the values themselves. This
// could result in false positives, i.e. rows being returned to the
// API node that should not be included. It might be OK to filter
// those out on the API node (transparently to the user), but when
// scanning for values NOT in the set this would instead result in
// false negatives, and that is not OK. This kind of optimization
// could be done internally and should not have much impact on the
// suggested API above.
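To make the hash-index idea concrete, here is a standalone sketch of such a bucket signature (essentially a one-hash Bloom filter; the hash function, bucket count and helper names are arbitrary choices for illustration, not part of any existing API):
#include <bitset>
typedef unsigned int Uint32;        // as in ndb_types.h

// Only this bitmap of occupied hash buckets would be shipped to the
// data nodes, not the values themselves.
static const unsigned NBUCKETS = 1024;
static unsigned bucket(Uint32 v) { return (v * 2654435761u) % NBUCKETS; }

static std::bitset<NBUCKETS> signature;

// API side: mark the bucket of each inserted value.
void addToSignature(Uint32 v) { signature.set(bucket(v)); }

// Data node side: can report false positives (a different value that
// hashes to an occupied bucket) but never false negatives, which is
// why the trick works for IN-set scans but not for NOT-IN-set scans.
bool maybeInSet(Uint32 v) { return signature.test(bucket(v)); }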