Bug #23692 Falcon: searches fail if data is 0x00
Submitted: 26 Oct 2006 20:49 Modified: 15 May 2009 15:55
Reporter: Peter Gulutzan Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Falcon storage engine Severity:S3 (Non-critical)
Version:5.1.13-falcon-alpha-pb-debug OS:Linux (SUSE 10.0 / 64-bit)
Assigned to: Lars-Erik Bjørk CPU Architecture:Any
Tags: F_ENCODING, INDEX

[26 Oct 2006 20:49] Peter Gulutzan
Description:
I create a Falcon table with a character column with an index.
I insert 0x00.
I search for < ''.
I see 0 rows.

How to repeat:
ChangeSet@1.2346, 2006-10-26 10:08:37-04:00

mysql> create table t62 (s1 char(1),key(s1)) engine=falcon;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t62 values (0x00);
Query OK, 1 row affected (0.00 sec)

mysql> select * from t62 where s1<'';
Empty set (0.00 sec)
[27 Oct 2006 21:16] Hakan Küçükyılmaz
Verified on Linux 32-bit, change set 1.2348, 2006-10-27.

Added test case falcon_bug_23692.test to 5.1-falcon tree.

Regards, Hakan
[19 Dec 2006 18:05] Peter Gulutzan
Another test case:

mysql> create table tv (s1 varbinary(5)) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> create index iv on tv (s1);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into tv values (0x02),(0x0202);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select hex(s1) from tv where s1 >= 0x02 or s1 = 0x0202;
+---------+
| hex(s1) |
+---------+
| 02      |
+---------+
1 row in set (0.01 sec)
[20 Dec 2006 0:46] Hakan Küçükyılmaz
Verified latest comment from Peter Gulutzan on Linux 32-bit and added additional test to test case falcon_bug_23962.test and pushed it to 5.1-falcon tree.

Regards, Hakan
[21 Dec 2006 16:35] Kevin Lewis
Two weeks ago, I fixed BUG 24861 (Incorrect handling af varbinary fields) by making a special case for a BINARY collation.  The situation was the old falcon_bug_137 test within falcon_bugs.

# Mantis bug #137: Varbinary use fails if index exists
create table t4 (s1 varbinary(3) primary key);
insert into t4 values (0x41),(0x4100),(0x410000);

This test would fail because the Falcon code was trimming the string of trailing spaces and nullbytes.  I questioned whether a varbinary should have a CHARSET anyway, but made a special case for when it does.

Now, since the most recent merge with the mysql-5.1 tree, the CHARSET is no longer associated with varbinary.  Varbinary fields are given the type string when the field is defined thru the storage engine handler.  Without the CHARSET, I am not sure how to tell the difference between a varchar and a varbinary.

Varbinary fields are being declared as type string, and bytes between 0x00 and 0x20 have special meaning.  This is why this test fails.   

mysql> create table t4 (s1 varbinary(3) primary key);
mysql> insert into t4 values (0x41),(0x4100),(0x4120),(0x4121),(0x4141);
mysql> select hex(s1) from t4 where s1 > 0x41;
+---------+
| hex(s1) |
+---------+
| 4121    |
| 4141    |
+---------+
2 rows in set (0.00 sec)

In order to fix this, I need to figure out how to tell the difference between a varbinary field with no CHARSET and a varchar with no CHARSET.
[10 May 2007 3:17] Kevin Lewis
This problem has no good solutions.  Either we get inaccurate results when looking for values that have ending characters less than the space character, or we get a general drop in performance.

The problem stems from the fact that Falcon uses prefix compression.  In order to do this efficiently, all trailing blanks must be trimmed off.  In SQL, trailing blanks are only assumed or implied, but when Falcon is sent strings from the server engine, they are real.  The server explicitly pads all search strings with spaces up to the maximum byte length that those strings could psssibly be.

So the Falcon engine is in the business of stripping trailing blanks off the end of search strings in order to find them in the index.  

Consider the following strings;
CREATE TABLE t3 (a char(5), key (a));
INSERT INTO t3 VALUES (0x4200), (0x4209), (0x42), (0x4242);

According to SQL, they should be sorted in the order entered.  But Falcon stores (0x4200) and (0x42) together as the same value, befor (0x4209).  So;
SELECT hex(a) FROM t3 WHERE a < 0x42;  
   returns nothing, but should return 0x4200 & 0x4209

Falcon takes the string with the implied trailing value of 0x20 and sorts it as if it had a trailing value of 0x00, so trailing values between 0x00 and 0x20 are not found for a less than search.  

If Falcon started treating the high value of a search range, like values < 'b', as if it was looking for values < 'b ', then  it would return from the index all the records where the value  = 'b'.  This is because the code that finds and returns hits in an index has no knowledge of the type of data in the comparison.  All key entries are sortable bytewise, left to right, so searching can be done quickly.  This solution, appending a space to a search string, will find too many hits, and waste too much time checking those hits against the record.  This would be especially bad in an index that had thousands of key values = 'b'.  In other words, large numbers of duplicates would have a VERY unoptimized search.  We cannot do thisjust to find  values with trailing character > 0 and < space.  It is not worth it.

Another solution is to store at least one training space character after each string that ended with a hex value of space or higher.  So the values above would be stored as 0x42, 0x4209, 0x4220 & 0x424220.  This would cause them to be sorted correctly, but it would add one or two bytes to almost every non-duplicate key page node, depending if the character set is collated as unicode.  This is also a performance problem, increasing the size of the index.

So I am in a quandary as to how this particular bug can be fixed without causing a general performance hit, which is not allowed!!!
[14 May 2007 15:32] Kevin Lewis
As may be implied from the previous entry, this bug may not be fixable without a performance hit.  After discussions with Jim Starkey, we do not think this bug is worth fixing if it causes a performance hit.  So I would like to mark this 'Won't Fix'.

We should document that a search for a string that ends with a character who's sort weight is greater than zero and less than the implied SQL pad character (space), will work using '=' and the explicit key value.  But other searches will be problematic.
[14 May 2007 21:38] Kevin Lewis
The code that I checked in last week does not work as I expected for this test.  My intention was to be able to find the 0x4109 with a like 'a%' search.  As I look at the fix for that, I think I might be able to use the implied pad character on the highKey of a range search in order to always return the correct records.

Putting this back to 'In Progress' to see if I can make a solution that works without slowing Falcon down.
[16 May 2007 16:33] Kevin Lewis
Pushed an enhanced testcase, which still does not work.

I looked into adding a single pad character to the highKey so that it is positioned after all the values with trailing characters between 0x00 and the pad character. But this would for the inclusion of the highKey hits even if the search is field < 'value'.  I thought this would cause a performance problem, but them I found that the end points of ALL range searches are already being included even for 'less than' and 'greater than' searches where the end points are not needed.  

Jim indicated that this is necessary for indexes that contain numeric fields that may contain 'round off' differences from the exact stored value.  Falcon indexes arre designed to be less exact in resolving hits.  So the final resolve is always done against the records.  

Since the endpoints are always included in every range search, it causes no harm to add one pad character to the highKey during an index scan.  So I will pursue this solution.
[17 May 2007 19:42] Kevin Lewis
We will fix this after the beta release since it affect some very commonly called code;  scanIndex().  And the result is the ability to retrieve records using a < search when the key matches the search key but with low bytes appended, such as a tab.  In other words, the benefit is rarely needed, but the changed code is used all the time.  So the risk is much higher than the benefit.
[26 Jun 2008 16:33] Peter Gulutzan
On 2007-05-17 Kevin Lewis marked this "to be fixed later"
saying "We will fix this after the beta release ...".
Since it is now after the beta release, I set it back to
"Verified".
[26 Jun 2008 22:01] Kevin Lewis
Another character-set related bug.
[26 Sep 2008 16:59] John Embretsen
Test case for this bug. Moved from falcon_team suite. Will fail until bug is fixed.

Attachment: falcon_bug_23692.test (application/octet-stream, text), 1.39 KiB.

[26 Sep 2008 17:00] John Embretsen
Test result file associated with the attached falcon_bug_23692.test.

Attachment: falcon_bug_23692.result (application/octet-stream, text), 1.51 KiB.

[20 Feb 2009 12:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67022

3030 lars-erik.bjork@sun.com	2009-02-20
      This is a patch for bug#23692 (Falcon: searches fail if data is 0x00)
      
      The solution is to append a pad key to the upper 
      bound search key, if its last character is equal
      to or greater than the pad character. This is done
      In order to make it position after all values with 
      trailing characters lower than the pad character.
      
      For fields with a collation registered
      [if (field->collation)], there is no efficient way
      of checking if the last character is equal or greater
      than the pad character, without iterating through the
      entire key from the beginning.
      
      I have discussed this with Alexander Barkov who suggests
      to always pad the upper bound search key in these cases,
      and to pad it to the length of the key, instead of 
      appending just a single character. This way I can use the
      existing cs->coll->strnxfrm function.
      
      In the other cases, I have checked on the last byte and
      appended 0x20 if the byte was >= 0x20.
      
      I have also added one more parameter to the makeKey methods
      to say that this is a highKey.
[20 Feb 2009 21:14] Kevin Lewis
Looks like a good soluition and the code looks good.
[2 Mar 2009 14:12] Bugs System
Pushed into 6.0.11-alpha (revid:alik@sun.com-20090302140208-lfdejjbcyezlhhjt) (version source revid:vvaintroub@mysql.com-20090223172157-jllfd81jwvak4n5j) (merge vers: 6.0.11-alpha) (pib:6)
[15 May 2009 15:55] MC Brown
A note has been added to the 6.0.11 changelog: 

Searching for 0x00 in Falcon tables using the VARBINARY column type would fail to return the correct rows. In addition, a crash could be encountered when modifying a column to the VARBINARY type.