Bug #23692 Falcon: searches fail if data is 0x00
Submitted: 26 Oct 2006 22:49 Modified: 15 May 17:55
Reporter: Peter Gulutzan
Status: Closed
Category:Server: Falcon Severity:S3 (Non-critical)
Version:5.1.13-falcon-alpha-pb-debug OS:Linux (SUSE 10.0 / 64-bit)
Assigned to: Lars-Erik Bjørk Target Version:
Tags: INDEX, F_ENCODING
Triage: Triaged: D2 (Serious)

[26 Oct 2006 22:49] Peter Gulutzan
Description:
I create a Falcon table with a character column with an index.
I insert 0x00.
I search for < ''.
I see 0 rows.

How to repeat:
ChangeSet@1.2346, 2006-10-26 10:08:37-04:00

mysql> create table t62 (s1 char(1),key(s1)) engine=falcon;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t62 values (0x00);
Query OK, 1 row affected (0.00 sec)

mysql> select * from t62 where s1<'';
Empty set (0.00 sec)
[27 Oct 2006 23:16] Hakan Kuecuekyilmaz
Verified on Linux 32-bit, change set 1.2348, 2006-10-27.

Added test case falcon_bug_23692.test to 5.1-falcon tree.

Regards, Hakan
[19 Dec 2006 19:05] Peter Gulutzan
Another test case:

mysql> create table tv (s1 varbinary(5)) engine=falcon;
Query OK, 0 rows affected (0.01 sec)

mysql> create index iv on tv (s1);
Query OK, 0 rows affected (0.02 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> insert into tv values (0x02),(0x0202);
Query OK, 2 rows affected (0.00 sec)
Records: 2  Duplicates: 0  Warnings: 0

mysql> select hex(s1) from tv where s1 >= 0x02 or s1 = 0x0202;
+---------+
| hex(s1) |
+---------+
| 02      |
+---------+
1 row in set (0.01 sec)
[20 Dec 2006 1:46] Hakan Kuecuekyilmaz
Verified latest comment from Peter Gulutzan on Linux 32-bit and added additional test to
test case falcon_bug_23962.test and pushed it to 5.1-falcon tree.

Regards, Hakan
[21 Dec 2006 17:35] Kevin Lewis
Two weeks ago, I fixed BUG 24861 (Incorrect handling af varbinary fields) by making a
special case for a BINARY collation.  The situation was the old falcon_bug_137 test
within falcon_bugs.

# Mantis bug #137: Varbinary use fails if index exists
create table t4 (s1 varbinary(3) primary key);
insert into t4 values (0x41),(0x4100),(0x410000);

This test would fail because the Falcon code was trimming the string of trailing spaces
and nullbytes.  I questioned whether a varbinary should have a CHARSET anyway, but made a
special case for when it does.

Now, since the most recent merge with the mysql-5.1 tree, the CHARSET is no longer
associated with varbinary.  Varbinary fields are given the type string when the field is
defined thru the storage engine handler.  Without the CHARSET, I am not sure how to tell
the difference between a varchar and a varbinary.

Varbinary fields are being declared as type string, and bytes between 0x00 and 0x20 have
special meaning.  This is why this test fails.   

mysql> create table t4 (s1 varbinary(3) primary key);
mysql> insert into t4 values (0x41),(0x4100),(0x4120),(0x4121),(0x4141);
mysql> select hex(s1) from t4 where s1 > 0x41;
+---------+
| hex(s1) |
+---------+
| 4121    |
| 4141    |
+---------+
2 rows in set (0.00 sec)

In order to fix this, I need to figure out how to tell the difference between a varbinary
field with no CHARSET and a varchar with no CHARSET.
[10 May 2007 5:17] Kevin Lewis
This problem has no good solutions.  Either we get inaccurate results when looking for
values that have ending characters less than the space character, or we get a general
drop in performance.

The problem stems from the fact that Falcon uses prefix compression.  In order to do this
efficiently, all trailing blanks must be trimmed off.  In SQL, trailing blanks are only
assumed or implied, but when Falcon is sent strings from the server engine, they are
real.  The server explicitly pads all search strings with spaces up to the maximum byte
length that those strings could psssibly be.

So the Falcon engine is in the business of stripping trailing blanks off the end of
search strings in order to find them in the index.  

Consider the following strings;
CREATE TABLE t3 (a char(5), key (a));
INSERT INTO t3 VALUES (0x4200), (0x4209), (0x42), (0x4242);

According to SQL, they should be sorted in the order entered.  But Falcon stores (0x4200)
and (0x42) together as the same value, befor (0x4209).  So;
SELECT hex(a) FROM t3 WHERE a < 0x42;  
   returns nothing, but should return 0x4200 & 0x4209

Falcon takes the string with the implied trailing value of 0x20 and sorts it as if it had
a trailing value of 0x00, so trailing values between 0x00 and 0x20 are not found for a
less than search.  

If Falcon started treating the high value of a search range, like values < 'b', as if it
was looking for values < 'b ', then  it would return from the index all the records where
the value  = 'b'.  This is because the code that finds and returns hits in an index has no
knowledge of the type of data in the comparison.  All key entries are sortable bytewise,
left to right, so searching can be done quickly.  This solution, appending a space to a
search string, will find too many hits, and waste too much time checking those hits
against the record.  This would be especially bad in an index that had thousands of key
values = 'b'.  In other words, large numbers of duplicates would have a VERY unoptimized
search.  We cannot do thisjust to find  values with trailing character > 0 and < space. 
It is not worth it.

Another solution is to store at least one training space character after each string that
ended with a hex value of space or higher.  So the values above would be stored as 0x42,
0x4209, 0x4220 & 0x424220.  This would cause them to be sorted correctly, but it would
add one or two bytes to almost every non-duplicate key page node, depending if the
character set is collated as unicode.  This is also a performance problem, increasing the
size of the index.

So I am in a quandary as to how this particular bug can be fixed without causing a
general performance hit, which is not allowed!!!
[14 May 2007 17:32] Kevin Lewis
As may be implied from the previous entry, this bug may not be fixable without a
performance hit.  After discussions with Jim Starkey, we do not think this bug is worth
fixing if it causes a performance hit.  So I would like to mark this 'Won't Fix'.

We should document that a search for a string that ends with a character who's sort
weight is greater than zero and less than the implied SQL pad character (space), will
work using '=' and the explicit key value.  But other searches will be problematic.
[14 May 2007 23:38] Kevin Lewis
The code that I checked in last week does not work as I expected for this test.  My
intention was to be able to find the 0x4109 with a like 'a%' search.  As I look at the
fix for that, I think I might be able to use the implied pad character on the highKey of
a range search in order to always return the correct records.

Putting this back to 'In Progress' to see if I can make a solution that works without
slowing Falcon down.
[16 May 2007 18:33] Kevin Lewis
Pushed an enhanced testcase, which still does not work.

I looked into adding a single pad character to the highKey so that it is positioned after
all the values with trailing characters between 0x00 and the pad character. But this would
for the inclusion of the highKey hits even if the search is field < 'value'.  I thought
this would cause a performance problem, but them I found that the end points of ALL range
searches are already being included even for 'less than' and 'greater than' searches where
the end points are not needed.  

Jim indicated that this is necessary for indexes that contain numeric fields that may
contain 'round off' differences from the exact stored value.  Falcon indexes arre
designed to be less exact in resolving hits.  So the final resolve is always done against
the records.  

Since the endpoints are always included in every range search, it causes no harm to add
one pad character to the highKey during an index scan.  So I will pursue this solution.
[17 May 2007 21:42] Kevin Lewis
We will fix this after the beta release since it affect some very commonly called code; 
scanIndex().  And the result is the ability to retrieve records using a < search when the
key matches the search key but with low bytes appended, such as a tab.  In other words,
the benefit is rarely needed, but the changed code is used all the time.  So the risk is
much higher than the benefit.
[26 Jun 2008 18:33] Peter Gulutzan
On 2007-05-17 Kevin Lewis marked this "to be fixed later"
saying "We will fix this after the beta release ...".
Since it is now after the beta release, I set it back to
"Verified".
[27 Jun 2008 0:01] Kevin Lewis
Another character-set related bug.
[26 Sep 2008 18:59] John H. Embretsen
Test case for this bug. Moved from falcon_team suite. Will fail until bug is fixed.

Attachment: falcon_bug_23692.test (application/octet-stream, text), 1.39 KiB.

[26 Sep 2008 19:00] John H. Embretsen
Test result file associated with the attached falcon_bug_23692.test.

Attachment: falcon_bug_23692.result (application/octet-stream, text), 1.51 KiB.

[20 Feb 13:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/67022

3030 lars-erik.bjork@sun.com	2009-02-20
      This is a patch for bug#23692 (Falcon: searches fail if data is 0x00)
      
      The solution is to append a pad key to the upper 
      bound search key, if its last character is equal
      to or greater than the pad character. This is done
      In order to make it position after all values with 
      trailing characters lower than the pad character.
      
      For fields with a collation registered
      [if (field->collation)], there is no efficient way
      of checking if the last character is equal or greater
      than the pad character, without iterating through the
      entire key from the beginning.
      
      I have discussed this with Alexander Barkov who suggests
      to always pad the upper bound search key in these cases,
      and to pad it to the length of the key, instead of 
      appending just a single character. This way I can use the
      existing cs->coll->strnxfrm function.
      
      In the other cases, I have checked on the last byte and
      appended 0x20 if the byte was >= 0x20.
      
      I have also added one more parameter to the makeKey methods
      to say that this is a highKey.
[20 Feb 22:14] Kevin Lewis
Looks like a good soluition and the code looks good.
[2 Mar 15:12] Bugs System
Pushed into 6.0.11-alpha (revid:alik@sun.com-20090302140208-lfdejjbcyezlhhjt) (version
source revid:vvaintroub@mysql.com-20090223172157-jllfd81jwvak4n5j) (merge vers:
6.0.11-alpha) (pib:6)
[15 May 17:55] MC Brown
A note has been added to the 6.0.11 changelog: 

Searching for 0x00 in Falcon tables using the VARBINARY column type would fail to return
the correct rows. In addition, a crash could be encountered when modifying a column to
the VARBINARY type.