Bug #14189 VARBINARY and BINARY variables: trailing space ignored with InnoDB
Submitted: 20 Oct 2005 20:24 Modified: 25 Jan 2006 1:37
Reporter: Peter Gulutzan
Status: Closed
Category:Server: InnoDB Severity:S3 (Non-critical)
Version:5.0.16-debug OS:Linux (SUSE 9.2)
Assigned to: Target Version:

[20 Oct 2005 20:24] Peter Gulutzan
Description:
Rule: "don't ignore end space". When comparing,
you have to take end 0x20 into account, instead
of ignoring it. This is the case with MyISAM and
VARBINARY. This is not the case with InnoDB and
VARBINARY.

This test was originally mentioned in a comment for bug#9278.
Fix should be in 5.0 only.

How to repeat:
mysql> create table th8 (s1 varbinary(2),primary key (s1)) engine=innodb;
Query OK, 0 rows affected (0.02 sec)

mysql> insert into th8 values (0x4120),(0x41);
ERROR 1062 (23000): Duplicate entry 'A' for key 1
[21 Oct 2005 0:18] Heikki Tuuri
Peter G,

I have been waiting for the following strange thing to be decided before changing the
padding behavior of InnoDB in comparisons of BINARY and VARBINARY:

mysql> create table th9 (s1 binary(2),primary key (s1)) engine=myisam;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into th9 values ('a ');
Query OK, 1 row affected (0.00 sec)

mysql> select * from th9 where s1 = 'a ';
Empty set (0.00 sec)

mysql> select * from th9 where s1 = 'a';
+----+
| s1 |
+----+
| a  |
+----+
1 row in set (0.00 sec)

mysql>

Currently, space is the padding character in comparisons of VARBINARY inside InnoDB. We
could change it in 5.1. It is an incompatible change.

Regards,

Heikki
[21 Oct 2005 3:15] Peter Gulutzan
The decision is made, "for now". I am using version 5.0.16, pulled today. I get this:

mysql> create table th9 (s1 binary(2),primary key (s1)) engine=myisam;
Query OK, 0 rows affected (0.08 sec)

mysql> insert into th9 values ('a ');
Query OK, 1 row affected (0.00 sec)

mysql>  select * from th9 where s1 = 'a ';
+----+
| s1 |
+----+
| a  |
+----+
1 row in set (0.01 sec)

mysql>  select * from th9 where s1 = 'a';
Empty set (0.00 sec)
[23 Oct 2005 20:17] Heikki Tuuri
Peter G.,

thank you for the clarification. I believe it is best to fix the behavior in 5.0.xx, even
though it is an incompatible change.

Regards,

Heikki
[1 Dec 2005 10:20] Heikki Tuuri
Assigning this to Marko. We have to determine what type->prtype these MySQL types
correspond to, and modify rem0cmp.c (?) and:

data0type.ic:
/*************************************************************************
Gets the padding character code for the type. */
UNIV_INLINE
ulint
dtype_get_pad_char(
/*===============*/
                                /* out: padding character code, or
                                ULINT_UNDEFINED if no padding specified */
        dtype_t*        type)   /* in: type */
{
        if (type->mtype == DATA_CHAR
            || type->mtype == DATA_VARCHAR
            || type->mtype == DATA_BINARY
            || type->mtype == DATA_FIXBINARY
            || type->mtype == DATA_MYSQL
            || type->mtype == DATA_VARMYSQL
            || (type->mtype == DATA_BLOB
                && (type->prtype & DATA_BINARY_TYPE) == 0)) {

                /* Space is the padding character for all char and binary
                strings, and starting from 5.0.3, also for TEXT strings. */

                return((ulint)' ');
        }

        /* No padding specified */

        return(ULINT_UNDEFINED);
}

THIS IS AN INCOMPATIBLE CHANGE. The manual must be updated. Let us hope this change makes
it to 5.0.17.

Regards,

Heikki
[1 Dec 2005 10:23] Heikki Tuuri
Marko must also check the behavior in UNIQUE indexes, data storage and retrieval (how to
pad the end of the field), and FOREIGN KEY, ON UPDATE CASCADE.
[1 Dec 2005 15:53] Marko Mäkelä
The VARBINARY column type appears in InnoDB as follows:

mtype = DATA_BINARY
prtype = 63 << 16 | DATA_BINARY_TYPE | DATA_NOT_NULL | DATA_MYSQL_TRUE_VARCHAR

The charset-collation number 63 corresponds to my_charset_bin, defined in
strings/ctype-bin.c.

I will find out how VARBINARY is mapped in MySQL/InnoDB 4.1 and make sure that my
modification will not change the treatment of those fields. This modification should only
affect "true varchar" VARBINARY columns, right?
[1 Dec 2005 17:01] Marko Mäkelä
In MySQL 4.1, a VARBINARY column has the following types in InnoDB:

mtype = DATA_FIXBINARY
prtype = 63 << 16 | DATA_BINARY_TYPE | DATA_NOT_NULL | 254

I wonder why it is DATA_FIXBINARY instead of DATA_BINARY. I'll check an earlier 5.0
version to see what the types are there.

My preliminary patch only touches dtype_get_pad_char(). If type->mtype == DATA_BINARY and
(type->prtype&DATA_MYSQL_TYPE_MASK) == DATA_MYSQL_TRUE_VARCHAR, there will be no padding.
Simple INSERT and SELECT seem to work, but I have not yet tested the other cases listed
by Heikki.
[2 Dec 2005 10:43] Heikki Tuuri
Marko,

looks like 4.1 had field->type() for VARBINARY as MYSQL_TYPE_STRING. That was not very
logical.

I guess field->type() was changed to MYSQL_TYPE_VARCHAR around 5.0.3.

The big question is how do we tell which column is VARBINARY and which is VARCHAR BINARY.
Maybe we need to store also field->real_type() to prtype? Then this would not be an
INCOMPATIBLE change, since in new tables we would simply have the collation different
than in old VARBINARY tables.

Regards,

Heikki

        switch (field->type()) {
                /* NOTE that we only allow string types in DATA_MYSQL
                and DATA_VARMYSQL */
                case MYSQL_TYPE_VAR_STRING: /* old <= 4.1 VARCHAR */
                case MYSQL_TYPE_VARCHAR:    /* new >= 5.0.3 true VARCHAR */
                                        if (field->binary()) {
                                                return(DATA_BINARY);
                                        } else if (strcmp(
                                                  field->charset()->name,
                                                 "latin1_swedish_ci") == 0) {
                                                return(DATA_VARCHAR);
                                        } else {
                                                return(DATA_VARMYSQL);
                                        }
                case MYSQL_TYPE_BIT:
                case MYSQL_TYPE_STRING: if (field->binary()) {

                                                return(DATA_FIXBINARY);
                                        } else if (strcmp(
                                                   field->charset()->name,
                                                   "latin1_swedish_ci") == 0) {
                                                return(DATA_CHAR);
                                        } else {
                                                return(DATA_MYSQL);
                                        }
[2 Dec 2005 11:02] Heikki Tuuri
Marko,

looks like also the padding rules in storage have changed for BINARY and VARBINARY. We
have to fix also them.

With a quick look at the MySQL code, it might be that only BINARY and VARBINARY have
my_charset_bin as the charset-collation. Then we could identify these types from the
charset-collation. There is no field->realtype() that would designate BINARY and
VARBINARY.

Regards,

Heikki

http://dev.mysql.com/doc/refman/5.0/en/binary-varbinary.html

11.4.2. The BINARY and VARBINARY Types
The BINARY and VARBINARY types are similar to CHAR and VARCHAR, except that they contain
binary strings rather than non-binary strings. That is, they contain byte strings rather
than character strings. This means that they have no character set, and sorting and
comparison is based on the numeric values of the bytes in column values. 

The allowable maximum length is the same for BINARY and VARBINARY as it is for CHAR and
VARCHAR, except that the length for BINARY and VARBINARY is a length in bytes rather than
in characters. 

The BINARY and VARBINARY data types are distinct from the CHAR BINARY and VARCHAR BINARY
data types. For the latter types, the BINARY attribute does not cause the column to be
treated as a binary string column. Instead, it causes the binary collation for the column
character set to be used, and the column itself contains non-binary character strings
rather than binary byte strings. For example CHAR(5) BINARY is treated as CHAR(5)
CHARACTER SET latin1 COLLATE latin1_bin, assuming that the default character set is
latin1. This differs from BINARY(5), which stores 5-bytes binary strings that have no
character set or collation. 

When BINARY values are stored, they are right-padded with the pad value to the specified
length. The pad value and how it is handled is version specific: 

As of MySQL 5.0.15, the pad value is 0x00 (the zero byte). Values are right-padded with
0x00 on insert, and no trailing bytes are removed on select. All bytes are significant in
comparisons, including ORDER BY and DISTINCT operations. 0x00 bytes and spaces are
different in comparisons, with 0x00 < space. 

Example: For a BINARY(3) column, 'a ' becomes 'a \0' when inserted. 'a\0' becomes 'a\0\0'
when inserted. Both inserted values remain unchanged when selected. 

Before MySQL 5.0.15, the pad value is space. Values are right-padded with space on
insert, and trailing spaces are removed on select. Trailing spaces are ignored in
comparisons, including ORDER BY and DISTINCT operations. 0x00 bytes and spaces are
different in comparisons, with 0x00 < space. 

Example: For a BINARY(3) column, 'a ' becomes 'a  ' when inserted and 'a' when selected.
'a\0' becomes 'a\0 ' when inserted and 'a\0' when selected. 

For VARBINARY, there is no padding on insert and no bytes are stripped on select. All
bytes are significant in comparisons, including ORDER BY and DISTINCT operations. 0x00
bytes and spaces are different in comparisons, with 0x00 < space. (Exceptions: Before
MySQL 5.0.3, trailing spaces are removed when values are stored. Before MySQL 5.0.15,
trailing 0x00 bytes are removed for ORDER BY operations.) 

For those cases where trailing pad bytes are stripped or comparisons ignore them, if a
column has an index that requires unique values, inserting into the column values that
differ only in number of trailing pad bytes will result in a duplicate-key error.
[2 Dec 2005 11:06] Heikki Tuuri
I guess VARCHAR BINARY an CHAR BINARY still have to be padded with spaces in comparison.
We have to test it with MyISAM.
[2 Dec 2005 13:53] Marko Mäkelä
Hmm, interestingly MySQL 4.1 silently maps VARBINARY(2) to BINARY(2). Also, 4.1 pads
BINARY columns with 0x20, while 5.0 pads them with 0x00.

There's still this interesting thing:

create table th8w (s1 varchar(2) binary,primary key (s1)) engine=myisam;
insert into th8w values(0x41),(0x4120);
-- ERROR 1062 (23000): Duplicate entry 'A ' for key 1
insert into th8w values(0x41),(0x4100);
-- ERROR 1062 (23000): Duplicate entry 'A' for key 1

Same thing with CHAR(2) BINARY on ENGINE=MyISAM. Now, how many pad characters are there?
Shouldn't there be just one for a given data type? Or should CHARACTER SET latin1 COLLATE
latin1_bin [which BINARY apparently maps to] really group all non-printable characters in
the same equivalence class?

If ENGINE=InnoDB is specified and my patch for fixing the original problem is included,
(0x41),(0x4100) gets through for VARCHAR(2) BINARY, and (0x41),(0x4120) gets through for
CHAR(2) BINARY. I didn't test these with unmodified 5.0 source yet.
[2 Dec 2005 17:35] Heikki Tuuri
Peter G,

what happens if one defines a CHAR or VARCHAR column with my_charset_bin as the
charset-collation? Is this possible? Should it be padded with 0x00 or 0x20 in
comparison?

How do we internally recognize a field that was defined by the user as BINARY or
VARBINARY, so that we do not mix it to a CHAR or VARCHAR column?

You may need to ask from Bar about this.

Regards,

Heikki
[9 Dec 2005 0:26] Heikki Tuuri
Bar confirmed that my_charset_bin is a sure way to identify BINARY(n) and VARBINARY(n)
data types.
[12 Dec 2005 13:06] Marko Mäkelä
How should the following work:

create table t1 (a int primary key,s1 varbinary(2) not null unique) engine=innodb;
create table t2 (s1 binary(2) not null, constraint c foreign key(s1) references t1(s1) on
update cascade) engine=innodb;

insert into t1 values(1,0x4100),(2,0x41),(3,0x4120);
insert into t2 values(0x41); -- apparently equivalent to 0x4100

Currently, the latter INSERT will fail if the record (2,0x41) exists in t1.

This problem does not exist with varchar(2) and char(2) in binary collation, because 0x41
and 0x4120 as varchar(2) binary are treated equal.
[12 Dec 2005 17:08] Marko Mäkelä
My upcoming patch also removes padding from BINARY(n) columns. That removes a bug in ON
UPDATE CASCADE when a BINARY(n) column references a VARBINARY(m) column.
[13 Dec 2005 22:40] Marko Mäkelä
The fix causes the following incompatible change:

Tables created with MySQL 4.1 or 5.0 that contain indexes on VARBINARY columns with
varying content length need to be dumped and reloaded. Note that MySQL 4.1.x and early
versions of MySQL 5.0 stripped trailing 0x20 bytes from VARBINARY columns. It is the
stored length that matters here.

Before 5.0.18 (or whichever version the fix will appear in), VARBINARY columns were
padded with 0x20 bytes for comparison. After the fix, VARBINARY columns will not be
padded.

The fix also affects the treatment of the BINARY(n) columns in foreign key constraints.
Previously, a BINARY(n) column could refer to a BINARY(m) column or to a VARBINARY column
value of length smaller than n. After the fix, BINARY(n) can only refer to BINARY(n) or to
VARBINARY values that are exactly n bytes long.
[16 Jan 2006 13:17] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/1135
[17 Jan 2006 17:19] Alexander Ivanov
Fixed in 5.0.19.
[25 Jan 2006 1:37] Jon Stephens
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html

Additional info:

Documented bugfix/feature change in 5.0.19 changelog, updated InnoDB and Datatypes
chapters of 5.0 Manual. Closed.
[18 May 2006 3:31] James Day
Marko,

Did this change cause any change in the data storage inside the InnoDB tablespaces on
disk?

MySQL Support gets questions about downgrading from say 5.0.21 to 5.0.17, so we need to
have any file format backwards compatibility issues documented somewhere so we can give a
correct answer.