Bug #22052 Trailing spaces are not removed from UNICODE fields in an index
Submitted: 6 Sep 2006 13:04 Modified: 14 Dec 2006 3:23
Reporter: Paul McCullagh (Basic Quality Contributor) (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1, 5.0 OS:
Assigned to: Alexander Barkov CPU Architecture:Any
Tags: compression, INDEX, spaces, Unicode

[6 Sep 2006 13:04] Paul McCullagh
Description:
When an index is created on a UNICODE field, the values stored in the index include the trailing spaces. For example, if you store the value "abc" in a CHAR(200) UNICODE column, the index entry for the value takes 399 bytes, instead of 7.

As a result the disk space required by the index is much larger than it needs to be.

How to repeat:
1. Create the following table:

drop table if exists t1;
create table t1 (
 id     int,
 utext  char(200) unicode not null,
 index (utext)
);

2. Now set a breakpoint in the file mi_key.c, line 100:

 99:     FIX_LENGTH(cs, pos, length, char_length);
100:      store_key_length_inc(key,char_length);
101:      memcpy((byte*) key,pos,(size_t) char_length);

3. Now insert a row:

insert t1 values (1, "abc");

4. When the breakpoint is hit, note that 'length' and 'char_length' are set to 399.
 

Suggested fix:
The code:

      if (type != HA_KEYTYPE_NUM)
      {
	while (end > pos && end[-1] == ' ')
	  end--;
      }
      else
      {
	while (pos < end && pos[0] == ' ')
	  pos++;
      }

does not remove the trailing spaces because the code assumes a single-byte charset. This should be replaced with a charset specific "rtrim" function call.
[8 Sep 2006 9:19] Sveta Smirnova
Thank you for the report.

Verified as described on Linux using last BK sources.
[14 Sep 2006 14:45] Trudy Pelzer
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

CHAR values are fixed-size. The MySQL Reference Manual says:
"The length of a CHAR column is fixed to the length that you declare when you create the table. The length can be any value from 0 to 255. When CHAR  values are stored, they are right-padded with spaces to the specified length."
For more detail, please see:
http://dev.mysql.com/doc/refman/5.1/en/char.html
[14 Sep 2006 19:02] Trudy Pelzer
My apologies; the CHAR behaviour I described above
is for data storage and not index storage.
[18 Sep 2006 6:51] Alexander Barkov
This bug is the reason of wrong ORDER BY for an indexed column:

drop table if exists t1;
create table t1 (
  a char(10) unicode not null,
  index a (a)
);
insert into t1 values (repeat(0x201f, 10));
insert into t1 values (repeat(0x2020, 10));
insert into t1 values (repeat(0x2021, 10));
select hex(a) from t1 order by a; -- wrong result:

+------------------------------------------+
| hex(a)                                   |
+------------------------------------------+
| 2020202020202020202020202020202020202020 |
| 201F201F201F201F201F201F201F201F201F201F |
| 2021202120212021202120212021202120212021 |
+------------------------------------------+
3 rows in set (0.01 sec)

alter table t1 drop index a;

select hex(a) from t1 order by a;  -- correct result:

+------------------------------------------+
| hex(a)                                   |
+------------------------------------------+
| 201F201F201F201F201F201F201F201F201F201F |
| 2020202020202020202020202020202020202020 |
| 2021202120212021202120212021202120212021 |
+------------------------------------------+
3 rows in set (0.00 sec)

The reason of wrong order of 0x2020 ("DAGGER")
is in the mentioned code - all trailing 0x20
bytes are incorrectly stripped. 

Suggested fix: strip only 0x0020 ("SPACE") sequences.
[18 Sep 2006 10:31] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/12118

ChangeSet@1.2528, 2006-09-18 15:29:29+05:00, bar@mysql.com +4 -0
  Bug#22052 Trailing spaces are not removed from UNICODE fields in an index
  Fix: using charset-aware functions cs->cset->lengthsp() and cs->cset->fill()
  instead of single byte code which is not UCS2 compatible.
[26 Sep 2006 10:30] Sergei Glukhov
ok to push
[3 Oct 2006 9:12] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/12987

ChangeSet@1.2554, 2006-10-03 14:11:57+05:00, bar@mysql.com +4 -0
  Bug#22052 Trailing spaces are not removed from UNICODE fields in an index
    Fix: using charset-aware functions cs->cset->lengthsp() and cs->cset->fill()
    instead of single byte code which is not UCS2 compatible.
[11 Oct 2006 10:52] Alexander Barkov
Pushed into 5.0-rpl, marked as 5.0.26
[12 Oct 2006 9:16] Alexander Barkov
Pushed into 5.1-rpl, marked as 5.1.12
[16 Nov 2006 16:45] Alexander Barkov
Appeared in 4.1.23 common
Appeared in 5.0.32 common
Appeared in 5.1.14 common

Note for the doc team:

Please mention in ChangeLog this information:

Those using "CHAR(N) CHARACTER SET ucs2" columns
must recreate indexes after upgrade.
[14 Dec 2006 3:23] Jon Stephens
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

Documented bugfix for 4.1.23, 5.0.32, and 5.1.14.
[9 Jan 2007 19:48] Christian Hammers
Reading 
        [16 Nov 2006 17:45] Alexander Barkov
        "Those using "CHAR(N) CHARACTER SET ucs2" columns
        must recreate indexes after upgrade."
I wonder how that can be ensured on semi automated upgrades
like distributions providing mysql packages have to face it.

Does mysql_upgrade_shell fixes this problem or at least mentions it? 
What impact does not recreating have, will the bug just reappear or will something crash? :)

bye,

-christian-
[15 Jan 2007 14:35] Alexander Barkov
Christian,

>Does mysql_upgrade_shell fixes this problem or at least mentions it? 

Sorry, no. You need to run "ALTER TABLE t1 ENGINE=XXX" for all affected
tables to rebuild indexes.

>What impact does not recreating have, will the bug just reappear or will
>something crash? :)

It will most likely report "table t1 is corrupted, try to repair it".
[1 Feb 2007 2:46] James Day
Alexander, please confirm that only MyISAM tables are affected by this bug.

Jon, once confirmed, please do the usual modification to the changelog to add MyISAM to the description so people know which engines are affected.
[1 Feb 2007 4:54] Alexander Barkov
James, right, only MyISAM tables are affected.