Bug #34232 Unexpected results querying varchar ucs2 character set and ucs2_bin collation
Submitted: 1 Feb 2008 14:31 Modified: 29 Sep 2008 6:59
Reporter: Troy Pearson Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.27 OS:Windows (Windows XP Professional)
Assigned to: CPU Architecture:Any
Tags: ucs2, usc2_bin, varchar

[1 Feb 2008 14:31] Troy Pearson
Description:
After inserting characters into a table with
varchar datatype
ucs2 character set
ucs2_bin collation
and index on the field

a LIKE query does not return the expected results.

This is very similar to bug number 14583 but the test case in that bug is indeed fixed and works fine.

The differences with this test case area
the characters in the field are not ASCII
the query returns one of the expected two records instead of none

How to repeat:
Note: since this page does not accept Unicode characters, the ? in the text below needs to be replaced with the actual values for the character.  They show up as square boxes in the Query Analyzer and not question marks.

The value for the first ? which is a 'T' with it's bits swapped is
0xFFAB  (T = 0x0054)  

The value for the second ? which is an 'R' with it's bits swapped is
0xFFAD  (R = 0x0052)

DROP TABLE IF EXISTS TestTable;

CREATE TABLE TestTable (
  Name varchar (255) CHARACTER SET ucs2 COLLATE ucs2_bin NOT NULL,
  CONSTRAINT MY_UNQ_UK UNIQUE(Name)
) ENGINE=InnoDB;

INSERT INTO TestTable(Name) VALUES('?');
INSERT INTO TestTable(Name) VALUES('??');

-- only the first record is returned
SELECT Name FROM TestTable WHERE Name LIKE '?%';

-- DROP INDEX MY_UNQ_INDEX
DROP INDEX MY_UNQ_UK ON TestTable;

-- both records are returned
SELECT Name FROM TestTable WHERE Name LIKE '?%';

Suggested fix:
The only workarounds we have found are to remove the index or use a query like
WHERE Name COLLATE ucs2_bin LIKE '?%'

Both of which are too slow to be usable in production.

We have complete control over the table so any other field types or collations that might work with this data should be acceptable.
[1 Feb 2008 14:33] Troy Pearson
The Unicode file that allows cut/paste to the Query Analyzer

Attachment: mySQLQueries.txt (text/plain), 1024 bytes.

[1 Feb 2008 14:34] Troy Pearson
I found the attachment feature and have attached the Unicode file.
[1 Feb 2008 16:33] MySQL Verification Team
Thank you for your bug report. This issue has been committed to our source repository of that product and will be incorporated into the next release.

If necessary, you can access the source repository and build the latest available version, including the bug fix. More information about accessing the source trees is available at

    http://dev.mysql.com/doc/en/installing-source.html

c:\dbs>5.0\bin\mysql -uroot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.0.56-nt Source distribution

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> CREATE TABLE TestTable (
    ->   Name varchar (255) CHARACTER SET ucs2 COLLATE ucs2_bin NOT NULL,
    ->   CONSTRAINT MY_UNQ_UK UNIQUE(Name)
    -> ) ENGINE=InnoDB;
Query OK, 0 rows affected (0.17 sec)

mysql>
mysql> INSERT INTO TestTable(Name) VALUES('?');
Query OK, 1 row affected (0.03 sec)

mysql> INSERT INTO TestTable(Name) VALUES('??');
Query OK, 1 row affected (0.06 sec)

mysql> SELECT Name FROM TestTable WHERE Name LIKE '?%';
+------+
| Name |
+------+
| ?    |
| ??   |
+------+
2 rows in set (0.05 sec)

mysql>
[2 Feb 2008 14:10] Troy Pearson
If the characters are inserted correctly to reproduce the bug you should see this on the SELECT using the command line client.
mysql> select * from TestTable;
+--------+
| Name   |
+--------+
| n+    |
| n+n+ |
+--------+
2 rows in set (0.09 sec)

Notice the characters are not question marks but are encrypted characters.  It doesn't appear possible to enter these characters from the command prompt because when the insert statement is pasted from the Unicode document into the command prompt, the characters change to question marks.

In order to reproduce the problem the statements need to be pasted from the Unicode document into the Query Browser.

I've attached a ZIP file with the two screen images in it.
[2 Feb 2008 14:11] Troy Pearson
Screen shots showing incorrect behavior in Query Browser

Attachment: mySQL_Screen.ZIP (application/x-zip-compressed, text), 100.15 KiB.

[4 Feb 2008 14:16] Susanne Ebrecht
To handle character sets correct, you have to make sure of the following:

1) Your column should use the right character set. For example euckr
You can see it with:
mysql> show create table YOUR_TABLE_NAME;

2) for inserting data, your character_set_client variable and your input environment
should have the same encoding.

For example:
Make sure that your terminal is set to EUC-KR, if the character_set_client=euckr.
Now, you can insert the data.

For example: if your terminal is utf8 and your column has euckr.
Then make first:
mysql> set names utf8;
This occurs that character_set_client and some other variables will set to utf8.
Now you can insert the data by using utf8 and the system will handle, that they will
changed from utf8 to euckr. Which means, they are stored by using euckr at the table.

3) for selecting data, it's the same as for inserting data.
Look, which encoding your output environment need. Set the environment to euckr or use:
set names CHARACTER_SET_OF_YOUR_OUTPUT_ENVIRONMENT

The difference between input and output are, that you need other variables. 
For example: you need character_set_results for output but it's not necessary for input.
"Set names" will set all necessary variables to the right values.

By using this rules, you can be sure, your data are stored in the right way at the
database and you won't get problems.

If you get a weird output by using this rules, you can be sure, your stored data were
stored in the wrong way. For repairing this, it's necessary to dump the database, change
the wrong data manually at the dump and import the dump again.

MySQL 5.0.27 is very old. Please upgrade to our newest version MySQL 5.0.51a, when you still have problems by following this rules.

Please, let us know, if you still have problems by using the rules and the newest version.
[4 Feb 2008 21:51] Troy Pearson
Thank you for the information.  I could use a little more help in setting the parameters for the MySQL Query Browser.  If I can understand how this works from there then I think I'll be able to fix the data loading from my application.  My application loads the data from a text file using the mysqlimport utility.

These are the parameters that are currently set on the table and the database.

mysql> show create table TestTable;
| TestTable | CREATE TABLE `testtable` (
  `Name` varchar(255) character set ucs2 collate ucs2_bin NOT NULL,
  UNIQUE KEY `MY_UNQ_UK` (`Name`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 |

mysql> show variables like 'c%';
character_set_client     | latin1
character_set_connection | latin1
character_set_database   | utf8
character_set_filesystem | binary
character_set_results    | latin1
character_set_server     | utf8
character_set_system     | utf8
collation_connection     | latin1_swedish_ci
collation_database       | utf8_general_ci
collation_server         | utf8_general_ci

2) for inserting data, your character_set_client variable and your input environment should have the same encoding.

My input environment is the Query Browser and I want to input non-latin1 data from it.  Do I need to set the character_set_client to utf8 somehow or can I issue something like 'set names utf8' as the first thing when I open the Query Browser to make sure my input and results are set correctly?  I'm also slightly confused on if that should be set to ucs2 instead of utf8 since the column is declared as ucs2 type.

I attempted to load the data into a VARBINARY field instead of VARCHAR but any attempts to query it with wildcards like % also fail.

I have found that if I leave the 'collate ucs2_bin' off of the definition for the column then the wildcard queries appear to work but I don't know what that means.

Upgrading database versions should not be an issue but I would like to find out the proper steps necessary to execute the sample in the attached text file from the Query Browser in case it will also work on this version.
[29 Sep 2008 6:59] Susanne Ebrecht
Many thanks for your feedback. As I told before you have to set character_set_client, character_set_connection and character_set_results to the character set your environment is using. Means your file is using. You can set these three variables by using 

SET NAMES <charset>;

Figure out which encoding your file is using. Then figure out which encoding your Windows is using.

You can figure out which encoding your Windows is using by using:
CHCP
in a command prompt.

When you want to upload a file then you have to figure out the file encoding and use SET NAMES to set the character set to the encoding of the file before you upload the file.

Anyway this is not a bug. Bug system is not the appropriate forum for asking help on using MySQL products. Your problem is not the result of a bug.

Support on using our products is available both free in our forums at http://forums.mysql.com/ and for a reasonable fee direct from our skilled support engineers at http://www.mysql.com/support/

Thank you for your interest in MySQL.