Bug #2137 server considers 'y' and 'ü' the same
Submitted: 17 Dec 2003 2:55 Modified: 19 Dec 2003 4:36
Reporter: Werner Stuerenburg Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: MyISAM storage engine Severity:S2 (Serious)
Version:4.0.15, 4.1.1 OS:Windows (Win2K)
Assigned to: Alexander Barkov CPU Architecture:Any

[17 Dec 2003 2:55] Werner Stuerenburg
Description:
By chance, a prospect found a bug through introduction of a unique empoyee code consisting of 2 characters. The codes 'dy' and 'dü' lead to a unique key error...

BTW: The category system is not appropriate here, as this applies to both MyISAM and InnoDB.

How to repeat:
DROP TABLE IF EXISTS `test_y`;
CREATE TABLE test_y (
  id smallint(5) unsigned NOT NULL auto_increment,
  val varchar(5) NOT NULL default '',
  PRIMARY KEY  (id,val),
  UNIQUE KEY val (val)
) TYPE=MyISAM COMMENT='test for dü-dy';

# or ) TYPE=MyISAM COMMENT='test for dü-dy';
# or ) TYPE=InnoDB DEFAULT CHARSET=latin1 COMMENT='test for dü-dy';
# as the case may be

#
# Dumping data for table `test_y`
#

INSERT INTO test_y VALUES (1, 'dy');
INSERT INTO test_y VALUES (1, 'dü');

# select version() -> 4.1.1-alpha-nt-log
# you will get Duplicate entry '1-dü-1' for key 1
[18 Dec 2003 8:48] Dean Ellis
INSERT INTO test_y VALUES ( null, x'6479' ), (null, x'64FC');

I am verifying this in the event that the latin1 character set should handle these values.  Otherwise, it is a character set issue (and not a bug), as latin2 or utf8 and such do not have a problem with it.
[18 Dec 2003 11:47] Alexander Barkov
Werner, Dean, in 4.1  the default collation for latin1 is latin1_swedish_ci, 
and according to Swedish rules 'y' and 'ü' are equal.
If you don't like Swedish rules, you may try another collation,
for example latin1_german1_ci:

CREATE TABLE test_y (
  id smallint(5) unsigned NOT NULL auto_increment,
  val varchar(5) NOT NULL default '',
  PRIMARY KEY  (id,val),
  UNIQUE KEY val (val)
) TYPE=MyISAM CHARSET=latin1 COLLATE latin1_german1_ci;
[19 Dec 2003 1:43] Werner Stuerenburg
as you could see, on my system, I'm using the character set latin1, but the customer told me, he was using a German character set, and if I'm not mistaken, it's called something like latin_de; anyway, I don't know about character sets, but I do know that 'y' is a common character contained in the 7 bit character set, whereas 'ü' is not, so I would really like to know about any character set capable of confusing these two -- should be really amazing, right? As far as I know, all 7 bit characters are identical in all character sets -- ASCII, but then, things might be much more complicated than I can imagine.
[19 Dec 2003 1:53] Werner Stuerenburg
Alexander, thanks a lot for pointing this out: from a systematic point of view, I'd say it is not acceptable to consider two different characters as equal. I don't know much about the Swedish sorting order, but I guess Swedish people would be confused if those two characters would be mixed up in any odd order. By the way, how do you implement this "consider different things as equal"?
[19 Dec 2003 2:38] Alexander Barkov
Werner, this is what Swedish and Finnish people do expect.
Please take a look here:
  http://www.genealogia.fi/faq/faq013e.htm

In German installation those letters are not the same.
There are two German rules implemented in MySQL, according to DIN1 and DIN2.
Note, character sets and collations implementation was significantly changed,
and names are different in 4.0 and 4.1.

They are:
  - german1 and latin1_de in 4.0
  - latin1_german1_ci and latin1_german2_ci  in 4.1.

Answering to "who do we sort different characters the same":
We use arrays that assign sorting weights to characters.
For Swedish sorting  'y' and 'ü' just have the same weights.
[19 Dec 2003 2:52] Alexander Barkov
Sorry for the typo, "who do we sort" -> "how do we sort".
[19 Dec 2003 4:21] Werner Stuerenburg
In my understanding, this issue can be closed. It's not a bug, it's a feature! ;-)
[19 Dec 2003 4:36] Alexander Barkov
I have already done it :)
The status of this ticket is "Not a Bug".