Bug #9269 bug in characterset / collation
Submitted: 18 Mar 2005 10:20 Modified: 26 Apr 2005 0:30
Reporter: [ name withheld ] Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Command-line Clients Severity:S3 (Non-critical)
Version:4.1.10a OS:Windows (windows 2k)
Assigned to: Alexander Barkov CPU Architecture:Any

[18 Mar 2005 10:20] [ name withheld ]
Description:

Using characterset utf8 with collation utf8_spanish2_ci, the varchar 'sr' ar equal to 'srr'.

I think that it's a Bug in collation rules.

NOTE:

I'm using utf8_spanish2_ci because utf8_spanish_ci also have a similar bug: vocals acuted (by example "a" and "á") MySQL say equals! but they are distinct!
And latin1_spanish_ci also have a similar bug: n and n-tilde ("ñ"), MySQL say equals! but they are distinct!

How to repeat:

My server:

server version: 4.1.10a-nt

My configuration (only this changes in my.ini):

[client]
default-character-set=utf8

[mysqld]
default-character-set=utf8
default-collation=utf8_spanish2_ci

Test my configuration:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value            |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| character_sets_dir       | C:\Archivos de programa\MySQL\MySQL Server 4.1\share\charsets/ |
+--------------------------+----------------------------------------------------------------+

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+------------------+
| Variable_name        | Value            |
+----------------------+------------------+
| collation_connection | utf8_general_ci  |
| collation_database   | utf8_spanish2_ci |
| collation_server     | utf8_spanish2_ci |
+----------------------+------------------+

The bug:

mysql> CREATE DATABASE test DEFAULT CHARACTER SET utf8 COLLATE utf8_spanish2_ci;

mysql> USE test;

mysql> CREATE TABLE t (c VARCHAR(4)) DEFAULT CHARACTER SET utf8 COLLATE utf8_spanish2_ci TYPE=InnoDB;

mysql> INSERT INTO t VALUES('sr');

mysql> SELECT * FROM t WHERE c = 'srr';
+------+
| c    |
+------+
| sr   |
+------+

BUG: 'sr' <> 'srr' !!! the result must be empty!!!

Suggested fix:

I think that it's a Bug in collation rules.

Fix the spanish collation in utf8 and latin1.

The support of spanish charset/collation it's poor.
[24 Mar 2005 4:33] Jorge del Conde
Thank you for your bug report.  I verified this bug using 4.1.11 from bk
[4 Apr 2005 13:03] Alexander Barkov
Fixed in 4.1.11 and 5.0.4.

It appeared that in traditional Spanish collation
'RR' is not equal to 'R', as Unicode and Mimer state.

We'll go Oracle and IBM way instead:
There is no special rule to 'RR' anymore.
It is treated like R + R.

Those who used utc2_spanish2_ci and utf8_spanish2_ci should rebuild indexes.
After upgrade to 4.1.11 and 5.0.4
[26 Apr 2005 0:30] Paul DuBois
Noted in 4.1.11, 5.0.4 changelogs.
[20 Oct 2008 15:31] Leoncio Juan Ernesto López (Cholo)
I want to confirm that considering R = R + R is a good fix.

Spanish is like this: RR never appears at the begining of a word. If the word begins with that sound, only one R is written. In the middle of a word R and RR represent different sounds (strong and soft r). Soft r never happens in the begining of a word, only strong r does, and is written R, not RR. (for example: roca, not *rroca; also: caro != carro)

At the begining of a word, R and RR *would* be equal, but RR is not accepted at the begining of a word so there's no big deal if you treat RR as R + R. (you won't find the case, an if you do it's wrong anyway)