MySQL Bugs: #9269: bug in characterset / collation

Bug #9269	bug in characterset / collation
Submitted:	18 Mar 2005 10:20	Modified:	26 Apr 2005 0:30
Reporter:	[ name withheld ]	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: Command-line Clients	Severity:	S3 (Non-critical)
Version:	4.1.10a	OS:	Windows (windows 2k)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:

Using characterset utf8 with collation utf8_spanish2_ci, the varchar 'sr' ar equal to 'srr'.

I think that it's a Bug in collation rules.

NOTE:

I'm using utf8_spanish2_ci because utf8_spanish_ci also have a similar bug: vocals acuted (by example "a" and "á") MySQL say equals! but they are distinct!
And latin1_spanish_ci also have a similar bug: n and n-tilde ("ñ"), MySQL say equals! but they are distinct!

How to repeat:

My server:

server version: 4.1.10a-nt

My configuration (only this changes in my.ini):

[client]
default-character-set=utf8

[mysqld]
default-character-set=utf8
default-collation=utf8_spanish2_ci

Test my configuration:

mysql> SHOW VARIABLES LIKE 'char%';
+--------------------------+----------------------------------------------------------------+
| Variable_name            | Value            |
+--------------------------+----------------------------------------------------------------+
| character_set_client     | utf8            |
| character_set_connection | utf8            |
| character_set_database   | utf8            |
| character_set_results    | utf8            |
| character_set_server     | utf8            |
| character_set_system     | utf8            |
| character_sets_dir       | C:\Archivos de programa\MySQL\MySQL Server 4.1\share\charsets/ |
+--------------------------+----------------------------------------------------------------+

mysql> SHOW VARIABLES LIKE 'coll%';
+----------------------+------------------+
| Variable_name        | Value            |
+----------------------+------------------+
| collation_connection | utf8_general_ci  |
| collation_database   | utf8_spanish2_ci |
| collation_server     | utf8_spanish2_ci |
+----------------------+------------------+

The bug:

mysql> CREATE DATABASE test DEFAULT CHARACTER SET utf8 COLLATE utf8_spanish2_ci;

mysql> USE test;

mysql> CREATE TABLE t (c VARCHAR(4)) DEFAULT CHARACTER SET utf8 COLLATE utf8_spanish2_ci TYPE=InnoDB;

mysql> INSERT INTO t VALUES('sr');

mysql> SELECT * FROM t WHERE c = 'srr';
+------+
| c    |
+------+
| sr   |
+------+

BUG: 'sr' <> 'srr' !!! the result must be empty!!!

Suggested fix:

I think that it's a Bug in collation rules.

Fix the spanish collation in utf8 and latin1.

The support of spanish charset/collation it's poor.

Thank you for your bug report.  I verified this bug using 4.1.11 from bk

Fixed in 4.1.11 and 5.0.4.

It appeared that in traditional Spanish collation
'RR' is not equal to 'R', as Unicode and Mimer state.

We'll go Oracle and IBM way instead:
There is no special rule to 'RR' anymore.
It is treated like R + R.

Those who used utc2_spanish2_ci and utf8_spanish2_ci should rebuild indexes.
After upgrade to 4.1.11 and 5.0.4

Noted in 4.1.11, 5.0.4 changelogs.

I want to confirm that considering R = R + R is a good fix.

Spanish is like this: RR never appears at the begining of a word. If the word begins with that sound, only one R is written. In the middle of a word R and RR represent different sounds (strong and soft r). Soft r never happens in the begining of a word, only strong r does, and is written R, not RR. (for example: roca, not *rroca; also: caro != carro)

At the begining of a word, R and RR *would* be equal, but RR is not accepted at the begining of a word so there's no big deal if you treat RR as R + R. (you won't find the case, an if you do it's wrong anyway)