Bug #37595 Turkish unicode collation mistreats 'ï'
Submitted: 23 Jun 2008 20:34 Modified: 26 Jun 2008 17:58
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0.51b OS:Any
Assigned to: CPU Architecture:Any

[23 Jun 2008 20:34] Peter Laursen
Description:
Turkish unicode collation does not realize that in Turkish 'ï' is a distinct character.

How to repeat:
-- correct: 
set names latin5;
select 'ï' = 'i' collate latin5_turkish_ci; -- returns 0

-- incorrect:
set names utf8;
select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1

Suggested fix:
Unlike the more tricky discussions here:
http://bugs.mysql.com/bug.php?id=37571

.. I think this is a *plain and simple* bug!
[23 Jun 2008 21:17] MySQL Verification Team
Thank you for the bug report. I can't repeat, how you did the test (client and
OS version)?. Thanks in advance.

Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.51b-community-nt MySQL Community Edition (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> -- correct:
mysql> set names latin5;
Query OK, 0 rows affected (0.00 sec)

mysql> select 'ï' = 'i' collate latin5_turkish_ci; -- returns 0
+-------------------------------------+
| 'ï' = 'i' collate latin5_turkish_ci |
+-------------------------------------+
|                                   0 |
+-------------------------------------+
1 row in set (0.00 sec)

mysql>
mysql> -- incorrect:
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1
+-----------------------------------+
| 'ï' = 'i' collate utf8_turkish_ci |
+-----------------------------------+
|                                 0 |
+-----------------------------------+
1 row in set (0.00 sec)

mysql>
[23 Jun 2008 22:48] Peter Laursen
good point!

I can confirm that I get the same as you with command line.  I was using SQLyog.  I tried Query Browser but it overwrites (or does not send) "set names latin5" stmt, so it is not usable here.

Server and client are running on Windows.

But I think it is still weird! When executing: 

set names latin5;
select 'ï' = 'i' collate latin5_turkish_ci;
set names utf8;
select 'ï' = 'i' collate utf8_turkish_ci;

... general log reads
3 Query       set names latin5
3 Query       select 'ï' = 'i' collate latin5_turkish_ci
3 Query       set names utf8
3 Query       select 'ï' = 'i' collate utf8_turkish_ci

so the server receives the statements it seems.
And still result for 1st stmt = '0' and for 2nd stmt = '1'

It is character_set_client that matters as this shows

set names latin5;
set character_set_client = utf8;
select 'ï' = 'i' collate latin5_turkish_ci; -- I get '1'

set names utf8;
set character_set_client = latin5;
select 'ï' = 'i' collate utf8_turkish_ci; -- I get '0'

Please keep this open for a few days while I get a little more information (unless of course the provided information gives you hints)!
[24 Jun 2008 7:48] Peter Laursen
Please see also Query Browser.  It actully also returns '1' for the statement.

SET NAMES does not work the same with commanline and libmysql?
Is this a more general problem that just a problem wiht 'ï' and/or a specific collation?

Am I missing something?
[24 Jun 2008 7:49] Peter Laursen
result '1' in QB

Attachment: qb.jpg (image/jpeg, text), 26.45 KiB.

[24 Jun 2008 8:30] Peter Laursen
sorry .. image was wrong .. new one coming!
[24 Jun 2008 8:31] Peter Laursen
utf8_turkish_ci this time

Attachment: qb2.jpg (image/jpeg, text), 25.39 KiB.

[24 Jun 2008 9:58] Susanne Ebrecht
Peter,

Many thanks for reporting this.
I will check it and look for Turkish collation rules as well.
[25 Jun 2008 9:38] Peter Laursen
@migual ..

I think you are doing a mistake here.  MySQL command line on Windows is implemented in cmd.exe.  Command.exe has issues with special characters not define by current LOCALE.

In my opinion we are back to this

SELECT 'ï' = 'i' collate utf8_<*not bin*> = 1

This may be correct/practical with general and western collations, but for those languages (Turkish, and maybe also some Balkan languages and Hungarian ??) where ï is a distinct character I think it is a bug!

One could try:
* command-line on Linux
* a small C or PHP test application 
.. but I do not have those options now!

Also this (again) raises the discussion if cmd.exe is an appropriate 'framework' for MySQL command line client(s).  I think it never was since MySQL 4.1 was released! Of course it is preferable to use the shell or shell-like interface available on every platform if it is usable .. but cmd.exe is not usable I think!
[25 Jun 2008 9:40] Peter Laursen
.. and sorry for misspelling your name!
[25 Jun 2008 11:01] MySQL Verification Team
Hi Peter,

I just did an initial test with cmd on my most available box at the moment, that is the reason why I asked feedback from you :). Thanks for the feedback.
[25 Jun 2008 12:08] Susanne Ebrecht
Peter Laursen,

sorry that I need so long. 
I had to stop analysis here because of a real critical other issue.

There is no ï in Turkish alphabet.

There is only "ı" and "i".

Look here: http://en.wikipedia.org/wiki/Turkish_alphabet

Did you mix here something?
[26 Jun 2008 13:27] Peter Laursen
ooops .. yes I mismatched 'i with umlaut' and 'i with no dot' ('ï' and 'ı').  Sorrry for this

-- so this is correct
set names utf8;
select 'i' = 'ı' collate utf8_turkish_ci; -- 0

-- only some inconsequence is here
set names utf8;
select 'ï' = 'i' collate utf8_turkish_ci; -- 1
set names latin5;
select 'ï' = 'i' collate latin5_turkish_ci; -- 0

.. but that is a general issue that UMLAUTs are treated differently like that in unicode/non-unicode collations where UMLAUTs are not 'native' for that language (and same for TILDE - an maybe even more - type of accents). I also described same in my 'Danish' thread.

You can choose whether you want to adjust the synopsis accordingly or close a *not a bug*.

I am sorry for this confusion.  But there was more confusion and mistakes here, I think! ...
[26 Jun 2008 17:58] Susanne Ebrecht
Peter Laursen,

I will close this bug report here.

Consider, usually the languages have only sorting rules for their own alphabet.

I agree with you that latin1 and utf8 collation that are made for the same language should behave same way for the alphabet the language is using and for which it has sorting rules.

For all other signs it quasi doesn't matter.
Lets just take a mix of English and Greek alphabet.

There is no rule in English how to sort the Greek letters. 

When you ask me: I would prefer a sorting like: 
a,&alpha;,b,ß, and so on
When you would ask somebody else I am sure he would prefer: a,b,c,...z,&alpha;,ß, ..,Ω