MySQL Bugs: #37595: Turkish unicode collation mistreats 'ï'

Bug #37595	Turkish unicode collation mistreats 'ï'
Submitted:	23 Jun 2008 20:34	Modified:	26 Jun 2008 17:58
Reporter:	Peter Laursen (Basic Quality Contributor)	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.0.51b	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Turkish unicode collation does not realize that in Turkish 'ï' is a distinct character.

How to repeat:
-- correct: 
set names latin5;
select 'ï' = 'i' collate latin5_turkish_ci; -- returns 0

-- incorrect:
set names utf8;
select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1

Suggested fix:
Unlike the more tricky discussions here:
http://bugs.mysql.com/bug.php?id=37571

.. I think this is a *plain and simple* bug!

Thank you for the bug report. I can't repeat, how you did the test (client and
OS version)?. Thanks in advance.

Enter password:
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 2
Server version: 5.0.51b-community-nt MySQL Community Edition (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> -- correct:
mysql> set names latin5;
Query OK, 0 rows affected (0.00 sec)

mysql> select 'ï' = 'i' collate latin5_turkish_ci; -- returns 0
+-------------------------------------+
| 'ï' = 'i' collate latin5_turkish_ci |
+-------------------------------------+
|                                   0 |
+-------------------------------------+
1 row in set (0.00 sec)

mysql>
mysql> -- incorrect:
mysql> set names utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> select 'ï' = 'i' collate utf8_turkish_ci; -- returns 1
+-----------------------------------+
| 'ï' = 'i' collate utf8_turkish_ci |
+-----------------------------------+
|                                 0 |
+-----------------------------------+
1 row in set (0.00 sec)

mysql>

good point!

I can confirm that I get the same as you with command line.  I was using SQLyog.  I tried Query Browser but it overwrites (or does not send) "set names latin5" stmt, so it is not usable here.

Server and client are running on Windows.

But I think it is still weird! When executing: 

set names latin5;
select 'ï' = 'i' collate latin5_turkish_ci;
set names utf8;
select 'ï' = 'i' collate utf8_turkish_ci;

... general log reads
3 Query       set names latin5
3 Query       select 'ï' = 'i' collate latin5_turkish_ci
3 Query       set names utf8
3 Query       select 'ï' = 'i' collate utf8_turkish_ci

so the server receives the statements it seems.
And still result for 1st stmt = '0' and for 2nd stmt = '1'

It is character_set_client that matters as this shows

set names latin5;
set character_set_client = utf8;
select 'ï' = 'i' collate latin5_turkish_ci; -- I get '1'

set names utf8;
set character_set_client = latin5;
select 'ï' = 'i' collate utf8_turkish_ci; -- I get '0'

Please keep this open for a few days while I get a little more information (unless of course the provided information gives you hints)!

Please see also Query Browser.  It actully also returns '1' for the statement.

SET NAMES does not work the same with commanline and libmysql?
Is this a more general problem that just a problem wiht 'ï' and/or a specific collation?

Am I missing something?

result '1' in QB

Attachment: qb.jpg (image/jpeg, text), 26.45 KiB.

sorry .. image was wrong .. new one coming!

utf8_turkish_ci this time

Attachment: qb2.jpg (image/jpeg, text), 25.39 KiB.

Peter,

Many thanks for reporting this.
I will check it and look for Turkish collation rules as well.

@migual ..

I think you are doing a mistake here.  MySQL command line on Windows is implemented in cmd.exe.  Command.exe has issues with special characters not define by current LOCALE.

In my opinion we are back to this

SELECT 'ï' = 'i' collate utf8_<*not bin*> = 1

This may be correct/practical with general and western collations, but for those languages (Turkish, and maybe also some Balkan languages and Hungarian ??) where ï is a distinct character I think it is a bug!

One could try:
* command-line on Linux
* a small C or PHP test application 
.. but I do not have those options now!

Also this (again) raises the discussion if cmd.exe is an appropriate 'framework' for MySQL command line client(s).  I think it never was since MySQL 4.1 was released! Of course it is preferable to use the shell or shell-like interface available on every platform if it is usable .. but cmd.exe is not usable I think!

.. and sorry for misspelling your name!

Hi Peter,

I just did an initial test with cmd on my most available box at the moment, that is the reason why I asked feedback from you :). Thanks for the feedback.

Peter Laursen,

sorry that I need so long. 
I had to stop analysis here because of a real critical other issue.

There is no ï in Turkish alphabet.

There is only "ı" and "i".

Look here: http://en.wikipedia.org/wiki/Turkish_alphabet

Did you mix here something?

ooops .. yes I mismatched 'i with umlaut' and 'i with no dot' ('ï' and 'ı').  Sorrry for this

-- so this is correct
set names utf8;
select 'i' = 'ı' collate utf8_turkish_ci; -- 0

-- only some inconsequence is here
set names utf8;
select 'ï' = 'i' collate utf8_turkish_ci; -- 1
set names latin5;
select 'ï' = 'i' collate latin5_turkish_ci; -- 0

.. but that is a general issue that UMLAUTs are treated differently like that in unicode/non-unicode collations where UMLAUTs are not 'native' for that language (and same for TILDE - an maybe even more - type of accents). I also described same in my 'Danish' thread.

You can choose whether you want to adjust the synopsis accordingly or close a *not a bug*.

I am sorry for this confusion.  But there was more confusion and mistakes here, I think! ...

Peter Laursen,

I will close this bug report here.

Consider, usually the languages have only sorting rules for their own alphabet.

I agree with you that latin1 and utf8 collation that are made for the same language should behave same way for the alphabet the language is using and for which it has sorting rules.

For all other signs it quasi doesn't matter.
Lets just take a mix of English and Greek alphabet.

There is no rule in English how to sort the Greek letters. 

When you ask me: I would prefer a sorting like: 
a,&alpha;,b,ß, and so on
When you would ask somebody else I am sure he would prefer: a,b,c,...z,&alpha;,ß, ..,Ω