Bug #9604 Database does not find records when searching using utf8_general_ci collation
Submitted: 4 Apr 2005 11:21 Modified: 8 Jan 2009 15:18
Reporter: Maciek Dobrzanski Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:4.1.10 OS:Linux (Linux 2.6)
Assigned to: CPU Architecture:Any

[4 Apr 2005 11:21] Maciek Dobrzanski
Description:
There is a problem with mapping polish latin2 and cp1250 character to its multilingual equivalent when searching using utf8_general_ci collation: uppercase '&#321;' (0xb3) <-> 'L' and lowercase '&#322;' (0xa3) <-> 'l'. The characters have the same ASCII codes in both latin2 and cp1250 standards. Other polish language specific characters are mapped properly.

How to repeat:
mysql> SET CHARACTER SET latin2;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `test` ( `txt` varchar(1) character set utf8 collate utf8_bin NOT NULL default '', PRIMARY KEY (`txt`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Query OK, 0 rows affected (1.58 sec)

mysql> INSERT INTO test VALUES (_latin2'a'), (_latin2'±'), (_latin2'c'), (_latin2'&#263;'), (_latin2'&#322;'), (_latin2'l');
Query OK, 6 rows affected (0.00 sec)
Records: 6  Duplicates: 0  Warnings: 0

mysql> SELECT * FROM test WHERE txt = CONVERT(_latin2'a' USING utf8) COLLATE utf8_general_ci;
+-----+
| txt |
+-----+
| a   |
| ±   |
+-----+
2 rows in set (0.00 sec)
The query has returned 2 rows - OK

mysql> SELECT * FROM test WHERE txt = CONVERT(_latin2'±' USING utf8) COLLATE utf8_general_ci;
+-----+
| txt |
+-----+
| a   |
| ±   |
+-----+
2 rows in set (0.01 sec)
The query has returned 2 rows - OK

mysql> SELECT * FROM test WHERE txt = CONVERT(_latin2'c' USING utf8) COLLATE utf8_general_ci;
+-----+
| txt |
+-----+
| c   |
| &#263;   |
+-----+
2 rows in set (0.00 sec)
The query has returned 2 rows - OK

mysql> SELECT * FROM test WHERE txt = CONVERT(_latin2'l' USING utf8) COLLATE utf8_general_ci;
+-----+
| txt |
+-----+
| l   |
+-----+
1 row in set (0.00 sec)
!!! The query has returned only 1 row - the record returned by the query below is MISSING !!!

mysql> SELECT * FROM test WHERE txt = CONVERT(_latin2'&#322;' USING utf8) COLLATE utf8_general_ci;
+-----+
| txt |
+-----+
| &#322;   |
+-----+
1 row in set (0.00 sec)
!!! The query has returned only 1 row - the record returned by the previous query is MISSING !!!
[24 Apr 2005 21:34] Hartmut Holzgraefe
is this the same probelm as in bug 9476?

http://bugs.mysql.com/9476
[25 Apr 2005 8:01] Maciek Dobrzanski
No, they do not seem to be the same. In my case MySQL incorrectly maps single character, thus affecting results of the comparsion operartors '=', 'LIKE' and 'MATCH ... AGAINST' and probably others, which are performend on utf8 fields in utf8_general_ci collation.

It's similar to http://bugs.mysql.com/bug.php?id=7878 but not exactly the same bug. And it's still present even in the latest 5.0.4-nt.
[25 Apr 2005 17:16] Maciek Dobrzanski
Test case for this bug.

Attachment: test.sql (application/octet-stream, text), 1.31 KiB.

[25 Apr 2005 17:17] Maciek Dobrzanski
I have added a test case file for your convenience.
[25 May 2005 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[26 May 2005 8:12] Maciek Dobrzanski
How on earth this bug has been closed because of 'no feedback' reason? I gave you guys an answer to your question almost two months ago, and even the complete test case. It seems I'm just wasting my time doing this.
[26 May 2005 8:14] Maciek Dobrzanski
This bug is not fixed.
[15 Jun 2005 22:19] Peter Gulutzan
Looking at this with MySQL Version 5.0.8 and 4.1.13, I find that ordering of special Polish characters (such as L with stroke, or O with acute) appears to be correct if one uses utf8_polish_ci. There is no statement in the manual (section 10.11.1 Unicode Character Sets) that ordering would be "correct" (for Polish) if one uses utf8_general_ci. Accordingly I am changing this to "not a bug". I apologize for the delay in responding.
[13 Jul 2005 9:13] Maciek Dobrzanski
The problem is not the ordering of the Polish accented characters, which is probably fine. The problem is that 'ł' is not equal to 'l' both lowercase and uppercase when comparsion is done in utf8_general_ci collation.

Please take a look at bug http://bugs.mysql.com/bug.php?id=9607 and the "additional info" part of the final reply from Alexander Barkov. ł and l should be recognized as two different characters when collation is set to utf8_polish_ci, but not when utf8_general_ci is in use.
[16 Sep 2005 19:58] Gerardo Narvaja
Could not reproduce the error with the script that was submitted. I kept getting 'duplicate key' while trying to insert the values.
[22 Sep 2005 16:55] Valeriy Kravchuk
I tried to use test.sql script provided and got the following results on Fedora Core 1 with 4.1.15-debug:

mysql> source /home/openxs/test.sql
Query OK, 0 rows affected (0,28 sec)
 
Query OK, 0 rows affected (0,00 sec)
 
Query OK, 1 row affected (0,07 sec)
 
Query OK, 1 row affected (0,00 sec)
 
Query OK, 1 row affected (0,08 sec)
 
ERROR 1062 (23000): Duplicate entry '?' for key 1
Query OK, 1 row affected (0,07 sec)
 
ERROR 1062 (23000): Duplicate entry '?' for key 1
+-----------------------------------------------+
| info                                          |
+-----------------------------------------------+
| The query should return two rows: "a" and "?" |
+-----------------------------------------------+
1 row in set (0,07 sec)
... 

So, as previously noted, I simply can't insert 2 of your rows because of the duplicate keys errors. So, the results of test.sql run is not as yours, because table is not the same...

Would you, please, provide some additional information about your environment, namely, values of LANG, LANGVAR and LC_* environment variables. My LANG was ru_RU.UTF-8.
[22 Sep 2005 16:55] Valeriy Kravchuk
I tried to use test.sql script provided and got the following results on Fedora Core 1 with 4.1.15-debug:

mysql> source /home/openxs/test.sql
Query OK, 0 rows affected (0,28 sec)
 
Query OK, 0 rows affected (0,00 sec)
 
Query OK, 1 row affected (0,07 sec)
 
Query OK, 1 row affected (0,00 sec)
 
Query OK, 1 row affected (0,08 sec)
 
ERROR 1062 (23000): Duplicate entry '?' for key 1
Query OK, 1 row affected (0,07 sec)
 
ERROR 1062 (23000): Duplicate entry '?' for key 1
+-----------------------------------------------+
| info                                          |
+-----------------------------------------------+
| The query should return two rows: "a" and "?" |
+-----------------------------------------------+
1 row in set (0,07 sec)
... 

So, as previously noted, I simply can't insert 2 of your rows because of the duplicate keys errors. So, the results of test.sql run is not as yours, because table is not the same...

Would you, please, provide some additional information about your environment, namely, values of LANG, LANGVAR and LC_* environment variables. My LANG was ru_RU.UTF-8.
[22 Oct 2005 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[17 May 2006 9:06] Michal Koscielny
This bug still exists in mysql 5.0.15. Here is a simple way to reproduce it, run this query:

SELECT 'l' = 'ł' COLLATE utf8_general_ci;

or:

SELECT 'L' = 'Ł' COLLATE utf8_general_ci

The result is 0 in both cases whereas it should be 1! The letter 'ł' (uppercase 'Ł') is the only accented Polish letter that doesn't correctly compare to its unaccented equivalent in utf8_general_ci. There are others like 'ą', 'ć', 'ę', 'ń', etc. and they work fine:

SELECT 'a' = 'ą' COLLATE utf8_general_ci

SELECT 'c' = 'ć' COLLATE utf8_general_ci

return 1. 'L' is the unaccented equivalent of 'Ł' in the same way as 'A' is the unaccented equivalent of 'Ą' - this is how it is in Polish language and this should be fixed as developers have to implement ugly workarounds for this inconsistency.
[4 Jan 2009 2:42] Rafa‚ Wrzeszcz
I confirm, this is a bug. Many polish developers who I was talking with, confirmed that and also I came here because it happens to me. Yes "Ł" should be found as "L" (and vice versa and also lower letters). So just please somehow fix that and stop trying to tell us that "it is feature not a bug" as it is obviously a bug.

I managed to reproduce this bug on my production server (5.0.11) and development server (5.0.67). All other characters work well, but just not this one.

C'mon it's (nearly) 4 years already! It is very serious in many cases.
[7 Jan 2009 14:06] Rafa‚ Wrzeszcz
Upgraded local version to 5.1.30. Same BUG! >: Please wake up. It is really important in many cases like mapping URLs for example.
[8 Jan 2009 14:37] Susanne Ebrecht
Many thanks for pointing this out.

I will make two statements here. First I will show you my test and then I will comment it:

Here is my test and the results:

I used Mysql 5.1 and my environment encoding is utf8 so I used client encoding utf8 to store all right in all different charsets like utf8, latin2, cp1250 and ucs2:

SET NAMES UTF8;

DROP TABLE if exists t1;

CREATE TABLE t1(
col_utf8_general_ci varchar(10)
character set utf8 collate utf8_general_ci,
col_utf8_unicode_ci varchar(10)
character set utf8 collate utf8_unicode_ci,
col_utf8_bin varchar(10)
character set utf8 collate utf8_bin,
col_utf8_polish_ci varchar(10)
character set utf8 collate utf8_polish_ci,
col_latin2_general_ci varchar(10)
character set latin2 collate latin2_general_ci,
col_latin2_bin varchar(10)
character set latin2 collate latin2_bin,
col_cp1250_general_ci varchar(10)
character set cp1250 collate cp1250_general_ci,
col_cp1250_polish_ci varchar(10)
character set cp1250 collate cp1250_polish_ci,
col_ucs2_bin varchar(10)
character set ucs2 collate ucs2_bin,
col_ucs2_unicode_ci varchar(10)
character set ucs2 collate ucs2_unicode_ci,
col_ucs2_polish_ci varchar(10)
character set ucs2 collate ucs2_polish_ci);

insert into t1 values(
'l','l','l','l','l',
'l','l','l','l','l','l'),
('ł','ł','ł','ł','ł',
'ł','ł','ł','ł','ł','ł'),
('L','L','L','L','L',
'L','L','L','L','L','L'),
('Ł','Ł','Ł','Ł','Ł',
'Ł','Ł','Ł','Ł','Ł','Ł');

select col_utf8_general_ci from t1 where col_utf8_general_ci = 'l';
select col_utf8_general_ci from t1 where col_utf8_general_ci like 'l';
select col_utf8_general_ci from t1 where col_utf8_general_ci = 'ł';
select col_utf8_general_ci from t1 where col_utf8_general_ci like 'ł';

select col_utf8_unicode_ci from t1 where col_utf8_unicode_ci = 'l';
select col_utf8_unicode_ci from t1 where col_utf8_unicode_ci like 'l';
select col_utf8_unicode_ci from t1 where col_utf8_unicode_ci = 'ł';
select col_utf8_unicode_ci from t1 where col_utf8_unicode_ci like 'ł';

select col_utf8_bin from t1 where col_utf8_bin = 'l';
select col_utf8_bin from t1 where col_utf8_bin like 'l';
select col_utf8_bin from t1 where col_utf8_bin = 'ł';
select col_utf8_bin from t1 where col_utf8_bin like 'ł';

select col_utf8_polish_ci from t1 where col_utf8_polish_ci = 'l';
select col_utf8_polish_ci from t1 where col_utf8_polish_ci like 'l';
select col_utf8_polish_ci from t1 where col_utf8_polish_ci = 'ł';
select col_utf8_polish_ci from t1 where col_utf8_polish_ci like 'ł';

select col_latin2_general_ci from t1 where col_latin2_general_ci = 'l';
select col_latin2_general_ci from t1 where col_latin2_general_ci like 'l';
select col_latin2_general_ci from t1 where col_latin2_general_ci = 'ł';
select col_latin2_general_ci from t1 where col_latin2_general_ci like 'ł';

select col_latin2_bin from t1 where col_latin2_bin = 'l';
select col_latin2_bin from t1 where col_latin2_bin like 'l';
select col_latin2_bin from t1 where col_latin2_bin = 'ł';
select col_latin2_bin from t1 where col_latin2_bin like 'ł';

select col_cp1250_general_ci from t1 where col_cp1250_general_ci = 'l';
select col_cp1250_general_ci from t1 where col_cp1250_general_ci like 'l';
select col_cp1250_general_ci from t1 where col_cp1250_general_ci = 'ł';
select col_cp1250_general_ci from t1 where col_cp1250_general_ci like 'ł';

select col_cp1250_polish_ci from t1 where col_cp1250_polish_ci = 'l';
select col_cp1250_polish_ci from t1 where col_cp1250_polish_ci like 'l';
select col_cp1250_polish_ci from t1 where col_cp1250_polish_ci = 'ł';
select col_cp1250_polish_ci from t1 where col_cp1250_polish_ci like 'ł';

select col_ucs2_bin from t1 where col_ucs2_bin = 'l';
select col_ucs2_bin from t1 where col_ucs2_bin like 'l';
select col_ucs2_bin from t1 where col_ucs2_bin = 'ł';
select col_ucs2_bin from t1 where col_ucs2_bin like 'ł';

select col_ucs2_unicode_ci from t1 where col_ucs2_unicode_ci = 'l';
select col_ucs2_unicode_ci from t1 where col_ucs2_unicode_ci like 'l';
select col_ucs2_unicode_ci from t1 where col_ucs2_unicode_ci = 'ł';
select col_ucs2_unicode_ci from t1 where col_ucs2_unicode_ci like 'ł';

select col_ucs2_polish_ci from t1 where col_ucs2_polish_ci = 'l';
select col_ucs2_polish_ci from t1 where col_ucs2_polish_ci like 'l';
select col_ucs2_polish_ci from t1 where col_ucs2_polish_ci = 'ł';
select col_ucs2_polish_ci from t1 where col_ucs2_polish_ci like 'ł';

Let me summarise the results:

All results were the same for LIKE and '=', so I will show you only one of them.

These columns have had the same result:

col_utf8_bin,
col_latin2_bin,
col_ucs2_bin

select <column name> from t1 where <column name> like 'l';
+----------------+
| <column name>  |
+----------------+
| l              | 
+----------------+
select <column name> from t1 where <column name> like 'ł';
+----------------+
| <column name>  |
+----------------+
| ł              | 
+----------------+

Because the collations were binary collations this is a totally correct result.

The following seven different collation have had the following result:

col_utf8_general_ci, 
col_utf8_unicode_ci,
col_utf8_polish_ci,
col_latin2_general_ci,
col_cp1250_polish_ci,
col_ucs2_unicode_ci,
col_ucs2_polish_ci

select <column name> from t1 where <column name> like 'l';
+-------------------+
| <column name>     |
+-------------------+
| l                 | 
| L                 | 
+-------------------+
select <column name> from t1 where <column name> like 'ł';
+-------------------+
| <column name>     |
+-------------------+
| ł                 | 
| Ł                 | 
+-------------------+

Here I will say more in my second statement ...

The collation cp1250_general_ci was different:

select col_cp1250_general_ci from t1 where col_cp1250_general_ci like 'l';
+-----------------------+
| col_cp1250_general_ci |
+-----------------------+
| l                     | 
| ł                     | 
| L                     | 
| Ł                     | 
+-----------------------+

select col_cp1250_general_ci from t1 where col_cp1250_general_ci like 'ł';
+-----------------------+
| col_cp1250_general_ci |
+-----------------------+
| l                     | 
| ł                     | 
| L                     | 
| Ł                     | 
+-----------------------+

This could be a Microsoft rule ...

Just to look which behaviour is more common I looked into libc (LC_COLLATE) by using another DBMS (locale under linux: pl_PL.utf8):

select v from t where lower(v) like 'l';
Results: l and L
select v from t where lower(v) like 'ł';
Results: ł and Ł

This shows also collation implementation of libc make a difference between l and ł.
[8 Jan 2009 15:18] Susanne Ebrecht
Summary:

There is non bug in utf8_general_ci, utf8_unicode_ci and ucs2_unicode_ci. These three are working correct according to the Unicode Technical Standard (Unicode Collation Algorithm): http://unicode.org/reports/tr10/tr10-17.html

You will find there:
"Data tables for 4.1.0 contain the following changes:"
...
"The change of weights for characters Æ, Ǽ, Ǣ; Đ, Ð; Ħ; Ł, Ŀ; and Ø, Ǿ (and their lowercase and accented forms) to have secondary (accent) differences from AE; D; H; L; and O, respectively. This is to provide a much better default for languages in which those characters are not tailored. See also the section on user expectations."

This means that l/L is not equal to ł/Ł. The letters l and L have a higher priority as ł and Ł and so ł will follow l in sorting algorithm.

There is also not a bug latin2_general_ci. Language like Czech have published sorting rules and they really don't want that ł is equal to l. They want a differentiation here.

The collation cp1250_general_ci could have a bug. But cp1250 is a Microsoft charset and we looked to the Microsoft collation implementation here and there ł is treated as l. So also here we don't have a bug.

We have three separate collations for Polish:
utf8_polish_ci, cp1250_polish_ci and ucs2_polish_ci

All three behave the same way and this way is the same as the unicode standard way; ł is not equal to l.

This was implemented that way because we got complains about ł is equal to l in cp1250_general_ci at the past from Polish users.

When you look into wikipedia.org you will find several collation rules for several languages like Czech or German or whatever but non rules for Polish.

Also Google is not giving a result for official sorting rules for Polish translated to English.

Unfortunately, we don't have Polish employees and we can't find official Polish sorting rules.

Also the opinions about Polish collation seems to be split into two groups here. Some complain that ł and l aren't equal others complain that they are equal. What is right?

We won't change the behaviour of the Polish collation as long as nobody from the community is able to provide us a link with the official Polish sorting rules.

As long as we don't have official rules here, this is not a bug. We took here what is most common and given in unicode standard. My tests with libc showed that the behaviour ł is not equal to l is common in Unix/Linux world and lots of other software and also other RDBMS are sorting like we do it.

I will set this to Not a Bug now. Please feel free to open the bug report again when you have a link for us or will be able to add a file here with the official Polish sorting rules. Of course the optimum would be when this is given in English, but most countries only provide this rule in native languages. So it doesn't matter when these rules are in Polish, I am sure we will find an employee who is able to speak Polish enough to translate them.
[10 Jan 2009 0:37] Michal Koscielny
Susanne, thanks for explaining your reasoning but I don't think you fully understand this issue. You will not find any Polish rules saying that ł is equal to l in sorting because it simply is not. ł always follows l in sorting as all other accented characters follow their non-accented counterparts. I don't know how official this is for you but here http://en.wikipedia.org/wiki/Polish_language you can find a table will all Polish characters in their correct sorting order, something that we are all taught here in primary schools. You say:

> We have three separate collations for Polish:
> utf8_polish_ci, cp1250_polish_ci and ucs2_polish_ci
>
> All three behave the same way and this way is the
> same as the unicode standard way; ł is not equal to l.
>
> This was implemented that way because we got complains
> about ł is equal to l in cp1250_general_ci at the past from Polish users

And that is fully correct, ł is not equal to l and nobody from us wants mysql to treat them as equal in any of the Polish collations, that is utf8_polish_ci, cp1250_polish_ci and ucs2_polish_ci - so these are implemented correctly.

The only problem is with utf8_general_ci, a non-Polish collation, which treats accented letters as equal to their unaccented counterparts, so all of these will return 1:

SELECT 'a' = 'ą' COLLATE utf8_general_ci;
SELECT 'c' = 'ć' COLLATE utf8_general_ci;
SELECT 'e' = 'ę' COLLATE utf8_general_ci;
SELECT 'n' = 'ń' COLLATE utf8_general_ci;
SELECT 'o' = 'ó' COLLATE utf8_general_ci;

The only exception is this:
SELECT 'l' = 'ł' COLLATE utf8_general_ci;

which returns 0. Can you see the inconsistency? I realize mysql developers try to follow some unicode standards but I wouldn't be surprised if the standards were buggy themselves since it's possible not many Polish people worked on them or verified them. But for any of us Poles this is plain obvious.

Now please take the time to make another small simple test, which should make it more obvious to you. Create a simple table with one char or varchar column having utf8_general_ci collation:

CREATE TABLE letters (
  letter char(1) COLLATE=utf8_polish_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

And fill it with all letters from the Polish alphabet, one in each row. And then run this:

SELECT letter FROM letters ORDER BY letter;

Where would you expect the letter ł to be? It comes last! At the very end after z, ź and ż. This contradicts all sorting rules and common sense as well. I hope this will make it clear enough for you as a non-Polish speaking person. ł and l need to be made equal but only in utf8_general_ci, in all Polish collations they should be different (and they already are).

I don't know how to reopen this bug, perhaps I cannot do it as a non-developer so please can someone do it? It would be good if more Polish people contributed to this thread if my and other people's comments were not enough to illustrate it clearly.
[10 Jan 2009 0:44] Michal Koscielny
SORRY, I made a mistake in the sql create statement above, it should be like this:

CREATE TABLE letters (
  letter char(1) COLLATE=utf8_general_ci NOT NULL
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

Obviously, this column should be utf8_general_ci and not utf8_polish_ci to illustrate my point.
[18 Jan 2009 2:02] James Czuchra
I have encountered the same problem that Mr. Koscielny describes.  It's true that the behavior is consistent with the collation standard.  The bigger question is, how can we change the collation so that it does what we expect it to do?

We need ALL the Polish characters to equate to their English equivalents in utf8_general_ci.  It makes no sense to have ignored this one character.
[4 Nov 2009 21:54] Peter Gulutzan
I support the judgment of Susanne Ebrecht.

Some people might be satisfied to learn that there has been
a change to the Unicode Collation Algorithm rules, and we
will support that (for utf8_unicode_ci not utf8_general_ci).
To see the description of our worklog task, look at
http://forge.mysql.com/worklog/task.php?id=2673
and click "High Level Architecture".
[18 Feb 2010 8:44] Marek Kmiecik
Polish sorting rules is almost the same as English.
Letter order (upper and lower have the same weigth) list:
A a, Ą ą, B b, C c, Ć ć, D d, E e, F f, G g, H h, I i, J j, K k, L l, Ł ł, M m, N n, O o, Ó ó, P p, R r, S s, Ś ś, T t, U u, W w, Y y, Z z, Ź ź, Ż ż

http://en.wikipedia.org/wiki/Polish_alphabet

Upper and lower letters have the same weigth. For example, if first letter is the same (whatever upper or lower), the order is determined on the basis of second character in the word. When two first letters are the same, then third letter determie order of words. The more important is first letter, if in several word is the same we sort by second letter. And so on.

Samle:
Aaram
Abram
Abyka
Acer
Ackow
Ącki
Ącow
......
.....
and so on

Please correct this bug collate "Ł" to and "ł" to "l" in utf8_general_ci. When I search "l", the words with "ł" are not finding.
[22 Apr 2011 19:56] Pierre Timmermans
This is really annoying. The following author was inserted in my database (books database) via a lookup in the Bnf (bibliothèque nationale de France) : Czesław Miłosz.

Now, when one search my site for the string Milosz (we can't even type a ł on our keyboards), he won't find the entry. My collation is utf8_general_ci

From what I read above, it is obviously a bug. In any case, do a search on google for Czeslaw Milosz, and you will find the entry, so it proves that it is the expected behavior.
[20 Aug 2011 0:25] Marcin Wasilewski
Dear Susanne,

I am surprised to find this to be such controversy whereas this issue is truly straightforward for every Pole. In times when accent-insensitive auto-completion is key part of any major website or program, this behaviour is just unacceptable. It is not difficult to imagine the additional overhead of the queries that working around this problem produce. 

Anyway, I found an article in wikipedia which clearly describes the collation rules in Polish language: 

"In Polish, specifically Polish letters derived from the Latin alphabet are collated after their originals: A, Ą, B, C, Ć, D, E, Ę, ..., L, Ł, M, N, Ń, O, Ó, P, ..., S, Ś, T, ..., Z, Ź, Ż. The digraphs for collation purposes are treated as if they were two separate letters."

Source: http://en.wikipedia.org/wiki/Collating_sequence

The letter "Ł" is indeed derived from the letter "L" and was introduced as a separate letter in polish language around XVI century, rarely used until just a century ago. Many elderlies in Poland still pronounce words with "ł" as they were written with "l". This is all documented in the wikipedia article on the letter in question which I highly encourage you to read - http://en.wikipedia.org/wiki/%C5%81

It proved difficult for me to acquire other sources of information beyond those mentioned in wikipedia articles but if someone could point me in the right direction - what specifically is required by MySQL's standards  - I would be willing to explore this issue further.
[21 Aug 2011 21:37] Peter Gulutzan
If there was any evidence that MySQL's Polish
collation -- which is utf8_polish_ci not
utf8_general_ci -- did not follow standard
Polish rules, there could be a valid feature
request for a new collation. However, this
is not a bug.
[21 Aug 2011 21:48] Rafa‚ Wrzeszcz
What?! Are you kidding me? It _IS_ a bug! What evidence you want more then we shown you? We know how our language works and we are telling you it _IS_ wrong, but you know it better? How obvious it is, yet how unclear to you?
[21 Aug 2011 22:50] Marcin Wasilewski
Dear Peter,

Thank you for your answer but the subject of concern are both utf8_general_ci and utf8_unicode_ci, not the polish one which is indeed correct: the general collation puts ł at the end of the alphabet and both of them differentiate between l and ł.

Here is a tool which will help you discover the proper sorting:
http://demo.icu-project.org/icu-bin/locexp?_=en_US&x=col

While I understand you want to comply with the Unicode standards, I would like to remind you that the report quoted here previously (http://unicode.org/reports/tr10/tr10-17.html#Tailoring) is only a draft document and is prone to change. Furthermore, I find it absurd how the different weight of ł "is to provide a much better default for languages in which those characters are not tailored". I could probably agree with the rest of this revision as it is mostly about characters that are widely used across the world (Æ,Ø), however according to wikipedia, the letter ł does not exist in any major language other than polish. Which is also why I am so surprised why Czech, or anyone else really would care about either this particular differentiation or sorting. I would be grateful if anyone could explain this approach provided that I failed to grasp something here. 

I completely understand that it is the easiest to comply with a written standard but I believe we have already proven that this standard is wrong. On my end, I will try to contact the authors of the aforementioned report, nevertheless it would be great if somebody here took a closer look. I believe I prepared all the evidence needed.

"In countries where Ł is not available, basic L is used instead. Thus, the surname Małecki would be spelled Malecki in a foreign country. Similarly, the stroke is sometimes omitted on the internet, as may happen with all diacritic-enhanced letters."

Source: http://en.wikipedia.org/wiki/%C5%81
[22 Aug 2011 16:56] Peter Gulutzan
It seems to me that everything has been adequately
explained multiple times. But comments still come.
So I'll try once more, explaining in a different way.

We have these collations and rules for L with stroke:
utf8_polish_ci      L with stroke greater than L and less than M
utf8_unicode_ci     L with stroke greater than L and less than M
utf8_unicode_520_ci L with stroke equal to L
utf8_general_ci     L with stroke greater than Z

For utf8_polish_ci: The tailoring guide that we intend
to follow for most situations is CLDR (Common Locale Data
Repository) http://cldr.unicode.org/index/process.
The relevant file for Polish is pl.xml. The rule there is
that L with stroke greater than L and less than M.
MySQL follows this rule. Commenters who refer to
Polish rules are therefore pointing to rules which
MySQL already follows. When people ask us to do what
we're already doing, it is "not a bug".

For utf8_unicode_ci: The default Unicode collation
that we followed when developing utf8_unicode_ci was
Unicode Collation Algorithm 4.0.
http://www.unicode.org/Public/UCA/4.0.0/allkeys-4.0.0.txt
In that version L with stroke greater than L and
less than M. It is documented in our reference manual
that utf8_unicode_ci follows version 4.0 of the
standard. When people object to behaviour which
does what the manual says according to design
specifications which have valid basis, it is "not a bug".

For utf8_unicode_520_ci: The default Unicode collation
that we followed when developing utf8_unicode_520_ci was
(as the substring '520' is meant to imply) Unicode
Collation Algorithm 5.2.
http://www.unicode.org/Public/UCA/5.2.0/allkeys.txt
In that version L with stroke equal to L.
That is, the Unicode standard changed, and eventually
MySQL developed an appropriate collation which follows
it -- in MySQL 5.6. MySQL 5.6 is only available in
our milestones, that is, you can download it and it
works, but it lacks our "generally available" status.
Nevertheless, this is what some commenters on this
thread have asked for, and they're getting it. When
MySQL introduces a feature in order to follow the
Unicode standard and to respond to users' requests,
it is "not a bug".

For utf8_general_ci: this is an old collation which
sometimes works more quickly than other collations,
but does not follow standard or language-specific
rules for some characters, such as L with stroke.
Of course we never said it does, so on that basis
alone it is "not a bug".

Why don't we just change the rules for utf8_general_ci,
instead of introducing new collations with new rules?
Well, as a matter of fact, that happened for another
rule affecting German Sharp S in our version 5.1.
The results were catastrophic, because collation
affects order of keys in an index, and when index
keys are out of the expected order then searches
fail (Bug#37046 etc.). The only solution is to rebuild
the index and when we have customers with multiple
billion rows that's hugely inconvenient. This change
was a stupid error, we have sworn not to repeat it.

Why don't we introduce new collations such as
utf8_unicode_520_ci more quickly? Partly because
with InnoDB there was a limit on the number of
collations, but this has been resolved in MySQL
5.6. Partly because there is a policy that involves
a lengthy period between releases. For an explanation
of our development cycle, see
http://forge.mysql.com/wiki/Development_Cycle.
Delays can be frustrating and are valid grounds
for criticism, but our principal developer for
collation-related matters (Mr Alexander Barkov)
is doing an outstanding job addressing them all.
[22 Aug 2011 18:31] Marcin Wasilewski
Dear Peter,

Thank you for your patience and time to elaborate further on this matter. I understand now why changing existing collation rules introduces problems but it was not indicated before; I was also unaware of the existence of unicode_520_ci and I wish it was brought to our attention earlier. That would probably cut off unnecessary discussion.
[22 Aug 2011 18:48] Pierre Timmermans
If I want to stick to utf8_general_ci collation, then I will have to live with a work-around in all my queries to take this letter L into account. I understand that rebuilding an index can be a big task, but I still find it better than to go on forever with that annoyance in the utf8_general collation.

Of course I could change my collation to utf8_unicode_520_ci, I have to figure out what will be the impact.

But in any case, thanks for the clear explanations, now I understand your point.

Regards, Pierre