MySQL Bugs: #3444: Case sensitivity in czech comparisons

Bug #3444	Case sensitivity in czech comparisons
Submitted:	12 Apr 2004 4:13	Modified:	18 Jan 2018 13:01
Reporter:	Tomas Tikovsky
Status:	Closed
Category:	MySQL Server: Charsets	Severity:	S4 (Feature request)
Version:	4.1.1	OS:	Windows (winxp)
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
Hi,

I'm quite confused about how to deal with this issue. Maybe it's my fault, but I don't understand it.

The manual says that Czech is always case sensitive, but Mr. Golubov said it should work in this version of MySQL. I wanted to use fulltext indexes, but this makes them useless: when searching for česky I also want results for Česky, and that is not working. It's the same in "where like" clauses. If there is any way to make MySQL do CASE INSENSITIVE comparisons of accented characters, I would be very pleased if you could advise me. If it's not MySQL's fault, please accept my apologies; I just didn't get it from the manual. Thanks for any help.
Regards
Tomas Tikovsky

How to repeat:
Default server setup:
mysql> show variables like "c%";
+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_server     | latin1                   |
| character_set_system     | utf8                     |
| character_set_database   | latin1                   |
| character_set_client     | latin1                   |
| character_set_connection | latin1                   |
| character-sets-dir       | D:\mysql\share\charsets/ |
| character_set_results    | latin1                   |
| collation_connection     | latin1_swedish_ci        |
| collation_database       | latin1_swedish_ci        |
| collation_server         | latin1_swedish_ci        |
| concurrent_insert        | ON                       |
| connect_timeout          | 5                        |
+--------------------------+--------------------------+

mysql> select "s"="S";
+---------+
| "s"="S" |
+---------+
|       1 |
+---------+
1 row in set (0.00 sec)

That's the expected behaviour, but I need this to happen with accented characters in the Czech language as well, as in š=Š (lowercase and uppercase "s" with a caron, also called a háček or wedge).
So I've set up the server as follows. I'm on Windows with the cp1250 charset, so I used that.

mysql> show variables like "c%";
+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_server     | cp1250                   |
| character_set_system     | utf8                     |
| character_set_database   | cp1250                   |
| character_set_client     | cp1250                   |
| character_set_connection | cp1250                   |
| character-sets-dir       | D:\mysql\share\charsets/ |
| character_set_results    | cp1250                   |
| collation_connection     | cp1250_czech_ci          |
| collation_database       | cp1250_czech_ci          |
| collation_server         | cp1250_czech_ci          |
| concurrent_insert        | ON                       |
| connect_timeout          | 5                        |
+--------------------------+--------------------------+

--------------------------------------------------------------------------

mysql> select "š"="Š";
+---------+
| "š"="Š" |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

mysql> select "s"="S";
+---------+
| "s"="S" |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

This could indicate that comparison is case sensitive.
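
A minimal sketch for narrowing this down, assuming a server that ships the cp1250 character set: list the collations available for cp1250, then repeat the comparison with an explicit introducer and COLLATE clause so the result does not depend on the session defaults. The choice of cp1250_general_ci is an assumption about what the server offers, so check the SHOW COLLATION output first.

```sql
-- List the cp1250 collations this server knows about.
SHOW COLLATION LIKE 'cp1250%';

-- Compare plain ASCII letters under an explicitly chosen collation.
SELECT _cp1250's' COLLATE cp1250_general_ci = _cp1250'S' AS ascii_equal;

-- Same test for š (0x9A in cp1250) and Š (0x8A in cp1250), written as
-- hex so the client character set cannot alter the bytes.
SELECT _cp1250 0x9A COLLATE cp1250_general_ci = _cp1250 0x8A AS caron_equal;
```

Swapping cp1250_czech_ci into the COLLATE clauses shows whether that particular collation is the one distinguishing case.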

Trying to change category to server. I mistyped it.

OK, you are right. Confirmed.
According to the comments in ctype-win1250ch.c (you can find the file with
"grep cp1250_czech_ci"), the comparison is indeed case-sensitive, so that is how the original contributor implemented it.

The very least we have to do is rename the collation to cp1250_czech_cs ("cs" stands for "case sensitive"). Obviously, that will not help you, though :)
Another solution would, of course, be to make ctype-win1250ch.c do case-insensitive comparison.
Whether we can do it (and when), you'll hear from Alexander Barkov, who is the developer behind our character set code. He will also reply to your email to internals@.

Also, you may try a workaround: use latin2 with the latin2_czech_ci collation for the database/server, and cp1250 as the client charset only. Then all data will be stored and compared using latin2_czech_ci, which should work case-insensitively, and will be converted to cp1250 before being sent to the client.
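
The suggested setup could look roughly like this. It is a sketch only: the database name czech_demo is hypothetical, and the collation name is the one used in this report, so if the server rejects it, SHOW COLLATION LIKE 'latin2%' shows what it is called there.

```sql
-- Store and compare data in latin2 (czech_demo is a hypothetical name).
CREATE DATABASE czech_demo CHARACTER SET latin2 COLLATE latin2_czech_ci;

-- Tell the server that this client sends and expects cp1250 bytes;
-- the server converts between cp1250 and the column character set.
SET NAMES cp1250;

-- Check what the session ended up with.
SHOW VARIABLES LIKE 'character_set%';
SHOW VARIABLES LIKE 'collation%';
```

Note that a comparison between two string literals, such as select "s"="S", is evaluated with collation_connection, while comparisons against a column use that column's collation, so both are worth testing.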

Thanks for the reply, but I think the latin2_czech_ci collation is unfortunately case-sensitive as well.

mysql> show variables like "c%";
+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_server     | latin2                   |
| character_set_system     | utf8                     |
| character_set_database   | latin2                   |
| character_set_client     | latin2                   |
| character_set_connection | latin2                   |
| character-sets-dir       | C:\Mysql\share\charsets/ |
| character_set_results    | latin2                   |
| collation_connection     | latin2_czech_ci          |
| collation_database       | latin2_czech_ci          |
| collation_server         | latin2_czech_ci          |
| concurrent_insert        | ON                       |
+--------------------------+--------------------------+

mysql> select "s"="S";
+---------+
| "s"="S" |
+---------+
|       0 |
+---------+
1 row in set (0.00 sec)

Well, as far as I know, Czech must be case sensitive when sorting results. But this behaviour in fulltext indexes is a bit unpleasant. I'm available for any help you might need, because it would help a lot of people who are despairing over it. Thanks in advance.

Regards
Tomas Tikovsky

We should definitely add case- and accent-insensitive counterparts
to both the latin2 and cp1250 Czech collations. I will find out as soon
as possible whether the case-sensitive code can be reused.

See also worklog item WL#1875

This has been idle for two years. Needs revisiting. :)

[30 Nov 2017 23:53] Xing Z Zhang 

This has been fixed in 4.1.3 by adding new collations: utf8_czech_ci, ucs2_czech_ci, utf16_czech_ci and utf32_czech_ci.
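
A minimal sketch of using one of these collations, assuming a server version that includes utf8_czech_ci. The table articles and its column body are hypothetical names used only for illustration.

```sql
-- Compare š (U+0161) and Š (U+0160), given as UTF-8 hex so the
-- result does not depend on the client character set.
SELECT _utf8 0xC5A1 COLLATE utf8_czech_ci
     = _utf8 0xC5A0 COLLATE utf8_czech_ci AS same_letter;

-- Convert a text column so that comparisons and indexes on it
-- use the case-insensitive Czech collation.
ALTER TABLE articles
  MODIFY body TEXT CHARACTER SET utf8 COLLATE utf8_czech_ci;
```

Because the collation name ends in _ci, the first statement should report the two letters as equal. After the column is converted, LIKE and fulltext searches on it follow the column's collation rather than the connection defaults.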