MySQL Bugs: #20813: CENTRAL EUROPEAN COLLATION MAP - cp1250_general_ci; cp1250_czech

Bug #20813	CENTRAL EUROPEAN COLLATION MAP - cp1250_general_ci; cp1250_czech_cs
Submitted:	2 Jul 2006 10:16	Modified:	5 Aug 2006 18:39
Reporter:	Peter Cicman
Status:	No Feedback
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	4.0 + (older maybe too)	OS:	Any (ALL)
Assigned to:		CPU Architecture:	Any

Description:
I think there's one issue in the collation map used for cp1250. It concerns these letters:

[Ž]  142  08/14  216  0x8e  CAPITAL LETTER Z WITH CARON
[ž]  158  09/14  236  0x9e  SMALL LETTER Z WITH CARON
[Č]  200  12/08  310  0xc8  CAPITAL LETTER C WITH CARON
[č]  232  14/08  350  0xe8  SMALL LETTER C WITH CARON
[Š]  138  08/10  212  0x8a  CAPITAL LETTER S WITH CARON
[š]  154  09/10  232  0x9a  SMALL LETTER S WITH CARON 

I think they are similar enough to some other letters in cp1250 to be treated as equal to them, and that would help most Central European developers.

THAT MEANS:
0x8e ~ 0x8f ~ 0x9e ~ 0x9f ~ 0xaf ~ 0xbf ~ 0x5a ~ 0x7a

0xc8 ~ 0xe8 ~ 0x43 ~ 0x63

0x8a ~ 0x9a ~ 0x53 ~ 0x73  
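
To make the byte values above easier to read, here is a minimal sketch that decodes each group with Python's built-in cp1250 codec. The group labels (Z, C, S) and the helper name decode_group are only illustrative and do not come from MySQL.

```python
# Byte values of the three proposed equivalence groups, copied from the list above.
# The keys "Z", "C", "S" are illustrative labels only.
GROUPS = {
    "Z": [0x8E, 0x8F, 0x9E, 0x9F, 0xAF, 0xBF, 0x5A, 0x7A],
    "C": [0xC8, 0xE8, 0x43, 0x63],
    "S": [0x8A, 0x9A, 0x53, 0x73],
}


def decode_group(byte_values):
    """Decode a list of single-byte cp1250 code points into a string."""
    return bytes(byte_values).decode("cp1250")


for name, values in GROUPS.items():
    # Print the label followed by the decoded characters, space separated.
    print(name, " ".join(decode_group(values)))
```

The Z group therefore covers Ž, Ź, ž, ź, Ż, ż, Z and z, that is, every Z variant present in cp1250.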

May I ask, is there a way to compile my own good CE collation? Thank you.

How to repeat:
FOR EXAMPLE:
mysql> SELECT "á" LIKE "%a%";
+----------------+
| "á" LIKE "%a%" |
+----------------+
|              1 |
+----------------+
1 row in set (0.00 sec)

(IT'S OK)

mysql> SELECT "ž" LIKE "%ź%";
+----------------+
| "ž" LIKE "%ź%" |
+----------------+
|              1 |
+----------------+
1 row in set (0.00 sec)

(OK TOO)
----------------------------------------------------------------------------
AND THE PROBLEM LETTERS EXAMPLES:

mysql> SELECT "ž" LIKE "%z%";
+----------------+
| "ž" LIKE "%z%" |
+----------------+
|              0 |
+----------------+
1 row in set (0.00 sec)

mysql> SELECT "š" LIKE "%s%";
+----------------+
| "š" LIKE "%s%" |
+----------------+
|              0 |
+----------------+
1 row in set (0.00 sec)

mysql> SELECT "č" LIKE "%c%";
+----------------+
| "č" LIKE "%c%" |
+----------------+
|              0 |
+----------------+
1 row in set (0.00 sec)

It's the same for the uppercase letters (Š, Č, Ž), because the collation defines no relation between these characters and their base letters.
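
The results above can be reproduced outside MySQL with a rough model. The sketch below assumes that a one-character LIKE '%x%' match under cp1250_general_ci reduces to comparing the sort weights of the two characters; that is a simplification, not a description of the server code. The weights are copied from the cp1250_general_ci map quoted under "Suggested fix", and the names OLD_WEIGHTS and same_weight are hypothetical.

```python
# Sort weights copied from the cp1250_general_ci map quoted in this report.
# Keys are cp1250 byte values; only the characters used in the examples appear.
OLD_WEIGHTS = {
    0x61: 0x41, 0xE1: 0x41,              # a, á
    0x7A: 0x61, 0x9E: 0x62, 0x9F: 0x62,  # z, ž, ź
    0x73: 0x59, 0x9A: 0x5A,              # s, š
    0x63: 0x43, 0xE8: 0x44,              # c, č
}


def same_weight(a, b, weights=OLD_WEIGHTS):
    """Return True if two single characters have equal sort weight.

    Simplified model (an assumption): a one-character LIKE '%x%' match
    is treated as plain equality of the two characters' weights.
    """
    (byte_a,) = a.encode("cp1250")
    (byte_b,) = b.encode("cp1250")
    return weights[byte_a] == weights[byte_b]


for left, right in [("á", "a"), ("ž", "ź"), ("ž", "z"), ("š", "s"), ("č", "c")]:
    print(f"{left!r} vs {right!r}: {int(same_weight(left, right))}")
```

Under this model the first two pairs match and the last three do not, which agrees with the query results shown above.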

Suggested fix:
I think some changes are needed in the collation tables. I found the XML map in cp1250.xml.

OLD MAP FOR cp1250_general_ci LOOKS LIKE:

  00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F
  10  11  12  13  14  15  16  17  18  19  1A  1B  1C  1D  1E  1F
  20  21  22  23  24  25  26  27  28  29  2A  2B  2C  2D  2E  2F
  30  31  32  33  34  35  36  37  38  39  3A  3B  3C  3D  3E  3F
  40  41  42  43  46  49  4A  4B  4C  4D  4E  4F  50  52  53  55
  56  57  58  59  5B  5C  5D  5E  5F  60  61  63  64  65  66  67
  68  41  42  43  46  49  4A  4B  4C  4D  4E  4F  50  52  53  55
  56  57  58  59  5B  5C  5D  5E  5F  60  61  7B  7C  7D  7E  7F
  80  81  82  83  84  85  86  87  88  89  5A  8B  5A  5B  62  62
  90  91  92  93  94  95  96  97  98  99  5A  9B  5A  5B  62  62
  20  A1  A2  50  A4  41  A6  59  A8  A9  59  AB  AC  AD  AE  62
  B0  B1  B2  50  B4  B5  B6  B7  B8  41  59  BB  50  BD  50  62
  58  41  41  41  41  50  45  43  44  49  49  49  49  4D  4D  46
  47  53  53  55  55  55  55  D7  58  5C  5C  5C  5C  60  5B  59
  58  41  41  41  41  50  45  43  44  49  49  49  49  4D  4D  46
  47  53  53  55  55  55  55  F7  58  5C  5C  5C  5C  60  5B  FF
----------------------------------------------------------------------------
AND I THINK THAT IT SHOULD LOOK LIKE THIS:

  00  01  02  03  04  05  06  07  08  09  0A  0B  0C  0D  0E  0F
  10  11  12  13  14  15  16  17  18  19  1A  1B  1C  1D  1E  1F
  20  21  22  23  24  25  26  27  28  29  2A  2B  2C  2D  2E  2F
  30  31  32  33  34  35  36  37  38  39  3A  3B  3C  3D  3E  3F
  40  41  42  43  46  49  4A  4B  4C  4D  4E  4F  50  52  53  55
  56  57  58  59  5B  5C  5D  5E  5F  60  61  63  64  65  66  67
  68  41  42  43  46  49  4A  4B  4C  4D  4E  4F  50  52  53  55
  56  57  58  59  5B  5C  5D  5E  5F  60  61  7B  7C  7D  7E  7F
  80  81  82  83  84  85  86  87  88  89  *59  8B  5A  5B  *61  *61
  90  91  92  93  94  95  96  97  98  99  *59  9B  5A  5B  *61  *61
  20  A1  A2  50  A4  41  A6  59  A8  A9  59  AB  AC  AD  AE  *61
  B0  B1  B2  50  B4  B5  B6  B7  B8  41  59  BB  50  BD  50  *61
  58  41  41  41  41  50  45  43  *43  49  49  49  49  4D  4D  46
  47  53  53  55  55  55  55  D7  58  5C  5C  5C  5C  60  5B  59
  58  41  41  41  41  50  45  43  *43  49  49  49  49  4D  4D  46
  47  53  53  55  55  55  55  F7  58  5C  5C  5C  5C  60  5B  FF

(changes are marked with * (asterisk))
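
As a cross-check of the marked entries, the following sketch compares the rows of the two tables above that differ and lists each changed byte together with its cp1250 character. Only the six affected rows are copied in; the names OLD_ROWS, NEW_ROWS and changed_entries are illustrative.

```python
# Rows of the two maps that differ, copied from the tables above.
# Key: high nibble of the cp1250 byte; value: 16 weights for low nibbles 0..F.
OLD_ROWS = {
    0x8: "80 81 82 83 84 85 86 87 88 89 5A 8B 5A 5B 62 62",
    0x9: "90 91 92 93 94 95 96 97 98 99 5A 9B 5A 5B 62 62",
    0xA: "20 A1 A2 50 A4 41 A6 59 A8 A9 59 AB AC AD AE 62",
    0xB: "B0 B1 B2 50 B4 B5 B6 B7 B8 41 59 BB 50 BD 50 62",
    0xC: "58 41 41 41 41 50 45 43 44 49 49 49 49 4D 4D 46",
    0xE: "58 41 41 41 41 50 45 43 44 49 49 49 49 4D 4D 46",
}
NEW_ROWS = {
    0x8: "80 81 82 83 84 85 86 87 88 89 59 8B 5A 5B 61 61",
    0x9: "90 91 92 93 94 95 96 97 98 99 59 9B 5A 5B 61 61",
    0xA: "20 A1 A2 50 A4 41 A6 59 A8 A9 59 AB AC AD AE 61",
    0xB: "B0 B1 B2 50 B4 B5 B6 B7 B8 41 59 BB 50 BD 50 61",
    0xC: "58 41 41 41 41 50 45 43 43 49 49 49 49 4D 4D 46",
    0xE: "58 41 41 41 41 50 45 43 43 49 49 49 49 4D 4D 46",
}


def changed_entries(old_rows, new_rows):
    """Return (byte, character, old_weight, new_weight) for every changed entry."""
    changes = []
    for high in sorted(old_rows):
        old = [int(w, 16) for w in old_rows[high].split()]
        new = [int(w, 16) for w in new_rows[high].split()]
        for low, (o, n) in enumerate(zip(old, new)):
            if o != n:
                byte = high * 16 + low
                changes.append((byte, bytes([byte]).decode("cp1250"), o, n))
    return changes


for byte, char, old, new in changed_entries(OLD_ROWS, NEW_ROWS):
    print(f"0x{byte:02X} {char}: {old:02X} -> {new:02X}")
```

Ten entries change: Š and š move to the weight of S (59), the six Z variants move to the weight of Z (61), and Č and č move to the weight of C (43). Note that Ś and ś (0x8C, 0x9C) keep weight 5A in the proposed table, so they would no longer compare equal to Š and š.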

The same problem also exists in the cp1250_czech_cs collation.

Thank you for the problem report. Why do you think the result for SELECT "š" LIKE "%s%", for example, is wrong? Can you provide URLs/references to standards that prove that?

I have found this one, for example: http://www.ibphoenix.com/main.nfs?a=ibphoenix&page=ipb_win_cz

And, according to it, "s with caron" is different from "s"...

Please wait, I'm working on it.
I hope to reply soon.

Please, check bug #3444 also. Isn't it similar?

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".