Bug #21347 hebrew utf8 ORDER BY produces bad results in OpenBSD MySQL 5.0
Submitted: 30 Jul 2006 9:36 Modified: 29 Jun 2007 11:25
Reporter: Jay Gay Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S1 (Critical)
Version:5.0 OS:Any (Linux, OpenBSD)
Assigned to: Georgi Kodinov CPU Architecture:Any
Tags: hebrew, order by, utf8

[30 Jul 2006 9:36] Jay Gay
Description:
Ordering of utf8 encoded hebrew words produces results that are not alphabetically ordered in OpenBSD.

Upon investigation using the hex() function it seems that the hebrew words are not being stored using utf8 encoding but some strange encoding that I have never seen before prefixed by hex values C397. The ordering is following a hexadecimal logic but not one that is compliant with utf8 values which should be of the form 0xD7??

How to repeat:
mysql> select *,hex(convert (phrase using utf8)),hex(convert(phrase using hebrew)) from alphasorted1aleppo where count > 1000 order by hex(phrase);
+------------+-------+-------+----------------------------------------------+-----------------------------------+
| phrase     | place | count | hex(convert (phrase using utf8))             | hex(convert(phrase using hebrew)) |
+------------+-------+-------+----------------------------------------------+-----------------------------------+
| {ס}       |   423 |  2324 | 7BC397C2A17D                                 | 7BAA3F7D                          |
| {פ}       |    40 |  1273 | 7BC397C2A47D                                 | 7BAAA47D                          |
| אשר     |    48 |  4786 | C397C290C397C2A9C397C2A8                     | AA90AAA9AAA8                      |
| את       |     4 |  7188 | C397C290C397C2AA                             | AA90AA3F                          |
| אל       |    56 |  4051 | C397C290C397C593                             | AA90AA3F                          |
| איש     |   316 |  1106 | C397C290C397E284A2C397C2A9                   | AA90AA3FAAA9                      |
| על       |    13 |  3260 | C397C2A2C397C593                             | AAA2AA3F                          |
| עד       |   451 |  1003 | C397C2A2C397E2809C                           | AAA2AA3F                          |
| לא       |   193 |  3048 | C397C593C397C290                             | AA3FAA90                          |
| לו       |   278 |  1085 | C397C593C397E280A2                           | AA3FAA3F                          |
| מלך     |  1472 |  1023 | C397C5BEC397C593C397C5A1                     | AA3FAA3FAA3F                      |
| בני     |   704 |  1675 | C397E28098C397C2A0C397E284A2                 | AA3FAAA0AA3F                      |
| בן       |   632 |  1623 | C397E28098C397C5B8                           | AA3FAA3F                          |
| בית     |  1337 |  1022 | C397E28098C397E284A2C397C2AA                 | AA3FAA3FAA3F                      |
| המלך   |  1570 |  1014 | C397E2809DC397C5BEC397C593C397C5A1           | AA3FAA3FAA3FAA3F                  |
| ואת     |     6 |  2220 | C397E280A2C397C290C397C2AA                   | AA3FAA90AA3F                      |
| ולא     |   326 |  1494 | C397E280A2C397C593C397C290                   | AA3FAA3FAA90                      |
| ויאמר |    19 |  2061 | C397E280A2C397E284A2C397C290C397C5BEC397C2A8 | AA3FAA3FAA90AA3FAAA8              |
| כל       |   114 |  2854 | C397E280BAC397C593                           | AA3FAA3F                          |
| כי       |    25 |  3968 | C397E280BAC397E284A2                         | AA3FAA3F                          |
| ישראל |  3616 |  2095 | C397E284A2C397C2A9C397C2A8C397C290C397C593   | AA3FAAA9AAA8AA90AA3F              |
| יהוה   |   187 |  5811 | C397E284A2C397E2809DC397E280A2C397E2809D     | AA3FAA3FAA3FAA3F                  |
+------------+-------+-------+----------------------------------------------+-----------------------------------+
22 rows in set (0.13 sec)

Suggested fix:
Use the correct encoding.
[30 Jul 2006 9:56] Jay Gay
Here is the table creation:

 CREATE TABLE `aleppo1phrase` (
  `phrase` varchar(255) collate utf8_unicode_ci NOT NULL,
  `count` int(11) NOT NULL,
  PRIMARY KEY  (`phrase`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci

As you can see the charset is utf8 and the collate is utf8_unicode_ci.
[2 Aug 2006 19:52] Sveta Smirnova
Thank you for the report.

Could you please provide dump of about 10 rows which allows me to repeat the problem?
[3 Aug 2006 6:33] Sveta Smirnova
Please, also provide output of SHOW VARIABLES LIKE '%char%'; statement and statement you used to insert values in the table.
[2 Sep 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[8 May 2007 15:10] liron wachs
hello,
I was searching for a solution for my bug through google and found this bug that was reported more than half a year ago and represents my bug exactly - the ORDER BY is not ordering hebrew words that are encoded in utf8 in the correct order.I couldn't find here any solutionn for it.
Do you have a solution for this bug?

thanks.
[8 May 2007 18:04] Omer Barnir
The above list in the correct Hebrew order (added a column referancing the original one)

Attachment: correct order.pdf (application/pdf, text), 15.87 KiB.

[8 May 2007 18:13] Sveta Smirnova
Thank you for the report.

Verified as described.
[8 May 2007 18:13] Sveta Smirnova
test case

Attachment: bug21347.test (application/octet-stream, text), 503 bytes.

[29 Jun 2007 10:37] Sveta Smirnova
Test with corrected connection collation

Attachment: bug21347.test (application/octet-stream, text), 546 bytes.

[29 Jun 2007 11:25] Georgi Kodinov
Thank you for taking the time to write to us, but this is not a bug. Please double-check the documentation available at http://dev.mysql.com/doc/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

There seem to be two problems with this report :
1. utf-8 literals are passed to the server while the character set used by the server to parse sql statements (character_set_client) is not utf-8 (but the default latin1). This is checked by :
 SHOW VARIABLES LIKE 'character_set_client'
This returns latin1 (by default).
 This causes the server to interpret the utf-8 as latin1 and encode it again as utf-8 : hence the strange codes that are seen through hex().

2. "ORDER BY HEX(string)" is not the same as "ORDER BY string" : that's why even character_set_client (or SET NAMES utf8) is set to utf-8 the original hebrew character sequences appear unsorted.
[29 Jun 2007 11:26] Georgi Kodinov
a script that produces correct results

Attachment: bug.test (, text), 735 bytes.