Bug #7976 UTF-8 collation is not correct for some Macedonian Cyrillic letters
Submitted: 18 Jan 2005 0:22 Modified: 18 Oct 2006 17:35
Reporter: damjan
Status: Not a Bug
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.* OS:Linux (Linux)
Assigned to: Alexander Barkov CPU Architecture:Any

[18 Jan 2005 0:22] damjan
Description:
Some Cyrillic letters that are used only in the Macedonian alphabet are not sorted correctly in a SELECT ... ORDER BY ... statement when they are written in the UTF-8 encoding.
What MySQL returns is this:
Ѕ ѕ Ј ј Љ љ Њ њ Џ џ А а Б б В в Г Ѓ г ѓ Д д Е е Ж ж З з И и К Ќ к ќ Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш
While the correct collation should return this:
А а Б б В в Г г Д д Ѓ ѓ Е е Ж ж З з Ѕ ѕ И и Ј ј К к Л л Љ љ М м Н н Њ њ О о П п Р р С с Т т Ќ ќ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш

I hope you can see the Cyrillic characters; if not, please suggest a better way for me to show them to you.

The problem is in the first five letters that MySQL returned; in addition, the letters Ѓ and Ќ should be placed after д and т, respectively.
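
One possible contributor to the misplaced letters, offered as an assumption and not as a statement about MySQL internals: the Macedonian-specific capitals all have code points below Cyrillic А (U+0410), so any ordering based mainly on code points would put them first. A minimal Python 3 sketch that prints the code points (the names `CYRILLIC_A`, `LETTERS` and `describe` are made up for this illustration):

```python
import unicodedata

# Cyrillic capital A, the first letter of the basic Cyrillic block.
CYRILLIC_A = "\u0410"

# The Macedonian-specific capitals from the report: Ѕ Ј Љ Њ Џ, then Ѓ and Ќ.
LETTERS = "\u0405\u0408\u0409\u040A\u040F\u0403\u040C"


def describe(ch):
    """Return the code point and Unicode name of a single character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))


# Letters whose code point is lower than that of Cyrillic A.
below_a = [ch for ch in LETTERS if ord(ch) < ord(CYRILLIC_A)]

for ch in LETTERS:
    print(describe(ch))
```

Note that Ѓ and Ќ are also below U+0410, yet the MySQL output places them next to Г and К, so a plain code point sort does not fully explain the observed order.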

How to repeat:
I've made a small Python script that prints the order MySQL returns and the proper one. This is its output:
character_set_client : utf8
character_set_connection : utf8
character_set_database : latin1
character_set_results : utf8
character_set_server : latin1
character_set_system : utf8
character_sets_dir : /usr/share/mysql/charsets/
collation_connection : utf8_general_ci
collation_database : latin1_swedish_ci
collation_server : latin1_swedish_ci
Ѕ ѕ Ј ј Љ љ Њ њ Џ џ А а Б б В в Г г Ѓ ѓ Д д Е е Ж ж З з И и К к Ќ ќ Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш
А а Б б В в Г г Д д Ѓ ѓ Е е Ж ж З з Ѕ ѕ И и Ј ј К к Л л Љ љ М м Н н Њ њ О о П п Р р С с Т т Ќ ќ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш
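
The two output lines above can also be compared mechanically. The following is a rough Python 3 sketch, not part of the attached script; `first_mismatch` and `sort_by_alphabet` are hypothetical helpers, and the two strings are copied from the output above.

```python
# Order returned by MySQL (first output line above).
MYSQL_ORDER = ("Ѕ ѕ Ј ј Љ љ Њ њ Џ џ А а Б б В в Г г Ѓ ѓ Д д Е е Ж ж З з И и "
               "К к Ќ ќ Л л М м Н н О о П п Р р С с Т т У у Ф ф Х х Ц ц Ч ч Ш ш").split()

# Proper Macedonian alphabet order (second output line above).
EXPECTED_ORDER = ("А а Б б В в Г г Д д Ѓ ѓ Е е Ж ж З з Ѕ ѕ И и Ј ј К к Л л Љ љ "
                  "М м Н н Њ њ О о П п Р р С с Т т Ќ ќ У у Ф ф Х х Ц ц Ч ч Џ џ Ш ш").split()


def first_mismatch(actual, expected):
    """Return the index of the first differing position, or None if equal."""
    for i, (a, e) in enumerate(zip(actual, expected)):
        if a != e:
            return i
    if len(actual) != len(expected):
        return min(len(actual), len(expected))
    return None


def sort_by_alphabet(letters, alphabet):
    """Sort letters by their position in an explicit alphabet list."""
    rank = {ch: i for i, ch in enumerate(alphabet)}
    return sorted(letters, key=lambda ch: rank[ch])


print(first_mismatch(MYSQL_ORDER, EXPECTED_ORDER))
print(" ".join(sort_by_alphabet(MYSQL_ORDER, EXPECTED_ORDER)))
```

Sorting by an explicit alphabet list like this is only a client-side check; it does not change how the server orders rows.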

It makes no difference if I ALTER the database test to a UTF-8 collation. This is the script:
#! /usr/bin/python2.3
# -*- coding: utf-8 -*-

DB='test'
USER='root'
PASS=''

MACEDONIAN_LETTERS = unicode('АаБбВвГгДдЃѓЕеЖжЗзЅѕИиЈјКкЛлЉљМмНнЊњОоПпРрСсТтЌќУуФфХхЦцЧчЏџШш', 'utf-8')
MACEDONIAN_LETTERS = list(MACEDONIAN_LETTERS) # Actually I need a list

# Set the locale, important for collation
import locale
locale.setlocale(locale.LC_ALL, 'mk_MK.UTF-8')

import MySQLdb

def showvariables(db):
    c = db.cursor()
    c.execute('show variables')
    for var in iter(c.fetchone, None):
        if var[0].startswith('character') or var[0].startswith('collation'):
            print var[0], ':', var[1]

db = MySQLdb.connect(user=USER, passwd=PASS, db=DB)
c = db.cursor()
c.execute('set character set utf8')
c.execute('set names utf8')
showvariables(db)

try:
    c.execute(
        '''CREATE TABLE `alphabet` (
          `letter` longtext NOT NULL
          ) ENGINE=MyISAM
          DEFAULT CHARACTER SET utf8
    ''')
except:
    # Probably already exists (although other errors are possible)
    c.execute('truncate `alphabet`')

for letter in MACEDONIAN_LETTERS:
  c.execute('''insert into alphabet set letter = %s''', letter.encode('utf-8'))

c.execute('select letter from alphabet order by letter asc')
for result in iter(c.fetchone, None):
  print result[0],
print

for letter in MACEDONIAN_LETTERS:
  print letter.encode('utf-8'),
print
[18 Jan 2005 0:24] damjan
python script

Attachment: mk-coll-test.py (application/octet-stream, text), 1.39 KiB.

[18 Jan 2005 0:27] damjan
Since the bug reporting form mangled what I wrote, I've attached the Python script and its output to this bug report.

I also forgot to mention that this script uses MySQLdb-1.1.8 for Python, compiled against the installed MySQL-4.1.7 libraries.
[18 Jan 2005 0:28] damjan
Output of the script

Attachment: output.txt (text/plain), 714 bytes.

[20 Jan 2005 11:25] Aleksey Kishkin
Hi! Unfortunately, the web-based bug system converts all non-Latin characters to numbers, which makes the issue hard to understand. Could you please put your message into a text file (for example, a text file in UTF-8) and attach it to this issue?
[20 Jan 2005 11:34] Aleksey Kishkin
Sorry, I already see it in 'output'.
[20 Jan 2005 11:47] Aleksey Kishkin
Tested on Slackware 10, MySQLdb (mysql-python 1.0.1), mysqld 4.1.9.
[21 Jan 2005 10:55] Alexander Barkov
Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

Hello Damjan,

This is not a bug. You should use the collation utf8_unicode_ci
to get Macedonian letters sorted correctly.

Please use this CREATE query:

CREATE TABLE alphabet
(
 letter longtext NOT NULL
) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Then insert the letters and try sorting again.
It will return the results in the correct order.
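
For the test script, the suggestion could be applied roughly as follows. This is a Python 3 sketch under assumptions: `recreate_alphabet` is a hypothetical helper, `cursor` is any DB-API cursor that uses the `%s` parameter style (as MySQLdb does), and the table is dropped first because the original script only truncates an existing table, which would keep that table's old collation.

```python
# CREATE statement from the reply above, with the utf8_unicode_ci collation.
CREATE_ALPHABET = """
CREATE TABLE alphabet
(
 letter longtext NOT NULL
) DEFAULT CHARACTER SET utf8 COLLATE utf8_unicode_ci
"""


def recreate_alphabet(cursor, letters):
    """Rebuild the test table with the new collation and return sorted rows.

    Assumption: dropping the table is acceptable because it only holds
    test data.
    """
    cursor.execute("DROP TABLE IF EXISTS alphabet")
    cursor.execute(CREATE_ALPHABET)
    for letter in letters:
        # Parameters are passed separately so the driver does the quoting.
        cursor.execute("INSERT INTO alphabet SET letter = %s", (letter,))
    cursor.execute("SELECT letter FROM alphabet ORDER BY letter ASC")
    return [row[0] for row in cursor.fetchall()]
```

With a real connection this would be called as `recreate_alphabet(db.cursor(), MACEDONIAN_LETTERS)`, where `db` and `MACEDONIAN_LETTERS` are the names used in the attached script.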
[21 Jan 2005 12:34] damjan
Thank you, I didn't know that I had to use utf8_unicode_ci.
I confirm that it works even with mysql-4.1.7.
[18 Oct 2006 16:38] [ name withheld ]
I've encountered this problem too, but my point here is different. I'm just not OK with referring to these letters as Macedonian and mentioning the term 'Macedonian alphabet'. Macedonia is a country situated next to Bulgaria, which is where I'm from. Macedonia also used to be Bulgarian territory some decades ago. Bulgaria was established in 681 AD and has its own alphabet, Bulgarian, which is referred to as Macedonian here. You can investigate further. So just to let you guys know. Also, I'll be very thankful if you change the topic.
[18 Oct 2006 17:35] damjan
[ name withheld ] get a life
[17 Nov 2009 22:06] Igor Janevski
OMG, really get a life. What you said is also not applicable here, because the Bulgarian language does not have letters like J, Lj, Kj, etc. (I cannot write them in Cyrillic here). On behalf of the people who are looking for technical help and answers, please don't pollute this site.