Bug #15847 Wrong result when two "tone characters" are in use in tis-620 charset
Submitted: 19 Dec 2005 2:31 Modified: 7 Jun 2006 7:09
Reporter: Supiti Buranawatanachoke Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.16 OS:Linux (Linux)
Assigned to: Alexander Barkov CPU Architecture:Any

[19 Dec 2005 2:31] Supiti Buranawatanachoke
Description:
Thai language has "tone character." There seems to be some problems in sorting/comparing two different texts which contains at least two tone characters. Any text with only one "tone character" doesn't seem to have any problem. 

Please see how to repeat for more info. 

How to repeat:
I create a table with CHARSET=tis620 and insert two rows "
[19 Dec 2005 2:37] Supiti Buranawatanachoke
it seems like we can't post tis-620 characters here because my comment in the "how to repeat" section got cut off. 

I will put it in the txt file and post it in Files section.
[19 Dec 2005 2:40] Supiti Buranawatanachoke
How to repeat this bug, a text file with tis-620 encoding

Attachment: how_to_repeat (application/octet-stream, text), 934 bytes.

[19 Dec 2005 19:22] Aleksey Kishkin
mysql> select * from test where word='เด็กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
| เด้กน้อย |
+--------------------------+
2 rows in set (0.00 sec)

mysql> select * from test where word='เด้กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
| เด้กน้อย |
+--------------------------+
2 rows in set (0.00 sec)
[19 Dec 2005 19:29] Aleksey Kishkin
Supiti, if I set proper collation for table, retrieving works properly. I am setting 'Not a bug' status, if you disagree, please dont hesitate to reopen it.

mysql> CREATE TABLE `test` ( `word` varchar(100) default NULL ) ENGINE=MyISAM DEFAULT CHARSET=tis620 collate=tis620_thai_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO test (word) VALUES ('เด้กน้อย');
Query OK, 1 row affected (0.00 sec)

mysql> select * from test where word='เด็กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+
1 row in set (0.00 sec)

mysql> select * from test where word='เด็กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+
1 row in set (0.00 sec)

mysql> select * from test where word='เด้กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+
1 row in set (0.00 sec)
[19 Dec 2005 22:03] Supiti Buranawatanachoke
Hi, thanks for a really quick reply.

I looked at the result from last select you posted and it is actually incorrect as well.

mysql> select * from test where word='เด้กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+

The word in the query is 'เด้กน้อย' but the word in the result is 'เด็กน้อย'. They are actually different. Please notice the third character. It could be hard to see because they are very similar, if you want, I can take a screenshot and circle the difference for you.

Thanks again :)
[19 Dec 2005 22:14] Supiti Buranawatanachoke
Picture highlight the difference between the query and the result

Attachment: screenshot.gif (image/gif, text), 3.02 KiB.

[20 Dec 2005 15:11] Aleksey Kishkin
I see. I thought it's a second char. 
Verified.
[5 Jun 2006 12:42] Alexander Barkov
Dear Supiti,

According to the sources, these characters look like they
were intentionally made equal. Unfortunately, the original
authors, who mentioned in ctype-tis620.c, do not respond :(

Also, I've checked the Unicode default collation:
ftp://ftp.unicode.org/Public/UCA/latest/allkeys.txt
These two characters are ignorable on the primary level,
which makes them equal for comparison:

0E47  ; [.0000.011A.0002.0E47] # THAI CHARACTER MAITAIKHU
0E49  ; [.0000.011C.0002.0E49] # THAI CHARACTER MAI THO

notice, the first weight is 0000 for both characters,
which makes me more confident that the original
authors made these rules intentionally.

Probably, Thai has different rules:
- sometimes these characters can be considered as equal,
- and sometimes - as different?

Do you know any links to national standards of Thailand about
string comparison/sorting? 

Thanks!
[5 Jun 2006 20:10] Supiti Buranawatanachoke
็Hi,

Your speculation is correct. 

I'm not a linguistic but let me explain briefly about a word in Thai. A simple Thai word consists of an alphabet, a vowel, another alphabet (optional) and a tone indication character (optional). 

Comparing to an English word "can", "c" is the first alphabet, "a" is a vowel and "n" is another alphabet.

But the problem is that a vowel is not necessary one character. One vowel can use 1-4 characters and the place of each character is different. It can be on the leftmost side of a word, above the alphabet, at the back of the alphabet.

The tone indication is always used above the alphabet or above the vowel character (if it's place is above the alphabet). So it can be placed at the same place where the vowel character can go.

I found a website with a good illustration of the rule. It is not complete but it should be useful for you to see the picture of multiple-character vowel.
http://www.omniglot.com/writing/thai.htm

When you type, you go from the leftmost character and if there is something above it, type it before moving to the next character.

In our case of MAITAIKHU, it always has to be used with "SARA AE". For example, a word that we discussed before "เด็ก" which consists of four characters: "SARA AE", "DO DEK", "MAITAIKHU", "KO KAI".

But "SARA AE" combined with "MAITAIKHU" is only one vowel. And to sort them alphabetically like in Thai standard dictionary, you have to consider them as one vowel.

So what Thai standard dictionary does is to re-arrange them to "DO DEK", ["SARA AE", "MAITAIKHU"], "KO KAI" in order to sort.

But another word "เด้ก" which has the tone indicator "MAI THO" has a vowel that is only one character "SARA AE". So to sort, the rearrangement should be "DO DEK", "SARA AE" "MAI THO", "KOKAI"

I cannot find any website, that is written in English nor any algorithm that is usually used to sort Thai words. There should be one, I will ask around.

Sorry this is long. I hope you understand a bit more about Thai.

Thanks,
Supiti
[5 Jun 2006 20:13] Supiti Buranawatanachoke
Oops, sorry the rearrangement of "เด้ก" should be "DO DEK", "SARA AE", "KO KAI" , "MAITHO"

The tone indicator should be compared last.
[7 Jun 2006 7:07] Alexander Barkov
Supiti, thanks for your description and the links.
Unfortunately, I have to close your report as
not a bug. There is another link saying that ignoring
tonal marks on primary level is ok:
http://www-306.ibm.com/software/globalization/topics/thai/collation.jsp
So, tis620_thai_ci is ok, it's not buggy.

You just need a new collation, which will 
use both primary and secondary levels
for comparison, not just primary level
(like tis620_thai_ci does).
[7 Jun 2006 7:08] Alexander Barkov
As a temporary workaround, you can use:

select * from test where word=BINARY 'เด้กน้อย';

or use tis620_bin in CREATE TABLE.