MySQL Bugs: #15847: Wrong result when two "tone characters" are in use in tis-620 charset

Bug #15847	Wrong result when two "tone characters" are in use in tis-620 charset
Submitted:	19 Dec 2005 2:31	Modified:	7 Jun 2006 7:09
Reporter:	Supiti Buranawatanachoke	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S2 (Serious)
Version:	5.0.16	OS:	Linux (Linux)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
Thai language has "tone character." There seems to be some problems in sorting/comparing two different texts which contains at least two tone characters. Any text with only one "tone character" doesn't seem to have any problem. 

Please see how to repeat for more info. 

How to repeat:
I create a table with CHARSET=tis620 and insert two rows "

it seems like we can't post tis-620 characters here because my comment in the "how to repeat" section got cut off. 

I will put it in the txt file and post it in Files section.

How to repeat this bug, a text file with tis-620 encoding

Attachment: how_to_repeat (application/octet-stream, text), 934 bytes.

mysql> select * from test where word='เด็กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
| เด้กน้อย |
+--------------------------+
2 rows in set (0.00 sec)

mysql> select * from test where word='เด้กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
| เด้กน้อย |
+--------------------------+
2 rows in set (0.00 sec)

Supiti, if I set proper collation for table, retrieving works properly. I am setting 'Not a bug' status, if you disagree, please dont hesitate to reopen it.

mysql> CREATE TABLE `test` ( `word` varchar(100) default NULL ) ENGINE=MyISAM DEFAULT CHARSET=tis620 collate=tis620_thai_ci;
Query OK, 0 rows affected (0.01 sec)

mysql> INSERT INTO test (word) VALUES ('เด้กน้อย');
Query OK, 1 row affected (0.00 sec)

mysql> select * from test where word='เด็กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+
1 row in set (0.00 sec)

mysql> select * from test where word='เด็กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+
1 row in set (0.00 sec)

mysql> select * from test where word='เด้กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+
1 row in set (0.00 sec)

Hi, thanks for a really quick reply.

I looked at the result from last select you posted and it is actually incorrect as well.

mysql> select * from test where word='เด้กน้อย';
+--------------------------+
| word                     |
+--------------------------+
| เด็กน้อย |
+--------------------------+

The word in the query is 'เด้กน้อย' but the word in the result is 'เด็กน้อย'. They are actually different. Please notice the third character. It could be hard to see because they are very similar, if you want, I can take a screenshot and circle the difference for you.

Thanks again :)

Picture highlight the difference between the query and the result

Attachment: screenshot.gif (image/gif, text), 3.02 KiB.

I see. I thought it's a second char. 
Verified.

Dear Supiti,

According to the sources, these characters look like they
were intentionally made equal. Unfortunately, the original
authors, who mentioned in ctype-tis620.c, do not respond :(

Also, I've checked the Unicode default collation:
ftp://ftp.unicode.org/Public/UCA/latest/allkeys.txt
These two characters are ignorable on the primary level,
which makes them equal for comparison:

0E47  ; [.0000.011A.0002.0E47] # THAI CHARACTER MAITAIKHU
0E49  ; [.0000.011C.0002.0E49] # THAI CHARACTER MAI THO

notice, the first weight is 0000 for both characters,
which makes me more confident that the original
authors made these rules intentionally.

Probably, Thai has different rules:
- sometimes these characters can be considered as equal,
- and sometimes - as different?

Do you know any links to national standards of Thailand about
string comparison/sorting? 

Thanks!

็Hi,

Your speculation is correct.

I'm not a linguistic but let me explain briefly about a word in Thai. A simple Thai word consists of an alphabet, a vowel, another alphabet (optional) and a tone indication character (optional).

Comparing to an English word "can", "c" is the first alphabet, "a" is a vowel and "n" is another alphabet.

But the problem is that a vowel is not necessary one character. One vowel can use 1-4 characters and the place of each character is different. It can be on the leftmost side of a word, above the alphabet, at the back of the alphabet.

The tone indication is always used above the alphabet or above the vowel character (if it's place is above the alphabet). So it can be placed at the same place where the vowel character can go.

I found a website with a good illustration of the rule. It is not complete but it should be useful for you to see the picture of multiple-character vowel.
http://www.omniglot.com/writing/thai.htm

When you type, you go from the leftmost character and if there is something above it, type it before moving to the next character.

In our case of MAITAIKHU, it always has to be used with "SARA AE". For example, a word that we discussed before "เด็ก" which consists of four characters: "SARA AE", "DO DEK", "MAITAIKHU", "KO KAI".

But "SARA AE" combined with "MAITAIKHU" is only one vowel. And to sort them alphabetically like in Thai standard dictionary, you have to consider them as one vowel.

So what Thai standard dictionary does is to re-arrange them to "DO DEK", ["SARA AE", "MAITAIKHU"], "KO KAI" in order to sort.

But another word "เด้ก" which has the tone indicator "MAI THO" has a vowel that is only one character "SARA AE". So to sort, the rearrangement should be "DO DEK", "SARA AE" "MAI THO", "KOKAI"

I cannot find any website, that is written in English nor any algorithm that is usually used to sort Thai words. There should be one, I will ask around.

Sorry this is long. I hope you understand a bit more about Thai.

Thanks,
Supiti

Oops, sorry the rearrangement of "เด้ก" should be "DO DEK", "SARA AE", "KO KAI" , "MAITHO"

The tone indicator should be compared last.

Supiti, thanks for your description and the links.
Unfortunately, I have to close your report as
not a bug. There is another link saying that ignoring
tonal marks on primary level is ok:
http://www-306.ibm.com/software/globalization/topics/thai/collation.jsp
So, tis620_thai_ci is ok, it's not buggy.

You just need a new collation, which will 
use both primary and secondary levels
for comparison, not just primary level
(like tis620_thai_ci does).

As a temporary workaround, you can use:

select * from test where word=BINARY 'เด้กน้อย';

or use tis620_bin in CREATE TABLE.