Bug #5605 UTF-8 encoding of Russian, Vietnamese incomplete
Submitted: 16 Sep 2004 9:52 Modified: 16 Oct 2004 20:05
Reporter: Jan Uetz
Status: No Feedback
Category:MySQL Server Severity:S1 (Critical)
Version:4.1.4-gamma-nt OS:Linux (Linux, Windows XP)
Assigned to: CPU Architecture:Any

[16 Sep 2004 9:52] Jan Uetz
Description:
When entering Cyrillic (Russian) or Vietnamese text through a UTF-8 encoded form, a few characters in both of these languages (these are only examples) are not interpreted correctly by MySQL, even though both the database and the field the data is stored in are encoded in UTF-8. If the field is in Latin1, the text is displayed properly on the web page (also in UTF-8), but the fulltext search in MySQL won't find anything. If the field is encoded in UTF-8 (collation utf8_general_ci), the fulltext search works beautifully, but, as mentioned, several characters are displayed as a square.
We have multi-site, multi-language, multi-character-set content management systems that need to support different character sets in the same field and, most of all, in full-text search!

How to repeat:
Enter Russian or Vietnamese text using a UTF-8 encoded form into a MySQL table encoded in UTF-8. Then read that record through a web page also encoded in UTF-8. Certain characters will be displayed as a square.

Suggested fix:
Possibly additional collations need to be made for utf8-russian and other languages - or the implementation of utf8-general needs to be completed!
[16 Sep 2004 9:53] Jan Uetz
Example of erroneous character set display in the browser (and obviously in the table)

Attachment: characterset_errors.gif (image/gif, text), 34.61 KiB.

[16 Sep 2004 9:59] Jan Uetz
Here you see the code of the same text. Highlighted are some of the characters that are being displayed as squares (Ñ?).

Attachment: examples_in_table.gif (image/gif, text), 22.34 KiB.

[16 Sep 2004 10:02] Jan Uetz
This is what the same text should look like! This way you can compare the faulty characters.

Attachment: Correct_Text_Example.gif (image/gif, text), 10.99 KiB.

[16 Sep 2004 10:03] Jan Uetz
Here's the same text in a Word .doc file for testing purposes.

Attachment: testdoc.doc (application/octet-stream, text), 21.00 KiB.

[16 Sep 2004 20:05] MySQL Verification Team
Hi Jan,

Thank you for the report, but you didn't mention which client character set and connection character set you are using. Could you provide this information?
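
The connection character set matters because MySQL 4.1 converts incoming text from the client/connection character set to the column's character set. The sketch below illustrates one possible explanation for the "Ñ?" pattern, under the unconfirmed assumption that the connection was left at latin1 (which MySQL treats as cp1252) while the form sent UTF-8. It uses Python's cp1252 codec only as a stand-in; how the 4.1.4 server actually handled these bytes is not established in this report. The function names are hypothetical.

```python
# Sketch: UTF-8 bytes misread as cp1252 (the encoding MySQL calls latin1).
# Assumption (not confirmed in this report): the connection character set
# was latin1 while the web form submitted UTF-8 bytes.

def misread_as_cp1252(text: str) -> str:
    """Encode text as UTF-8, then decode those bytes as cp1252.

    Bytes that Python's cp1252 codec leaves undefined become U+FFFD,
    standing in for the square or '?' seen in the screenshots.
    """
    return text.encode("utf-8").decode("cp1252", errors="replace")


def affected_letters(alphabet: str) -> list[str]:
    """Return the letters whose UTF-8 encoding contains an undefined cp1252 byte."""
    return [ch for ch in alphabet if "\ufffd" in misread_as_cp1252(ch)]


# Basic Russian alphabet, U+0410 (А) through U+044F (я).
RUSSIAN = "".join(chr(cp) for cp in range(0x0410, 0x0450))

if __name__ == "__main__":
    print(repr(misread_as_cp1252("я")))  # 0xD1 0x8F: 'Ñ' plus an undefined byte
    print(affected_letters(RUSSIAN))     # only a handful of letters are hit
```

If this is the cause, only letters whose second UTF-8 byte is 0x81, 0x8D, 0x8F, 0x90 or 0x9D would be damaged, which would match "a few characters" being shown as squares. The session settings can be inspected with `SHOW VARIABLES LIKE 'character_set%'`, and issuing `SET NAMES utf8` after connecting is the usual way to declare a UTF-8 client.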
[14 Feb 2005 22:54] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".