Bug #9406 Arabic letter (Feh, U+0641) gets corrupted when entered in a table
Submitted: 25 Mar 2005 21:06 Modified: 25 Apr 2005 18:07
Reporter: Martine Petrod Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1.7 & 4.1.10a OS:Windows (Windows XP)
Assigned to: CPU Architecture:Any

[25 Mar 2005 21:06] Martine Petrod
Description:
dummy-line-for-utf8-textfile-start-character.

-- This is a script documenting a strange problem I encountered with Arabic:
-- I am working on PC / Windows XP with MySQL 4.1.10a - just updated from 4.1.7:
-- ONE letter (Feh, Unicode 0641) seems to get corrupted when entered in a table, 
-- while all the other Arabic letters I used were OK.

use test;
tee outfile.txt;

drop table if exists ALetters;

create table ALetters (
  letter VARCHAR(2) CHARACTER SET UTF8, 
  name VARCHAR(5) CHARACTER SET UTF8);

-- entering 3 different letters for comparison:

INSERT INTO ALetters VALUES ('ب','ba'),('ف','fa'),('ك','kaf');
SELECT * FROM ALetters;

-- seems OK so far, BUT the letter 'fa' is incorrectly rendered in the table (column letter),
-- It comes out as a ? in the outfile.txt (if the textfile is encoded as utf8), and
-- on my html page using perl/CGI/DBI it is displayed as another letter (Tehe: ٿ, Unicode 067F)? 
-- But the letter comes out correctly when using the following statements:

select hex('ف');
select unhex('d981');
select unhex(hex('ف'));

-- select unhex(hex('ب'));
-- select unhex(hex('ف'));
-- select unhex(hex('ك'));

-- BUT when using the same statement with values from the table,
-- the letter 'fa' is of course corrupted as before:

INSERT INTO ALetters select unhex(hex(letter)), name from Aletters;
SELECT * FROM ALetters;

-- WHY is the letter 'fa' corrupted? How do I put it right?

How to repeat:
-- use following source script from utf8 encoded text file:

use test;
tee outfile.txt;

drop table if exists ALetters;

create table ALetters (
  letter VARCHAR(2) CHARACTER SET UTF8, 
  name VARCHAR(5) CHARACTER SET UTF8);

INSERT INTO ALetters VALUES ('ب','ba'),('ف','fa'),('ك','kaf');
SELECT * FROM ALetters;

select hex('ف');
select unhex('d981');
select unhex(hex('ف'));

INSERT INTO ALetters select unhex(hex(letter)), name from Aletters;
SELECT * FROM ALetters;

Suggested fix:
?
[25 Mar 2005 21:13] Martine Petrod
Script

Attachment: TestA.sql (text/plain), 1.44 KiB.

[25 Mar 2005 21:13] Martine Petrod
Output file

Attachment: outfile.txt (application/octet-stream, text), 1.22 KiB.

[25 Mar 2005 21:16] Martine Petrod
Original source script and output files are attached as utf8 encoded text files in order to preserve the Arabic letters.
[25 Mar 2005 21:18] Martine Petrod
Category: also MyISAM Table Handler
[26 Mar 2005 2:27] Heikki Tuuri
Hi!

My guess is that you have set the client character set or the mysql client character set wrong in your my.cnf. By default, the mysql interactive SQL client uses the MS-DOS character set.

Please test with different client charsets in my.cnf.

Regards,

Heikki
[26 Mar 2005 10:56] Martine Petrod
Hi!
Thank you very much - 
You were right and I just had to force utf8 as client character set in my.ini as described at the end of 5.8.1. to make it work.
By the way I'm glad to see that bug #7571 (The output of the STATUS (\s) command in mysql had the values for the server and client character sets reversed) is being fixed: that had me confused for a while...
[18 Apr 2005 20:29] Allen Jensen
Sounds like bug ID 8730.  Have seen a couple of these in the database.

Still seeing problem in 4.1.10

Unicode character U+30CD converted to UTF8 is (hex) E3 83 8D
Store this in a utf8 column (load table infile).
Select or any other means of retrevial shows corrupted data.

%_set_% all defaulted - show as latin1 except system which
shows as utf8.  Did a SET NAMES 'utf8'; just before the load.