MySQL Bugs: #28617: Gibberish when reading utf8 TEXT column through ADO

Bug #28617	Gibberish when reading utf8 TEXT column through ADO
Submitted:	23 May 2007 12:05	Modified:	10 Jan 2008 8:59
Reporter:	daniel gman	Email Updates:
Status:	Closed	Impact on me:	None
Category:	Connector / ODBC	Severity:	S3 (Non-critical)
Version:	5.1, 3.51	OS:	Windows (xp)
Assigned to:	Jess Balint	CPU Architecture:	Any
Tags:	utf8 text

Description:
I’m using MySQL V5.0 and ODBC V5.00.11, I connect to the DB from ASP (ADO).
When I set a Column as VARCHAR and store Hebrew characters all work well and I can read them properly, but if the column set to TEXT then the Hebrew characters are store properly in the DB but when I try to read them back I get gibberish. When I use the GUI tool I can see that in the DB all the characters are stored correctly.

How to repeat:
Create the next table and read the data from text1,text2.
-- Server version	5.0.24a-community-nt
/*!40101 SET @OLD_CHARACTER_SET_CLIENT=@@CHARACTER_SET_CLIENT */;
/*!40101 SET @OLD_CHARACTER_SET_RESULTS=@@CHARACTER_SET_RESULTS */;
/*!40101 SET @OLD_COLLATION_CONNECTION=@@COLLATION_CONNECTION */;
/*!40101 SET NAMES utf8 */;
/*!40014 SET @OLD_UNIQUE_CHECKS=@@UNIQUE_CHECKS, UNIQUE_CHECKS=0 */;
/*!40014 SET @OLD_FOREIGN_KEY_CHECKS=@@FOREIGN_KEY_CHECKS, FOREIGN_KEY_CHECKS=0 */;
/*!40101 SET @OLD_SQL_MODE=@@SQL_MODE, SQL_MODE='NO_AUTO_VALUE_ON_ZERO' */;
--
-- Create schema test102
--

CREATE DATABASE IF NOT EXISTS test102;
USE test102;
DROP TABLE IF EXISTS `test_text`;
CREATE TABLE `test_text` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `text1` varchar(200) NOT NULL,
  `text2` text NOT NULL,
  PRIMARY KEY  (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/*!40000 ALTER TABLE `test_text` DISABLE KEYS */;
INSERT INTO `test_text` (`id`,`text1`,`text2`) VALUES 
 (1,'שלום','שלום');
/*!40000 ALTER TABLE `test_text` ENABLE KEYS */;
/*!40101 SET SQL_MODE=@OLD_SQL_MODE */;
/*!40014 SET FOREIGN_KEY_CHECKS=@OLD_FOREIGN_KEY_CHECKS */;
/*!40014 SET UNIQUE_CHECKS=@OLD_UNIQUE_CHECKS */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;
/*!40101 SET CHARACTER_SET_RESULTS=@OLD_CHARACTER_SET_RESULTS */;
/*!40101 SET COLLATION_CONNECTION=@OLD_COLLATION_CONNECTION */;
/*!40101 SET CHARACTER_SET_CLIENT=@OLD_CHARACTER_SET_CLIENT */;

------------
This is the ASP cose:
<%@ Language=VBScript codepage="65001"%>
<%
set cn = server.CreateObject("ADODB.Connection")
set rs = Server.CreateObject("ADODB.RecordSet")
cn.Open "Driver={MySQL Connector/ODBC v5}; Server=localhost; CharSet=utf8; Option=0; Socket=; Stmt=; Database=test102; Uid=myUid; Pwd=myPass;" 

SQL = "select text1, text2 from test_text where id = 1"
rs.Open SQL,cn,3,1
Response.Write rs("text1")&"<BR>"
Response.Write rs("text2")&"<BR>"
rs.close
cn.close
%>

Hi Daniel and thanks for your report.
What does your server log and/or variables say about character_set_results? What do you get in cl client if you start it with -T option(mysql -uroot -p -T test)?

Please attach your my.ini/cnf file too.

No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".

the my.ini file

Attachment: my.ini (application/octet-stream, text), 9.07 KiB.

Sorry for the delay.
When I ran the select command form the command line (using “mysql -uroot -p -T test”) I get this:
Field   1:  `id`
Catalog:    `def`
Database:   `tog`
Table:      `test_text`
Org_table:  `test_text`
Type:       LONG
Collation:  binary (63)
Length:     10
Max_length: 1
Decimals:   0
Flags:      NOT_NULL PRI_KEY UNSIGNED AUTO_INCREMENT NUM PART_KEY

Field   2:  `text1`
Catalog:    `def`
Database:   `tog`
Table:      `test_text`
Org_table:  `test_text`
Type:       VAR_STRING
Collation:  utf8_general_ci (33)
Length:     600
Max_length: 8
Decimals:   0
Flags:      NOT_NULL NO_DEFAULT_VALUE

Field   3:  `text2`
Catalog:    `def`
Database:   `tog`
Table:      `test_text`
Org_table:  `test_text`
Type:       BLOB
Collation:  utf8_general_ci (33)
Length:     196605
Max_length: 8
Decimals:   0
Flags:      NOT_NULL BLOB NO_DEFAULT_VALUE

+----+----------+----------+
| id | text1    | text2    |
+----+----------+----------+
|  1 | ╫⌐╫£╫ץ╫¥     | ╫⌐╫£╫ץ╫¥     |
+----+----------+----------+
1 row in set (0.00 sec)

The characters in the command line windows look the same but when I copy them to notepad I see different results. (as you can see above)

Daniel, can you please try MyODBC 3.51.19 and inform me of result (3.51 branch is stable while 5.0 is still beta) ?

I use to work with 3.51, but when I changed the DB to UTF8 I replaced it to 5.0

3.51 won’t show any Hebrew characters in UTF8 only question mark

Hi Daniel,

typically a question mark output is an character mix.

Please check the following:

which character set is used for inserting the data?
Is it really UTF8?
When you load the data from a file, look if the file encoding is UTF8.
Whey you load the data from a dump, look that there is nothing else written than UTF8.

Is all what you use really set to UTF8?

Regards,

Susanne

Hi Daniel again,

Sorry, I forgot to ask: which version did you try?
Are you really using version 3.51.19 for your tests?

Regards,

Susanne

I’m using “mysql-connector-odbc-3.51.19-win32.zip”
I don’t think it’s a character mix, if I use the same ASP script using ODBC 5.0 I can see the characters in Hebrew.

Hi Daniel,

sorry, that I have so many questions, but to verify encoding problems is always a difficult job.

You told, that the signs are different, when you copy them into notepad. Thats looks like the notepad use another encoding than UTF8. As far as I know, usually, notepad is using the system encoding.

To figure out, where the exactly is, we have to look from the beginning.

You made a:  INSERT INTO `test_text` (`id`,`text1`,`text2`) VALUES 
 (1,'שלום','שלום');

by using a script. Or do you type this in the mysql CLI?

When you type this in the mysql CLI, please make this:
mysql> show variables like '%char%';

You'll get an output with all variables that contains 'char' in the name.
Important is the variable: character_set_client
Please look, if this variable is set to UTF8.

If not, type: set character_set_client=utf8;
and try the insert again.

If you use a script, please look if the script really is written in utf8.
You can check this very easy with an editor. You can switch the encoding in most editors. Only with utf8 you should be able to read your Hebrew signs in the right way.

If all this is ok, than try this:
INSERT INTO `test_text` (`id`,`text1`,`text2`) VALUES 
 (1,_utf8'שלום','שלום');

Also please try to make a set names.
Try besides set names utf8 also set names hebrew.

My last question is: which version of MySQL Server do you use?

Regards,

Susanne

I’m using server version:  5.0.24a-community-nt

As for the notepad, I only use it to send the text to you. Usually I use the browser as my viewer and there I see that the VARCHAR show Hebrew and the TEXT won’t. 

I’m using the ASP to insert the data. The file is UTF8 (I double check it using notepad "save as").

I try to use the “_utf8” in the script but no help. As long as I use the ODBC 5.0 I get the same results.

I know remember why I stop using ODBC 3.51. When I use it to insert Hebrew to the DB it won’t store the character correctly. So if I read the same data using ODBC 3.51 I will see Hebrew, but if I will read it from other drivers (I have a java application on the server using JDBC and I use MySQL Query Browser) I get gibberish. 
I can prove it easily using the GUI “MySQL Query Browser” to insert Hebrew. now the java application will show Hebrew, ODBC 5.0 will show Hebrew, but ODBC 3.51 will show gibberish. 
I guess that if 3 different application (JDBC, ODBC 5.0,  Query Browser) can see correctly and only the ODBC 3.51 get it wrong… then he got the problem.

Hi Daniel,

many thanks for your time and helping us to verify the problem.

I just see that your Bug is a duplicate of Bug #13465.

If you diagree with me, please let me know, and I will change the status back.

Regards,

Susanne

I don’t think it is the same bug. 
I can see Hebrew (and no question marks) when I use ODBC 5.00.11. 
Only if I set a Column as TEXT I will get Gibberish (no question marks) instead of Hebrew. 
So know instead of using TEXT in all the big text fields I use VARCHAR(21000). 

I know that the problem is in the reading function of the ODBC cause if I use the MySQL Query Browser I can see the Hebrew I stored using the ODBC. So the ODBC 5.0 can store Hebrew correctly in the DB but can’t restore it from there.

Bug 13465 may explain why ODBC 3.51 is not working well with Hebrew, especially when using UTF8. But I already know that and I found that ODBC 5.0 is doing better job handling Hebrew characters on UTF8. The only problem I reported is that the ODBC 5.0 has problem handling Hebrew characters on restore from datatype TEXT.

Hi Daniel,

you missunderstand me. We are not talking about ODBC 5, we are only talking about ODBC 3.51.19.

Regards,

Susanne

I guess we misunderstand each other. I open the bug for v5.0 
I know the problems on v3.51 and read an answer for a different bug (don’t remember the number) that for good support on UTF8 it recommendation to move to version 5.0. I did that and v5.0 is better. Now I hope we can find a solution for this TEXT datatype problem And I will feel like the king of the world. Working with the best DB on market that have full Multilanguage support.

Hi Daniel,

because version 5.0 is still beta we will fix your problems first in version 3.51.
And we will hope, that you can use version 3.51 soon without having problems.

Regards,

Susanne

Daniel, please check your results and client charsets... What I believe is happening in this:
 - You insist on putting UTF8 characters into database
 - Characters get converted on client to latin1
 - Characters get converted again on server as table is UTF8
the result is garbage, even when read with MySQL QB for example. So, let's make a short test, what do other clients say about stuff you saved in table? Does MySQL QB show Hebrew characters?

Now, fetching garbage afterwards produces garbage, that's expected.

OK, V3.51 is a different ball game.

I know that V3.51 not only stored the Hebrew wrong, if I try to read it from JDBC, MySQL QB (or ODBC 5.0) I will get gibberish.
I know that 3.51 can’t read the Hebrew data correctly from the DB. If I store the data using MySQL QB, JDBC (or ODBC 5.0) the V3.51 will show question marks (all the other driver show Hebrew) 
Last knowledge that I can share is that if I store the Hebrew using V3.51 I can read it as Hebrew only from V3.51

Looking at this behavior I can guess that the solution is hiding in the code of V5.0 since there this problem is not exists. I can store and read Hebrew to\in the DB and all other drivers can see it correctly.
 
As for script problems… I’m using ASP. the file is saved as UTF8, codepage set to UTF8 and when I to the test for V5.0 and V3.51 I use the same ASP file. Only one line is change (the one that set the ADODB.Connection to use the ODBC)

P.s.
It would be my honor to assist you with the V5.0 bug as well.

Daniel, thanks for your report and your kind words.

Verified as described on both accounts: 
 - v5 having problems with TEXT
 - v3 having problems with Hebrew

I believe it is likely we'll see the fix for 3.51 branch first so keep an eye on new releases.

The bug is in MyODBC version 5.1 too.
I can't get back Hebrew signs.

This needs to be re-tested after Bug#32161 is fixed. If that causes ADO to bind SQL_C_WCHAR for UTF-8 columns, it should be fixed.
-----------
Susanne,

Your test is incorrect. It needs to specify a character set that supports Hebrew in the connection string or bind the column as SQL_C_WCHAR.

haha, thanks for Danil,

I got the same problem with CJK uses.
 - v5.1 having problems with TEXT in utf8(Hebrew/CJK)
 - v3.51 having problems with utf8(Hebrew/CJK)

varchar is OK with v5.1
please fix 5.1
or provid utf8 support in 3.51

thanks at all...

Unicode works fine through ADO, this patch fixes TEXT handling. For ASP you need the following in your ASP (db data will be converted to UTF8 for browser output):

      Response.CodePage = 65001
      Response.CharSet = "utf-8"

fix text handling for SQL_C_WCHAR

Attachment: bug28617.diff (application/octet-stream, text), 1.44 KiB.

patch approved

Committed to svn r973. Will be included in 5.1.2 release.

A note has been added tot he 5.1.2 changelog: 

Reading a TEXT column that had been used to store UTF8 data would result in the wrong information being returned during a query.

I've also added an entry to the FAQ covering the configuration for UTF8 in ASP documents.