Bug #35313 Insertion of some utf-8 characters becomes garbled
Submitted: 15 Mar 2008 20:21 Modified: 2 Apr 2008 18:12
Reporter: Någon Person Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.0.45 OS:Linux (5.0.45-Debian_1ubuntu3.1)
Assigned to: CPU Architecture:Any
Tags: prime character, UTF-8

[15 Mar 2008 20:21] Någon Person
Description:
Connector versions tried: 1.0.9, 5.2 both show same problem. Please note that it's not a normal apostrophe (') or a slanted one (´)/(`) both of which work but an unusual: (′). This character is named "prime" and has code U+2032 unicode.

The test-file is saved as UTF-8 without signature (BOM-mark) in Visual Studio 2005. Collection in database is utf8_general_ci, charset is utf8 in row, database and table level and latin1 on system level. I checked up the connection charset (encoding) as well and that was utf_8 as well.

As you can see, I took the example from this page: http://bugs.mysql.com/bug.php?id=21246 but as you can see in my test-case not even Prepare() helps!

Problem is on database version: 5.0.45-Debian_1ubuntu3.1 -- but it works on localhost dev db of version 5.0.45-community-nt on Win XP Pro SP2

Test-case (C#):

[Test]
public void CanInsertUnescapedChar()
{
	using (MySqlConnection conn =
		new MySqlConnection(" the connection string goes here "))
	{
		conn.Open();
		using (MySqlCommand cmd = conn.CreateCommand())
		{
			cmd.CommandText = "INSERT INTO test(str) VALUES (?StringValue)";
			cmd.Prepare();
			cmd.Parameters.Add(new MySqlParameter("?StringValue", "O′Connor"));
			cmd.ExecuteNonQuery();
		}
	}
}

Exception thrown:

You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Connor')' at line 1
	MySql.Data.MySqlClient.MySqlException
	Message: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'Connor')' at line 1
	Source: MySql.Data
	StackTrace:
	at MySql.Data.MySqlClient.MySqlStream.OpenPacket()
	at MySql.Data.MySqlClient.NativeDriver.ReadResult(UInt64& affectedRows, Int64& lastInsertId)
	at MySql.Data.MySqlClient.MySqlDataReader.GetResultSet()
	at MySql.Data.MySqlClient.MySqlDataReader.NextResult()
	at MySql.Data.MySqlClient.MySqlCommand.ExecuteReader(CommandBehavior behavior)
	at MySql.Data.MySqlClient.MySqlCommand.ExecuteNonQuery()

How to repeat:
CREATE TABLE `logibit_cms`.`test2` (
  `str` TEXT CHARACTER SET utf8 COLLATE utf8_general_ci NOT NULL,
  PRIMARY KEY (`str`)
)
ENGINE = InnoDB
CHARACTER SET utf8 COLLATE utf8_general_ci;

Suggested fix:
No fix available for me. Works on my Windows machine of same db version for some reason.
[16 Mar 2008 19:58] Peter Laursen
I am copying the 'prime' character from your post, inserting (succesfully) to a utf8 column with a GUI editor (SQLyog) and executing 

select hex(mychar) from utf8test;

returns: 

hex(mychar)
-----------
E280B2

what should be correct according to:
http://www.fileformat.info/info/unicode/char/2032/index.htm
(and it also displays correctly in SQLyog)

Tested on two servers
Server 5.0.51a on Win Vista.
Server 4.0.20 on Linux (RHEL4)

Are you claiming that this is an issue with Linux only (or even some specific distro)?  I do not fully understand the meaning of your post!  Please explain?

It does not seem to me there is an issue with the server here.  It could be with the .NET connector however.  Or C# itself or that Linux build?

Peter
(not a MySQL person)
[17 Mar 2008 8:04] Tonci Grgin
Hi Någon and thanks for your report.

The quote you're using is Unicode quote and I'm not sure MySQL uses those at all so no need to escape them. Also, if you don't specify utf8 as your connection charset strings will be converted to the default encoding (latin1?).  That will convert these Unicode quotes to latin1 quotes which are quoted.

Mysql cl client charset UTF8:
| 27 | bg`inning-te'st-US.divisionâ?ts  | << charset=null
| 28 | bg`inning-te'st-US.divisionâ?ts  | << charset=utf8
+----+----------------------------------+

Mysql cl client charset latin1:
| 27 | bg`inning-te'st-US.division's  | << charset=null
| 28 | bg`inning-te'st-US.division's  | << charset=utf8
+----+--------------------------------+

So, please test your environment first, especially variables like "char" and SQL modes in effect.

I don't think this is a bug but simple misconfiguration between server and client.
[17 Mar 2008 9:41] Peter Laursen
Isn't it possible simply to "SET NAMES UTF8" through the .NET connector after connection?
[17 Mar 2008 10:03] Tonci Grgin
Peter, of course it is but to what avail? What if username/password are UTF8 encoded? I can think of few other cases too... So, best thing to do (for c/NET) is to add use of UTF8 in connection string and use UTF8 table. That way one will be able to see quote as he/she intended.
[17 Mar 2008 10:36] Någon Person
Hello,

Since I did the test-case the connection string was utf-8 as I stated. I tried doing a "show variables like '%char%'" and got the same as on my local database, except:

character_set_server  	latin1

which is utf-8 on my local. However, I don't see how this can affect the saving of values into a table with utf8 into a database in utf8 into a cell with utf8...?

Also, tried "show variables like '%mode%';" which didn't give my anything on the server, but at local it gave me 'sql_mode', 'STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION'.

So, can you see something here that I should fix?
[17 Mar 2008 10:43] Någon Person
"Are you claiming that this is an issue with Linux only (or even some specific distro)?  I
do not fully understand the meaning of your post!  Please explain?

It does not seem to me there is an issue with the server here.  It could be with the .NET
connector however.  Or C# itself or that Linux build?"

I'm not claiming anything, I don't know. Deductive logic tells me it's an issue with A linux build, but I can't do induction over it and say it's a BUG as in a BUG throughout linux distributions of MySQL. The meaning is that the server throws exceptions when some utf-8 characters are inserted, while other characters like ∀ or ∃ are successfully inserted but retrieved as garbled data (aka. questionmarks). So ∃c:char[utf(c) ∧ problemusing(c)]

I don't think it's with the connector, because it works varying the database, keeping the connector constant. Also, I very highly doubt it's in C# or the common language routine for the same reason...
[17 Mar 2008 10:52] Tonci Grgin
Någon, paste result of show variables like 'char%' here please.

Most usual error I encounter is double conversion. You presume you're inputting UTF8 chars while they are converted to latin1 bytes and then converted again to UTF8 to be saved in UTF8 table/field. This is what is happening, I believe.
[17 Mar 2008 10:53] Tonci Grgin
http://dev.mysql.com/doc/refman/5.1/en/charset-connection.html
[17 Mar 2008 10:54] Peter Laursen
"because it works varying the database, keeping the connector constant."  

1)
will it make a difference to execute "SET NAMES utf8" after connection?

2)
try insert 

USE the_database_you_use;
SHOW VARIABLES LIKE '%character%';

in your script (before and after SET NAMES) and retrieve the output.
[17 Mar 2008 12:41] Tonci Grgin
Reposting test case without sensitive info:
OK, so I expanded the test-case:

public void DeleteFromTest(MySqlConnection conn)
{
	using (MySqlCommand cmd = conn.CreateCommand())
	{
		cmd.CommandText = "delete from test where 1=1";
		cmd.Prepare();
		cmd.ExecuteNonQuery();
	}
}

[Test]
public void CanReadUTF()
{
	string localConn =
"Server=localhost;Database=logibit_cms;Uid=*****;Pwd=*****;";
	string remoteConn =
"Server=some_IP;Port=3306;Database=*****;Uid=*****;Pwd=*****;";

	using (MySqlConnection conn = new MySqlConnection(localConn))
		//new MySqlConnection(localConn))
	{
		conn.Open();

		DeleteFromTest(conn); // reset table.

		using (MySqlCommand cmd = conn.CreateCommand())
			printVariables(cmd);

		using (MySqlCommand cmd = conn.CreateCommand())
		{
			Console.WriteLine("Setting charset.");
			setCharset(cmd);

			Console.WriteLine("Printing variables");
			printVariables(cmd);

			Console.WriteLine("Inserting data into test.");
			cmd.CommandText = "INSERT INTO test(str) VALUES (?StringValue)";
			cmd.Prepare();
			cmd.Parameters.Add(new MySqlParameter("?StringValue", "∀ ∃"));
			cmd.ExecuteNonQuery();

			Console.WriteLine("Selecting everything from test.");
			cmd.CommandText = "SELECT * FROM test";
			cmd.Prepare();

			using (MySqlDataReader reader = cmd.ExecuteReader())
			{
				while (reader.Read())
					Console.WriteLine(reader.GetString(0));
			}
		}

		Console.WriteLine("Creating a new command, printing variables and
then selecting everything.");

		using (MySqlCommand cmd = conn.CreateCommand())
		{
			printVariables(cmd);
			cmd.CommandText = "SELECT * FROM test";
			cmd.Prepare();
			using (MySqlDataReader reader = cmd.ExecuteReader())
			{
				while (reader.Read())
					Console.WriteLine(reader.GetString(0));
			}
		}
	}
}

private void printVariables(MySqlCommand cmd)
{
	cmd.CommandText = "show variables like 'char%'";
	cmd.Prepare();

	using (MySqlDataReader reader = cmd.ExecuteReader())
	{
		while (reader.Read())
			Console.WriteLine(reader.GetString(0) + ": " +
reader.GetString(1));
	}
}

private void setCharset(MySqlCommand cmd)
{
	cmd.CommandText = "SET NAMES 'utf8' COLLATE 'utf8_general_ci';";
	cmd.Prepare();
	cmd.ExecuteNonQuery();

	cmd.CommandText = "SET CHARACTER SET utf8;";
	cmd.Prepare();
	cmd.ExecuteNonQuery();
}

output using localConn:
Console Output 
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: 
character_set_server: utf8
character_set_system: utf8
character_sets_dir: C:\Program\MySQL\MySQL Server 5.0\share\charsets\
Setting charset.
Printing variables
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: utf8
character_set_server: utf8
character_set_system: utf8
character_sets_dir: C:\Program\MySQL\MySQL Server 5.0\share\charsets\
Inserting data into test.
Selecting everything from test.
∀ ∃
Creating a new command, printing variables and then selecting
everything.
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: utf8
character_set_server: utf8
character_set_system: utf8
character_sets_dir: C:\Program\MySQL\MySQL Server 5.0\share\charsets\
∀ ∃

 ***
Using remoteConn:
Console Output 
character_set_client: latin1
character_set_connection: latin1
character_set_database: utf8
character_set_filesystem: binary
character_set_results: 
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/
Setting charset.
Printing variables
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: utf8
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/
Inserting data into test.
Selecting everything from test.
? ?
Creating a new command, printing variables and then selecting
everything.
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: utf8
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/
? ?

------------------------------------------------------------------------
[20 Mar 2008 19:57] Någon Person
So since I posted the code and imho it still shows the problem... Am I missing something in the documentation or are you still investigating?
[20 Mar 2008 21:12] Peter Laursen
As I understand Tonci's code sample it shows that if you configure server, connection and client to use the same charset for the communication (with SET NAMES statement) there is no issue!

Alternativly change server default charset to uft8.

Tonci?? .. 'not a bug' ??
[21 Mar 2008 0:39] Någon Person
"As I understand Tonci's code sample it shows that if you configure server, connection and client to use the same charset for the communication (with SET NAMES statement) there is no issue!"

Well, isn't that what this output means I am doing?

Setting charset. (see the method)
Printing variables
character_set_client: utf8
character_set_connection: utf8
character_set_database: utf8
character_set_filesystem: binary
character_set_results: utf8
character_set_server: latin1
character_set_system: utf8
character_sets_dir: /usr/share/mysql/charsets/
Inserting data into test.
Selecting everything from test.
? ?

Or how otherwise do I use SET NAMES?
[21 Mar 2008 1:02] Någon Person
Oh... You think it's his code sample.

It's my code sample, but he reposted it because I forgot to remove the passwords from the connection strings.
[21 Mar 2008 8:17] Tonci Grgin
Guys, bare with me for a while, will you? I'm swamped with work and work-related tasks right now.
[25 Mar 2008 12:57] Någon Person
I tested with CharSet=utf8; on the connection string. That worked. SET NAMES hence doesn't work and even if we're querying for the connection charset (it being utf-8 in the results) it seems that it is not really utf-8 like it says.

Cheers
[25 Mar 2008 18:00] Tonci Grgin
Någon, as I said before, this is the very essence of your problem, a mismatch in encodings:
"I tested with CharSet=utf8; on the connection string. That worked. SET NAMES hence doesn't work and even if we're querying for the connection charset (it being utf-8 in the results) it seems that it is not really utf-8 like it says."

For c/NET to function properly you *have* to put "CharSet=utf8;" in connection string as it calls "SET character_set_results = NULL" (in fact any other driver does so too) for the reasons mentioned above as the most important in charsets is field encoding and session encoding. Only if they mismatch, any conversions are done. So, "SET character_set_results = NULL" actually means "ok, the connector will discover what is used on field level".
So I actually believe that the double encoding happened, utf8 bytes are seen as latin1 chars and encoded into utf8 yet again. This is usual corruption. Or, in other words, you were treating data as utf8 without marking it as such.

Another point of interest is default latin1 behavior of libmysqlclient, which, when connecting, says "Hi, I'm latin1", using binary protocol, of course. So, connectors need to work-around this behavior too (or issue --skip-character-set-handshake but that's no guard against "SET NAMES").

I'm closing this as !Bg.