Bug #13662 Prepare() truncates accented character input
Submitted: 30 Sep 2005 17:34 Modified: 3 Oct 2005 18:05
Reporter: Mark Modrall
Status: Closed
Category:Connector/Net Severity:S2 (Serious)
Version:1.0.4 OS:Microsoft Windows (Windows)
Assigned to: Reggie Burnett Target Version:

[30 Sep 2005 17:34] Mark Modrall
Description:
Another problem with the Prepare() statement in the .Net connector.  Inserting accented
latinate characters into a utf8 table with a utf8 connection, using a Prepare()d Insert
statement truncates the input at the first accented character.  

For example, "resumé" gets inserted as "resum", "élan" gets put in as "".

Really would be nice to get the performance benefit of Prepare() but it seems to break so
many things.

How to repeat:
	string createTable = "CREATE TABLE IF NOT EXISTS prepBug (\n"+
	"id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,\n" +
	"input TEXT NOT NULL\n" +
	") CHARACTER SET UTF8 COLLATE utf8_bin;";

	string insertTable = "Insert into prepBug (input) " +
	"VALUES (?input) ON DUPLICATE KEY UPDATE " +
	"id=LAST_INSERT_ID(id);";
			
	string [] input = { "irache martínez@yahoo.es aol.com",
			"ç ºäúµå á ù î",
			"ç ñ ì á è b f",
			"àøñðì",
			"á ã öá" };
		
	MySqlConnection conn = new MySqlConnection ("server=x;user
id=y;password=z;database=a;pooling=false;charset=utf8" );
	conn.Open();
			
	MySqlCommand createCmd = conn.CreateCommand(), insBug = conn.CreateCommand();
							
	createCmd.CommandText = createTable;
	createCmd.ExecuteNonQuery();

	insBug.CommandText = insertTable;
	insBug.Parameters.Add (new MySqlParameter("?input", ""));
	insBug.Prepare();

	int records = 0;
	foreach (string inp in input)
	{
		insBug.Parameters["input"].Value = inp;
		int rawInsert = insBug.ExecuteNonQuery();
		records++;
	}

// Run this and do a select * from prepBug; you'll see that all the data in the table
// is truncated at the first accented character.  Comment out the Prepare() and
// the data goes in okay.

Suggested fix:
stop using Prepare() unfortunately.
[2 Oct 2005 10:13] Valeriy Kravchuk
Thank you for a bug report. What version of MySQL server do you use?
[3 Oct 2005 16:09] Mark Modrall
We're on 4.1.12 server on an intel linux box.
[3 Oct 2005 18:05] Reggie Burnett
Thank you for taking the time to report a problem.  Unfortunately
you are not using a current version of the product your reported a
problem with -- the problem might already be fixed. Please download
a new version from http://www.mysql.com/downloads/

If you are able to reproduce the bug with one of the latest versions,
please change the version on this bug report to the version you
tested and change the status back to "Open".  Again, thank you for
your continued support of MySQL.

Additional info:

Mark

I confirmed this was broken in 1.0.4 and is fixed now in 1.0.6.  One thing you need to be
aware of is some quirkiness with collation.  In your example you are  using utf8_bin. 
THis shouldn't cause the server to return the data to the connector as a binary blob but
with current versions of 4.1 and 5.0 it does.  The connectors do their best to return the
right types but sometimes we can't.  In your example, the columns will be returned from an
executescalar as byte[].  Changing the collation to be not binary fixes it.  You can
always do a reader.GetString() or convert to string yourself.  

I am about to test some code that will handle this particular case but I just wanted  you
to be aware of the situation in general.
[3 Oct 2005 18:43] Mark Modrall
Hi Reggie...

Thanks for the update.  I was unclear on one thing, however.  You said that the problem is
fixed in 1.0.6 and then had the note about what the binary collation does to fetches.  Am
I correct in understanding these as two separate points?  The first point, on which the
bug was filed, has to do with data being shipped up to the db, and the second point being
the representation of the binary collation when data is fetched back out.  Am I
understanding you correctly?

I switched to using the binary collation because the behavior of regular utf8 was kind of
unpredictable.  If you created an index on a text column in utf8, there is some kind of
implicit normalization going on that was, at least to me, inconsistent.  In terms of how
the index was constructed, it seemed that diacriticals were all normalized out so resumé
was the same as resume for the index.  But that normalization wasn't applied to the data
in the column, so whichever presentation got inserted first was the one that stuck.

For example if you have a UNIQUE index on the text column and you insert resumé, then you
try to insert resume, it will say it's a duplicate key because the index has stripped the
diacriticals.  If you fetch that column from the database, though, it comes back as
resumé (the diacritical not stripped) because that was the first one inserted.  Not being
able to get a consistent answer, I switched to using utf8_bin so that the normalization on
the index wouldn't happen.

So far anyway getting the results back as a byte[] hasn't really been that hard to work
with.  So if I understood your points correctly, I can upgrade to 1.0.6 and get this
truncation problem addressed while still  keeping my utf8_bin for consistency, right?

As an aside, how about but #12245?  that was marked closed but I haven't seen it in any of
the release notes)

Thanks
-Mark
[3 Oct 2005 19:12] Reggie Burnett
Mark

Correct.  There are two issues in my post.  There was a bug in 1.0.4 (which is now fixed).
 But I also wanted to bring the collation thing to your attention.  It can be confusing to
select a text col and get a byte[] back.