Bug #13662 Prepare() truncates accented character input
Submitted: 30 Sep 2005 15:34 Modified: 3 Oct 2005 16:05
Reporter: Mark Modrall Email Updates:
Status: Closed Impact on me:
None 
Category:Connector / NET Severity:S2 (Serious)
Version:1.0.4 OS:Windows (Windows)
Assigned to: Reggie Burnett CPU Architecture:Any

[30 Sep 2005 15:34] Mark Modrall
Description:
Another problem with the Prepare() statement in the .Net connector.  Inserting accented latinate characters into a utf8 table with a utf8 connection, using a Prepare()d Insert statement truncates the input at the first accented character.  

For example, "resumé" gets inserted as "resum", "élan" gets put in as "".

Really would be nice to get the performance benefit of Prepare() but it seems to break so many things.

How to repeat:
	string createTable = "CREATE TABLE IF NOT EXISTS prepBug (\n"+
	"id INT UNSIGNED NOT NULL PRIMARY KEY AUTO_INCREMENT,\n" +
	"input TEXT NOT NULL\n" +
	") CHARACTER SET UTF8 COLLATE utf8_bin;";

	string insertTable = "Insert into prepBug (input) " +
	"VALUES (?input) ON DUPLICATE KEY UPDATE " +
	"id=LAST_INSERT_ID(id);";
			
	string [] input = { "irache martínez@yahoo.es aol.com",
			"ç ºäúµå á ù î",
			"ç ñ ì á è b f",
			"àøñðì",
			"á ã öá" };
		
	MySqlConnection conn = new MySqlConnection ("server=x;user id=y;password=z;database=a;pooling=false;charset=utf8" );
	conn.Open();
			
	MySqlCommand createCmd = conn.CreateCommand(), insBug = conn.CreateCommand();
							
	createCmd.CommandText = createTable;
	createCmd.ExecuteNonQuery();

	insBug.CommandText = insertTable;
	insBug.Parameters.Add (new MySqlParameter("?input", ""));
	insBug.Prepare();

	int records = 0;
	foreach (string inp in input)
	{
		insBug.Parameters["input"].Value = inp;
		int rawInsert = insBug.ExecuteNonQuery();
		records++;
	}

// Run this and do a select * from prepBug; you'll see that all the data in the table
// is truncated at the first accented character.  Comment out the Prepare() and
// the data goes in okay.

Suggested fix:
stop using Prepare() unfortunately.
[2 Oct 2005 8:13] Valeriy Kravchuk
Thank you for a bug report. What version of MySQL server do you use?
[3 Oct 2005 14:09] Mark Modrall
We're on 4.1.12 server on an intel linux box.
[3 Oct 2005 16:05] Reggie Burnett
Thank you for taking the time to report a problem.  Unfortunately
you are not using a current version of the product your reported a
problem with -- the problem might already be fixed. Please download
a new version from http://www.mysql.com/downloads/

If you are able to reproduce the bug with one of the latest versions,
please change the version on this bug report to the version you
tested and change the status back to "Open".  Again, thank you for
your continued support of MySQL.

Additional info:

Mark

I confirmed this was broken in 1.0.4 and is fixed now in 1.0.6.  One thing you need to be aware of is some quirkiness with collation.  In your example you are  using utf8_bin.  THis shouldn't cause the server to return the data to the connector as a binary blob but with current versions of 4.1 and 5.0 it does.  The connectors do their best to return the right types but sometimes we can't.  In your example, the columns will be returned from an executescalar as byte[].  Changing the collation to be not binary fixes it.  You can always do a reader.GetString() or convert to string yourself.  

I am about to test some code that will handle this particular case but I just wanted  you to be aware of the situation in general.
[3 Oct 2005 16:43] Mark Modrall
Hi Reggie...

Thanks for the update.  I was unclear on one thing, however.  You said that the problem is fixed in 1.0.6 and then had the note about what the binary collation does to fetches.  Am I correct in understanding these as two separate points?  The first point, on which the bug was filed, has to do with data being shipped up to the db, and the second point being the representation of the binary collation when data is fetched back out.  Am I understanding you correctly?

I switched to using the binary collation because the behavior of regular utf8 was kind of unpredictable.  If you created an index on a text column in utf8, there is some kind of implicit normalization going on that was, at least to me, inconsistent.  In terms of how the index was constructed, it seemed that diacriticals were all normalized out so resumé was the same as resume for the index.  But that normalization wasn't applied to the data in the column, so whichever presentation got inserted first was the one that stuck.

For example if you have a UNIQUE index on the text column and you insert resumé, then you try to insert resume, it will say it's a duplicate key because the index has stripped the diacriticals.  If you fetch that column from the database, though, it comes back as resumé (the diacritical not stripped) because that was the first one inserted.  Not being able to get a consistent answer, I switched to using utf8_bin so that the normalization on the index wouldn't happen.

So far anyway getting the results back as a byte[] hasn't really been that hard to work with.  So if I understood your points correctly, I can upgrade to 1.0.6 and get this truncation problem addressed while still  keeping my utf8_bin for consistency, right?

As an aside, how about but #12245?  that was marked closed but I haven't seen it in any of the release notes)

Thanks
-Mark
[3 Oct 2005 17:12] Reggie Burnett
Mark

Correct.  There are two issues in my post.  There was a bug in 1.0.4 (which is now fixed).  But I also wanted to bring the collation thing to your attention.  It can be confusing to select a text col and get a byte[] back.