MySQL Bugs: #8943: utf8 character set error

Bug #8943	utf8 character set error
Submitted:	4 Mar 2005 9:19	Modified:	9 Mar 2005 8:31
Reporter:	Peter Chen	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S3 (Non-critical)
Version:	4.1.7	OS:	Windows (Win 2000 Pro)
Assigned to:	Alexander Barkov	CPU Architecture:	Any

Description:
In my InnoDB table, I have a CHAR(16) column which is used to save password. When I tried to save a character "0x9d" into this column. It converted to "0x3f".
My table is using utf8 character set. I tried all the characters and I found that four other characters - "0x81","0x8d","0x8f","0x90" would be converted to "0x3f". But when I converted the table to latin1 character set, it worked very well. Below is my basic situation:

OS: Win 2000 Pro
Cluster: 4.1.7
engine: InnoDB

How to repeat:
mysql> Create table test (col1 char(16));
mysql> INSERT INTO test VALUES (CHAR(129,141,143,144,157,129,141,143,144,157,129,141,143,144,157,129));

Suggested fix:
Tell me why

my operation steps

Attachment: MySQL.jpg (image/pjpeg, text), 55.44 KiB.

Peter,

1-byte character codes where the highest bit is 1 are not valid UTF-8 character codes. You should use the appropriate 2-byte UTF-8 code for such characters.

Regards,

Heikki

Then why do they all work except for the 4 he mentions? You can see in his test string that all of the characters are 1-byte highest bit set kind, but he only has trouble with 4 of them. Me and others are having similar problems, check bugs #8949, 8730 and 8672.

MySQL is eating those chars up.

I tested this with 4.1.11-bk. The MySQL "latin1" is actually Windows Code Page 1252, which makes use of some code positions that are undefined in ISO 8859-1. To my knowledge, ISO 8859-1 leaves the code positions 0x80..0x9f undefined or reserved for control characters.

To my knowledge, Unicode is a superset of latin1, meaning that Latin1 codes 0x00..0x7f and 0xa0..0xff correspond to the codes U+0000..U+007f and U+00a0..U+00ff. I'm not sure, but I would tend to believe that Unicode leaves U+0080..U+00bf reserved for control characters.

In my tests, the CHAR() function seemed to be interpreted as MySQL latin1 character codes. For example, CHAR(244,241,255) corresponds to F4F1FF in a latin1 column, and C3B4C3B1C3BF in a UTF-8 encoded Unicode column.

It seems that the high-order bits of CHAR() arguments are ignored: char(288,289) will not correspond to U+0120 and U+0121 but " !". That could be viewed as a bug of MySQL, or as a feature request. That's why I'm assigning this bug to our character set expert Alexander Barkov.

This is not a bug. mysql.exe is a DOS-alike program which uses
cp850 character set, not cp1251. You should run "SET NAMES cp850;"
after connect, or start mysql with:

mysql.exe --default-character-set=cp850 -uroot -p test

or  write default-character-set=850 into your my.cnf file in [mysql]
or [client] sections.

Bar,
are you saying that the behavior of the CHAR() function depends on the character_set_% variables?
I think that it should be more clearly documented what CHAR() does with arguments greater than 126.
I also think that CHAR(288) in UTF-8 context (insert into UTF-8 column values(char(288)) and mysql --default-character-set=utf8) should mean 0xc4 0xa0 and not 0x01 0x20, as it appears to be (in 5.0.3-bk).