Bug #36691 | invalid characters are incorrectly accepted when given as literals | ||
---|---|---|---|
Submitted: | 13 May 2008 15:27 | Modified: | 15 May 2008 8:05 |
Reporter: | Mark Hershberger | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server | Severity: | S3 (Non-critical) |
Version: | 5.0.51a | OS: | Any |
Assigned to: | Hartmut Holzgraefe | CPU Architecture: | Any |
[13 May 2008 15:27]
Mark Hershberger
[13 May 2008 20:23]
Peter Laursen
I think they should both return an ERROR (and thus *nothing* be inserted to a utf8-encoded string column!).
[13 May 2008 20:39]
Peter Laursen
my mistake .. the literal '\222' is valid of course! But with UNHEX docs say: "Performs the inverse operation of HEX(str). That is, it interprets each pair of hexadecimal digits in the argument as a number and converts it to the character represented by the number. The resulting characters are returned as a binary string." hmmm .. a 'binary string' inserted to an encoded column type .. don't like it! For binary types (binary, varbinary, BLOB) it will work but for char/varchar/text I think the byte pattern returned should be validated to fit with the encoding -- or 'binary strings' returned should simply not be allowed to be inserted.
[14 May 2008 2:03]
Mark Hershberger
\222 is the same as x'92' which, as a single byte, is not a valid unicode char. This first cropped up when we used drupal to convert latin1 tables to utf8. The conversion was successful except where windows-codepage smart quotes and the like (0x92 in Windows Codepage 1252 is equivalent to the codepoint U+2019 RIGHT SINGLE QUOTATION MARK.) Where these invalid codepoints occurred, the data was truncated when the column was changed from latin1 to utf8. Of course, I'm more than a little disappointed that MySQL didn't error out and complain loudly about the data corruption...
[14 May 2008 7:57]
Peter Laursen
'\222' is a string constant - just like 'mark', 'peter' 'you&me', 'blahblah' and '!"#¤%&/' .. so that literal string will simply be inserted (after being encoded with the charset of the column). But inserting a 'binary string' to an encoded string type (varchar) is wrong in my opinion! So inserted unhex(something) to a char/varchar/text should raise an error - at least when the binary pattern is not valid with the encoding!
[14 May 2008 8:34]
Peter Laursen
also note that in UTF8 a single byte with the value 222 is not a valid character at all. In UTF8 the ASCII range is mapped from 0 to 127, byte values >127 only occur as part of a multibyte character. So what you are inserting is not a valid character in UTF8 .. and that is why nothing displays I think. But it should not insert at all!
[14 May 2008 8:44]
Peter Laursen
In latin1 charset however a sinlge byte with the value hex(92) is mapped to a character. It will insert without warnings and print. CREATE TEMPORARY TABLE `test` ( `id` int(11) NOT NULL auto_increment, `s` varchar(5) default NULL, PRIMARY KEY (`id`) ) ENGINE=MyISAM DEFAULT CHARSET=latin1; insert into test (s) values (unhex(92)); show warnings; -- returns empty set select * from test; /* returns id s ------ ------ 1 ’ */
[14 May 2008 15:35]
Susanne Ebrecht
Verified as described. You won't get an error message here.
[14 May 2008 15:46]
Hartmut Holzgraefe
Expected behavior ... MySQL doesn't have a "\###" escape sequence for octal presentation of special characters. See http://dev.mysql.com/doc/refman/5.1/en/string-syntax.html and http://dev.mysql.com/doc/refman/5.1/en/hexadecimal-values.html for the valid backslash sequences and the possible representations of byte values. Possible encodings for the 0x92 value in MySQL are 0x92 or x'92' which both produce the expected warning: mysql> insert into test (s) values (0x92); Query OK, 1 row affected, 1 warning (0.00 sec) mysql> show warnings; +---------+------+--------------------------------------------------------+ | Level | Code | Message | +---------+------+--------------------------------------------------------+ | Warning | 1366 | Incorrect string value: '\x92' for column 's' at row 1 | +---------+------+--------------------------------------------------------+ 1 row in set (0.00 sec) mysql> insert into test (s) values (x'92'); Query OK, 1 row affected, 1 warning (0.00 sec) mysql> show warnings; +---------+------+--------------------------------------------------------+ | Level | Code | Message | +---------+------+--------------------------------------------------------+ | Warning | 1366 | Incorrect string value: '\x92' for column 's' at row 1 | +---------+------+--------------------------------------------------------+ 1 row in set (0.00 sec) '\222' is interpreted as the string "222" though as documented on the string literals page: For all other escape sequences, backslash is ignored. That is, the escaped character is interpreted as if it was not escaped. For example, “\x” is just “x”. So it comes down to mysql> select length('\222'); +----------------+ | length('\222') | +----------------+ | 3 | +----------------+ 1 row in set (0.12 sec) mysql> select hex('\222'); +-------------+ | hex('\222') | +-------------+ | 323232 | +-------------+ 1 row in set (0.00 sec) as documented, and as '2' (0x32) is a perfectly valid UTF-8 character no warning is shown in that case. PS: you may have been confusing this which the way C and PHP handle \### sequences ...
[14 May 2008 18:37]
Mark Hershberger
I am not using the sequence "backslash two two two". In my report, I only used '\222' because I wasn't sure how to explain that what MySQL was doing with the actual character. That is, where I put \222, I meant only a single byte (0x92, \222, or ’) was being provided by the client. In this case, where a single byte (not the "backslash two two two" sequence) is provided to the server via the client, MySQL improperly inserts the byte (0x92, \222, or ’) into a UTF-8 encoded column as a character where, in fact, it is not valid UTF-8 as a single byte. Please look at the sample session I initially provided again. Just to be absolutely clear, the sample I provided would not return a length of 3, it would be one. HEX() returns 92. If there is a way to communicate exactly which character I mean without people being confused and thinking that I mean "backslash two two two", please let me know and I'll use it in the future.
[14 May 2008 20:05]
Peter Laursen
"That is, where I put \222, I meant only a single byte (0x92, \222, or ’) was being provided by the client." @Mark: I think you should tell what the client used was and exactly how you specified the INSERT statement! I also understood that you were inserted the literal '\222' (that escapes to the string constant '222'). But I still think that it is *not strict enough* that (only) a warning is produced here. It should be an ERROR. I also notice that when dumping the table the value is '' (empty string). Was that because MySQL cannot handle an invalid byte in export or because nothing was inserted? No matter what is the case is the case it should (at least) be NULL and not '' (empty string). But better reject the INSERT! What is the purpose of using encoded column types if the server does not protect data against *polution* with invalid byte sequences (applies first of all to utf* family of encodings of course)? If I do this after inserting unhex(92) (forgive my that my idnetifiers are different) Enter password: ******** Welcome to the MySQL monitor. Commands end with ; or \g. Your MySQL connection id is 13 Server version: 5.0.51b-community-nt MySQL Community Edition (GPL) Type 'help;' or '\h' for help. Type '\c' to clear the buffer. mysql> use test2; Database changed mysql> select a from tab0; +------+ | a | +------+ | | +------+ 1 row in set (0.00 sec) mysql> select hex(a) from tab0; +--------+ | hex(a) | +--------+ | | +--------+ 1 row in set (0.00 sec) mysql> .. I am unsure actually what is stored (or not stored)? It is not NULL. What then? It is not NULL but still it seems to be *nothing* .. ?? What is the actual content of that column/row combination in the table after "INSERT ... unhex(92)" ? And btw: Could some MySQL person please tell if my considerations here will be considered or if I will have to create my own report? (I frequently comment to posts by other users and most often nobody bothers to consider contributions/considerations not by the original reporter - so now I would also like to clear guidelines about this!)
[15 May 2008 2:57]
Mark Hershberger
test file to reproduce the bug.
Attachment: bug-36691.txt (text/plain), 167 bytes.
[15 May 2008 3:05]
Mark Hershberger
output of od run on bug file
Attachment: bug-output-36691.txt (text/plain), 1.35 KiB.
[15 May 2008 3:21]
Mark Hershberger
I've attached a file to reproduce this bug. Additionally, I've attached the output of "od -t cx1" when fed that file so that you can verify that you have the same file I uploaded -- that nothing was garbled on upload or download. Piping that file into mysql: > $ mysql mysql < bug-36691.txt > HEX(s) > E28099 > > $ E28099 corresponds to the codepoint U+2019 RIGHT SINGLE QUOTATION MARK, so MySQL's translation ability here from windows codepage to unicode codepoint is great, but inconsistent. Using set names seems to change the behavior.
[15 May 2008 8:05]
Susanne Ebrecht
Many thanks for your feedback. Wow, that's murphy. Which encoding does your test.txt file has? I only get question marks there. Doesn't matter, I understand your problem. My terminal use UTF: mysql> set names utf8; Query OK, 0 rows affected (0.08 sec) mysql> select unhex(92); +-----------+ | unhex(92) | +-----------+ | � | +-----------+ mysql> create table t(t text charset utf8); Query OK, 0 rows affected (0.00 sec) mysql> insert into t values (unhex(92)); Query OK, 1 row affected, 1 warning (0.01 sec) mysql> show warnings; +---------+------+--------------------------------------------------------+ | Level | Code | Message | +---------+------+--------------------------------------------------------+ | Warning | 1366 | Incorrect string value: '\x92' for column 't' at row 1 | +---------+------+--------------------------------------------------------+ I can't see the bug here. The warning is telling you that's not a string value. But that's not a reason to not store this value. Consider, sometimes you need to store non string values too at a database. One simple example: special Printer, like label printers, often use non string values for their setups, so when you want to store different label printer setups for different labels for jam glasses of your jam-fill-in-glasses-production-machine, you need this. Nobody will forbid you to store binary data that are no strings in a char/varchar/text or blob field. And you get a warning, that the sign isn't a string. That's ok in my eyes.