Bug #8317 Character set introducer in query fails to override connection character set
Submitted: 4 Feb 2005 12:56 Modified: 19 Jul 2005 23:04
Reporter: John Percival Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S2 (Serious)
Version:4.1 OS:Any (All)
Assigned to: Georg Richter

[4 Feb 2005 12:56] John Percival
Description:
Real world case: trying to insert binary data into a table. 
Test case: as below, selecting a small about of binary data

The issue appears to be that MySQL does not recognise that the column is being forced to have the character set 'latin1', so sees ¿\ as one character, leaving it to kick out an error when it comes to ''.

How to repeat:
SET CHARACTER SET 'gbk';
SELECT _latin1 '¿\'';

PHP code:
mysql_query("SET CHARACTER SET 'gbk'");
$string = "¿'";
mysql_query("SELECT _latin1 '" . mysql_real_escape_string($string) . "'");
echo mysql_error();
[5 Feb 2005 7:15] Aleksey Kishkin
E:\mysql41\bin>mysql.exe -u root -p test
Enter password: ******
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 4.1.9-nt-max

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SET CHARACTER SET 'gbk';
Query OK, 0 rows affected (0.11 sec)

mysql> SELECT _latin1 '?\'';
+----+
| ?' |
+----+
| ?' |
+----+
1 row in set (0.01 sec)

mysql>
[5 Feb 2005 7:33] Aleksey Kishkin
question mark seems to be converted to original '?' during the copy&paste but I tested it in mysql.exe and in query browser with proper character 

set character set gbk;
insert into a values(_latin1'¿\'');
select from a;
[5 Feb 2005 15:04] John Percival
This is the output I am receiving:

C:\Program Files\mysql\MySQL Server 4.1\bin>mysql -u root -p
Enter password: *********
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 674 to server version: 4.1.9-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show columns from a;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| datk  | char(10) | YES  |     | NULL    |       |
+-------+----------+------+-----+---------+-------+
1 row in set (0.00 sec)

mysql> set character set 'gbk';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into a values(_latin1'¿\'');
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near ''¿\''
)' at line 1
mysql> insert into a values('¿\'');
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near ''¿\''
)' at line 1
mysql>

I'm not sure what we're doing differently?
[7 Feb 2005 11:50] John Percival
Please could a dev reopen this bug? I believe that it is repeatable.
[7 Feb 2005 14:36] Aleksey Kishkin
reproduced this error on linux slackware  (4.1.9) and on win2003 EE.  
was not reproduced it on winxp. 

PS it happens on character set 'gbk' and ' sjis', works fine on cp1251 koi8 utf8.
[7 Feb 2005 14:44] John Percival
I think that the reason that this is happening is that ¿\ is a valid double-byte character in gbk (and other char sets) whereas ¿' is not. So when ¿' gets escaped to ¿\', it is interpreting ¿\ as one character, and then gets puzzled about the following ', which is not escaped at all. It's not possible to just use ¿' since ¿ is interpreted as a single character, which then causes problems when the parser encounters the '.
So neither of the following statements will work:
SELECT _latin1'¿\'';
SELECT _latin1'¿'';
[7 Feb 2005 15:47] Sergei Golubchik
There are two different issues here.

First - mysql_real_escape_string uses character set from mysql_real_connect.
It is not affected by mysql_query("SET CHARACTER SET").
So, your mysql_real_escape_string() escapes in latin1. Of course the result
is wrong in gbk. (it's not easy to fix, as it requires changes in the protocol -
but still we're going to fix it)

Second - _latin1 before the string does not change anything for the parser,
because the statement must be parsed first to see _latin1 introducer. That
is, the string is parsed as gbk string  - and if it is invalid gbk string, no
introducer will help. It cannot be fixed, to assign a spesific meaning to a string of characters, it must be parsed first, so charset must be known in advance.
[7 Feb 2005 15:57] John Percival
Thanks for the info. Can you suggest some kind of work-around? At the moment, the server charset is gbk (we've just been using SET CHARACTER SET to model the situation, rather than having to change my.cnf files), and we want to insert binary data into the table. For example, how would we insert ¿' into a table when the server/connection is running in gbk mode? Do we need to do something messy like CONCAT( '¿', "'" ) ?

As far as I can tell from what you say (which is what I had been fearing the answer would be), it's not possible to reliably use binary data in queries when running in gbk mode. Is that the case?
[7 Feb 2005 17:35] Sergei Golubchik
A bullet-proof method (which is used by recent mysqldump) is to use hex notation: 0x0123AB9...... and so on.

There is MySQL C API function to do it - mysql_hex_string(),
but I am not sure there's PHP counterpart for it :(
[7 Feb 2005 17:44] John Percival
There is the PHP bin2hex() function, which I think achieves a similar thing.
http://uk.php.net/manual/en/function.bin2hex.php
The main issue that we came across when considering this option is that data run through that function is approximately 25% bigger on average, which basically means that the you can only insert about 80% of max_packet at one time.
[7 Feb 2005 18:18] John Percival
To go back to your original point about that _latin1 not having any effect on the parser...surely this is an issue? Otherwise, what is the point of having the _latin1 introducer, if it doesn't affect the string that follows it to make it parsed as a latin1 string rather than a gbk string...surely once the parser has come across the introducer, it should switch itself into latin1 'mode' in order to determine the string that follows it, and then switch back out of it once the complete string has been identified. That would be how I would expect it to work and that would certainly fix the issues above.
[7 Feb 2005 20:11] Freddie Bingham
Your contention is that mysql_real_escape_string is going off of the character set that was used to connect with, being latin1 by default.  Is the assumption then if I was to connect with gbk, then this insert would function? It doesn't as far as my tests go.

Example:

c:\windows\my.ini:

[client]
port=3306
default-character-set=gbk

PHP Code:

$mysqli = mysqli_init();
$mysqli->options(MYSQLI_READ_DEFAULT_FILE, "C:\windows\my.ini");
$mysqli->real_connect('localhost', 'user', 'password', 'database');
echo $mysqli->character_set_name();

This returns:

gbk_chinese_ci

Insertion code (field is mediumblob):

$string = "'";
UPDATE field SET filedata = '" . $mysqli->real_escape_string($string) . "'

Result:

Errormessage: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near ''\''' at line 2 

This is the same error that is reported when the client character set is not set and defaults to latin_swedish_ci
[7 Feb 2005 22:13] Sergei Golubchik
John - we're working on a new parser, to replace current LALR(1) one
generated by yacc. I think it's not possible to do what you want with
the yacc parser, because of the way it works (with look-aheads) - when
it executes the code block for _latin1 (this is the moment MySQL knows
there is an introducer), parser may've already looked ahead and parsed
the next token - that is string constant.

With the new recursive-descent parser it could be possible (perhaps).

With the current parser, a string must be a valid string in the client
charset. Only after parsing it is interpreted as a string in the
charset specified by the introducer.

===

Freddie, it could be. What is the result of
bin2hex(mysql_real_escape_string(your_string))
if the connection charset is gbk ?

May be it's a bug in escaping of multi-byte characters ?
[7 Feb 2005 23:59] Freddie Bingham
bin2hex(mysql_real_escape_string(your_string))

output is:

efbbbf5c27

Are you implying that the bug exists in PHP or in MySQL?
[8 Feb 2005 7:43] Sergei Golubchik
In MySQL, in client library.

When 0xefbbbf27 is escaped in gbk mode it's wrong to put a backslash after 0xbf and before a single quote, because 0xbf is a start of a multi-byte character.
[8 Feb 2005 11:03] John Percival
Thanks for verifying the bug.

Out of interest, how should that string be escaped?

Because 0xbf27 is not a valid multi-byte character, the parser sees 0x27 (i.e. ') as the end of the string. But as you rightly point out, 0xbf5c is a valid multi-byte character, so it is counted as such, and then the parser meets the unescaped 0x27 (i.e. ') and thinks that it has reached the end of the string. So it seems to be impossible to use the literal string 0xbf27 (i.e. ¿') in queries when in gbk mode. Surely that can't be right?
[8 Feb 2005 16:49] Freddie Bingham
Just a note that the original test was done against php 5.0.3 using the 4.1.7 client library. I just updated libMySQL.dll from 4.1.9 and have the same results.

mysqli
Client API version  4.1.9
[8 Feb 2005 23:10] Jim Winstead
The issue that Freddie Bingham has raised was filed as a new bug (#8378). The original reported bug will require changes to the client protocol and possibly parser, as Sergei noted in his earlier comment.
[9 Feb 2005 13:48] John Percival
I think that this is probably quite a major issue. Take for example, the following SQL query:

SELECT * FROM user WHERE username = '¿\' OR userid = 1 -- '

That could easily be produced by:
$username="¿' OR userid = 1 -- ";
mysql_query("SELECT * FROM USER WHERE username = _latin1'" . mysql_real_escape_string($username) . "'");

What I'm saying is that this bug opens the door to SQL injection attacks, when running in GBK and some other multi-byte charsets.

I have posted this comment in bugs #8317 and #8378 as I think that it's pertinent to both: mysql_real_escape_string() needs to throw an error if it can't properly escape the string in the current charset; the parser needs to be able to use the _latin1 introducer to determine how it parses the following literal string.
[5 Apr 2005 13:25] Sergei Golubchik
protocol part of this bugreport is moved to http://bugs.mysql.com/bug.php?id=9637
[15 Apr 2005 3:31] xuefer tinys
as i said in another bug(dup of this one), adding new connection option should be much simple, something like BINARY_ESCAPE, escape/parse all data in binary mode using \ regardless of charset. it's the simplest way to "copy" data from client->server.

is there any benefit to escaping the string in non-binary way.
[12 Jul 2005 15:55] Georg Richter
PHP 5.1 beta and PHP 5.0 (latest version) provide a new function mysqli_set_charset which fixes this 
behaviour.

mysqli_set_charset uses the libmysql api function mysql_set_character_set which was added to MySQL 4.1.13 and MySQL 5.1.10.

For examples please check:
http://cvs.php.net/co.php/php-src/ext/mysqli/tests/065.phpt?r=1.3
http://www.php.net/manual/en/function.mysqli-set-charset.php
[19 Jul 2005 23:04] Paul Dubois
mysql_set_character_set() now is documented here:

http://dev.mysql.com/doc/mysql/en/mysql-set-character-set.html