Bug #8317 Character set introducer in query fails to override connection character set
Submitted: 4 Feb 2005 13:56 Modified: 20 Jul 2005 1:04
Reporter: John Percival
Status: Closed
Category:Server Severity:S2 (Serious)
Version:4.1 OS:Any (All)
Assigned to: Georg Richter Target Version:

[4 Feb 2005 13:56] John Percival
Description:
Real world case: trying to insert binary data into a table. 
Test case: as below, selecting a small about of binary data

The issue appears to be that MySQL does not recognise that the column is being forced to
have the character set 'latin1', so sees ¿\ as one character, leaving it to kick out an
error when it comes to ''.

How to repeat:
SET CHARACTER SET 'gbk';
SELECT _latin1 '¿\'';

PHP code:
mysql_query("SET CHARACTER SET 'gbk'");
$string = "¿'";
mysql_query("SELECT _latin1 '" . mysql_real_escape_string($string) . "'");
echo mysql_error();
[5 Feb 2005 8:15] Aleksey Kishkin
E:\mysql41\bin>mysql.exe -u root -p test
Enter password: ******
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1 to server version: 4.1.9-nt-max

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> SET CHARACTER SET 'gbk';
Query OK, 0 rows affected (0.11 sec)

mysql> SELECT _latin1 '?\'';
+----+
| ?' |
+----+
| ?' |
+----+
1 row in set (0.01 sec)

mysql>
[5 Feb 2005 8:33] Aleksey Kishkin
question mark seems to be converted to original '?' during the copy&paste but I tested it
in mysql.exe and in query browser with proper character 

set character set gbk;
insert into a values(_latin1'¿\'');
select from a;
[5 Feb 2005 16:04] John Percival
This is the output I am receiving:

C:\Program Files\mysql\MySQL Server 4.1\bin>mysql -u root -p
Enter password: *********
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 674 to server version: 4.1.9-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.
mysql> show columns from a;
+-------+----------+------+-----+---------+-------+
| Field | Type     | Null | Key | Default | Extra |
+-------+----------+------+-----+---------+-------+
| datk  | char(10) | YES  |     | NULL    |       |
+-------+----------+------+-----+---------+-------+
1 row in set (0.00 sec)

mysql> set character set 'gbk';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into a values(_latin1'¿\'');
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near ''¿\''
)' at line 1
mysql> insert into a values('¿\'');
ERROR 1064 (42000): You have an error in your SQL syntax; check the manual that
corresponds to your MySQL server version for the right syntax to use near ''¿\''
)' at line 1
mysql>

I'm not sure what we're doing differently?
[7 Feb 2005 12:50] John Percival
Please could a dev reopen this bug? I believe that it is repeatable.
[7 Feb 2005 15:36] Aleksey Kishkin
reproduced this error on linux slackware  (4.1.9) and on win2003 EE.  
was not reproduced it on winxp. 

PS it happens on character set 'gbk' and ' sjis', works fine on cp1251 koi8 utf8.
[7 Feb 2005 15:44] John Percival
I think that the reason that this is happening is that ¿\ is a valid double-byte
character in gbk (and other char sets) whereas ¿' is not. So when ¿' gets escaped to
¿\', it is interpreting ¿\ as one character, and then gets puzzled about the following
', which is not escaped at all. It's not possible to just use ¿' since ¿ is interpreted
as a single character, which then causes problems when the parser encounters the '.
So neither of the following statements will work:
SELECT _latin1'¿\'';
SELECT _latin1'¿'';
[7 Feb 2005 16:47] Sergei Golubchik
There are two different issues here.

First - mysql_real_escape_string uses character set from mysql_real_connect.
It is not affected by mysql_query("SET CHARACTER SET").
So, your mysql_real_escape_string() escapes in latin1. Of course the result
is wrong in gbk. (it's not easy to fix, as it requires changes in the protocol -
but still we're going to fix it)

Second - _latin1 before the string does not change anything for the parser,
because the statement must be parsed first to see _latin1 introducer. That
is, the string is parsed as gbk string  - and if it is invalid gbk string, no
introducer will help. It cannot be fixed, to assign a spesific meaning to a string of
characters, it must be parsed first, so charset must be known in advance.
[7 Feb 2005 16:57] John Percival
Thanks for the info. Can you suggest some kind of work-around? At the moment, the server
charset is gbk (we've just been using SET CHARACTER SET to model the situation, rather
than having to change my.cnf files), and we want to insert binary data into the table.
For example, how would we insert ¿' into a table when the server/connection is running
in gbk mode? Do we need to do something messy like CONCAT( '¿', "'" ) ?

As far as I can tell from what you say (which is what I had been fearing the answer would
be), it's not possible to reliably use binary data in queries when running in gbk mode. Is
that the case?
[7 Feb 2005 18:35] Sergei Golubchik
A bullet-proof method (which is used by recent mysqldump) is to use hex notation:
0x0123AB9...... and so on.

There is MySQL C API function to do it - mysql_hex_string(),
but I am not sure there's PHP counterpart for it :(
[7 Feb 2005 18:44] John Percival
There is the PHP bin2hex() function, which I think achieves a similar thing.
http://uk.php.net/manual/en/function.bin2hex.php
The main issue that we came across when considering this option is that data run through
that function is approximately 25% bigger on average, which basically means that the you
can only insert about 80% of max_packet at one time.
[7 Feb 2005 19:18] John Percival
To go back to your original point about that _latin1 not having any effect on the
parser...surely this is an issue? Otherwise, what is the point of having the _latin1
introducer, if it doesn't affect the string that follows it to make it parsed as a latin1
string rather than a gbk string...surely once the parser has come across the introducer,
it should switch itself into latin1 'mode' in order to determine the string that follows
it, and then switch back out of it once the complete string has been identified. That
would be how I would expect it to work and that would certainly fix the issues above.
[7 Feb 2005 21:11] Freddie Bingham
Your contention is that mysql_real_escape_string is going off of the character set that
was used to connect with, being latin1 by default.  Is the assumption then if I was to
connect with gbk, then this insert would function? It doesn't as far as my tests go.

Example:

c:\windows\my.ini:

[client]
port=3306
default-character-set=gbk

PHP Code:

$mysqli = mysqli_init();
$mysqli->options(MYSQLI_READ_DEFAULT_FILE, "C:\windows\my.ini");
$mysqli->real_connect('localhost', 'user', 'password', 'database');
echo $mysqli->character_set_name();

This returns:

gbk_chinese_ci

Insertion code (field is mediumblob):

$string = "'";
UPDATE field SET filedata = '" . $mysqli->real_escape_string($string) . "'

Result:

Errormessage: You have an error in your SQL syntax; check the manual that corresponds to
your MySQL server version for the right syntax to use near ''\''' at line 2 

This is the same error that is reported when the client character set is not set and
defaults to latin_swedish_ci
[7 Feb 2005 23:13] Sergei Golubchik
John - we're working on a new parser, to replace current LALR(1) one
generated by yacc. I think it's not possible to do what you want with
the yacc parser, because of the way it works (with look-aheads) - when
it executes the code block for _latin1 (this is the moment MySQL knows
there is an introducer), parser may've already looked ahead and parsed
the next token - that is string constant.

With the new recursive-descent parser it could be possible (perhaps).

With the current parser, a string must be a valid string in the client
charset. Only after parsing it is interpreted as a string in the
charset specified by the introducer.

===

Freddie, it could be. What is the result of
bin2hex(mysql_real_escape_string(your_string))
if the connection charset is gbk ?

May be it's a bug in escaping of multi-byte characters ?
[8 Feb 2005 0:59] Freddie Bingham
bin2hex(mysql_real_escape_string(your_string))

output is:

efbbbf5c27

Are you implying that the bug exists in PHP or in MySQL?
[8 Feb 2005 8:43] Sergei Golubchik
In MySQL, in client library.

When 0xefbbbf27 is escaped in gbk mode it's wrong to put a backslash after 0xbf and
before a single quote, because 0xbf is a start of a multi-byte character.
[8 Feb 2005 12:03] John Percival
Thanks for verifying the bug.

Out of interest, how should that string be escaped?

Because 0xbf27 is not a valid multi-byte character, the parser sees 0x27 (i.e. ') as the
end of the string. But as you rightly point out, 0xbf5c is a valid multi-byte character,
so it is counted as such, and then the parser meets the unescaped 0x27 (i.e. ') and
thinks that it has reached the end of the string. So it seems to be impossible to use the
literal string 0xbf27 (i.e. ¿') in queries when in gbk mode. Surely that can't be right?
[8 Feb 2005 17:49] Freddie Bingham
Just a note that the original test was done against php 5.0.3 using the 4.1.7 client
library. I just updated libMySQL.dll from 4.1.9 and have the same results.

mysqli
Client API version  4.1.9
[9 Feb 2005 0:10] Jim Winstead
The issue that Freddie Bingham has raised was filed as a new bug (#8378). The original
reported bug will require changes to the client protocol and possibly parser, as Sergei
noted in his earlier comment.
[9 Feb 2005 14:48] John Percival
I think that this is probably quite a major issue. Take for example, the following SQL
query:

SELECT * FROM user WHERE username = '¿\' OR userid = 1 -- '

That could easily be produced by:
$username="¿' OR userid = 1 -- ";
mysql_query("SELECT * FROM USER WHERE username = _latin1'" .
mysql_real_escape_string($username) . "'");

What I'm saying is that this bug opens the door to SQL injection attacks, when running in
GBK and some other multi-byte charsets.

I have posted this comment in bugs #8317 and #8378 as I think that it's pertinent to
both: mysql_real_escape_string() needs to throw an error if it can't properly escape the
string in the current charset; the parser needs to be able to use the _latin1 introducer
to determine how it parses the following literal string.
[5 Apr 2005 15:25] Sergei Golubchik
protocol part of this bugreport is moved to http://bugs.mysql.com/bug.php?id=9637
[15 Apr 2005 5:31] xuefer tinys
as i said in another bug(dup of this one), adding new connection option should be much
simple, something like BINARY_ESCAPE, escape/parse all data in binary mode using \
regardless of charset. it's the simplest way to "copy" data from client->server.

is there any benefit to escaping the string in non-binary way.
[12 Jul 2005 17:55] Georg Richter
PHP 5.1 beta and PHP 5.0 (latest version) provide a new function mysqli_set_charset which
fixes this 
behaviour.

mysqli_set_charset uses the libmysql api function mysql_set_character_set which was added
to MySQL 4.1.13 and MySQL 5.1.10.

For examples please check:
http://cvs.php.net/co.php/php-src/ext/mysqli/tests/065.phpt?r=1.3
http://www.php.net/manual/en/function.mysqli-set-charset.php
[20 Jul 2005 1:04] Paul DuBois
mysql_set_character_set() now is documented here:

http://dev.mysql.com/doc/mysql/en/mysql-set-character-set.html