Bug #3511 bug in com.mysql.jdbc.StringUtils.java escapeSJISByteStream()
Submitted: 19 Apr 2004 22:48
Modified: 24 Apr 2004 8:51
Reporter: Shijie Chen
Status: Closed
Category: Connector / J
Severity: S2 (Serious)
Version: until 3.0/3.1-nightly-2004041
OS: Any (all)
Assigned to:
CPU Architecture: Any

[19 Apr 2004 22:48] Shijie Chen
Description:
escapeSJISByteStream() is used to escape a 0x5c that occurs in the high byte of double-byte characters in encodings such as GBK, BIG5, and SJIS.

I think the author may not be familiar with GBK and BIG5.

GBK character set
GBK/2: B0A1-F7FE CJK UNIFIED IDEOGRAPH
GBK/3: 8140-A0FE CJK UNIFIED IDEOGRAPH
GBK/4: AA40-FEA0 CJK UNIFIED IDEOGRAPH
GBK/1: A1A1-A9FE symbol
GBK/5: A840-A9A0 symbol
GB2312 is a subset of GBK: GB2312 = GBK/1 + GBK/2
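
The ranges above can be read as rectangles of first byte by second byte (for example, GBK/2 covers first bytes 0xB0-0xF7 with second bytes 0xA1-0xFE). A minimal sketch of that reading follows; the class name GbkRegion and the method regionOf are hypothetical names used only for illustration, and the exclusion of 0x7F as a second byte is an assumption added here, not something stated in the list.

```java
public class GbkRegion {
    // Each row: {firstLow, firstHigh, secondLow, secondHigh}, read from the list above.
    private static final int[][] REGIONS = {
        {0xA1, 0xA9, 0xA1, 0xFE}, // GBK/1 symbols
        {0xB0, 0xF7, 0xA1, 0xFE}, // GBK/2 ideographs
        {0x81, 0xA0, 0x40, 0xFE}, // GBK/3 ideographs
        {0xAA, 0xFE, 0x40, 0xA0}, // GBK/4 ideographs
        {0xA8, 0xA9, 0x40, 0xA0}, // GBK/5 symbols
    };

    /** Returns 1..5 for GBK/1..GBK/5, or 0 if the byte pair is in none of them. */
    static int regionOf(int first, int second) {
        if (second == 0x7F) {
            return 0; // assumption: 0x7F is never a second byte
        }
        for (int i = 0; i < REGIONS.length; i++) {
            int[] r = REGIONS[i];
            if (first >= r[0] && first <= r[1] && second >= r[2] && second <= r[3]) {
                return i + 1;
            }
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(regionOf(0xD6, 0xD0)); // U+4E2D encoded in GBK
        System.out.println(regionOf(0x81, 0x5C)); // a pair whose second byte is 0x5C
    }
}
```

Taken together, the first bytes of the five regions span 0x81 through 0xFE.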

In com.mysql.jdbc.StringUtils.java, lines 311-312:
if (((loByte >= 0x81) && (loByte <= 0x9F))
                        || ((loByte >= 0xE0) && (loByte <= 0xFC))) {

This test does not cover the whole GBK character set.
If we use the connection URL
"jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=GBK"

and insert a string field that contains Chinese characters, a StringIndexOutOfBoundsException is thrown:

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 4
        at java.lang.String.charAt(String.java:460)
        at com.mysql.jdbc.StringUtils.escapeSJISByteStream(StringUtils.java:280)
        at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:105)
        at com.mysql.jdbc.PreparedStatement.setString(PreparedStatement.java:1068)
        at TestSql.main(TestSql.java:19)
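
To make the gap concrete, here is a small sketch that applies the range test quoted from lines 311-312 to the first byte of each character of the test string "\u4e2d\u6587" once it is encoded as GBK. The class name OldRangeCheck and the method oldCheck are hypothetical, and the sketch assumes the running JVM provides the GBK charset. It only shows that these bytes fall outside the tested ranges; it does not reproduce the driver's internal logic.

```java
import java.nio.charset.Charset;

public class OldRangeCheck {
    // The range test quoted from StringUtils.java lines 311-312.
    static boolean oldCheck(int b) {
        return ((b >= 0x81) && (b <= 0x9F)) || ((b >= 0xE0) && (b <= 0xFC));
    }

    public static void main(String[] args) {
        // Assumption: the GBK charset is available in this JVM.
        byte[] bytes = "\u4e2d\u6587".getBytes(Charset.forName("GBK"));
        // Both characters are double-byte in GBK, so step two bytes at a time.
        for (int i = 0; i + 1 < bytes.length; i += 2) {
            int first = bytes[i] & 0xFF; // convert the signed byte to 0..255
            System.out.printf("first byte 0x%02X passes old check: %b%n",
                    first, oldCheck(first));
        }
    }
}
```

The two first bytes are 0xD6 and 0xCE, and neither lies in 0x81-0x9F or 0xE0-0xFC, so the old test does not treat them as the start of a double-byte character.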

How to repeat:
CREATE TABLE mytable (name VARCHAR(20));

import java.sql.*;

public class TestSql {
	
	public static String dbDriver = "com.mysql.jdbc.Driver";
	// Using gb2312 instead of GBK causes no problem, because escapeSJISByteStream() is then not called.
	public static String dbURL = "jdbc:mysql://localhost/test?useUnicode=true&characterEncoding=GBK";
	public static String user = "root";
	public static String password = "root";
	
	public static void main(String[] args) 
	throws ClassNotFoundException, SQLException {
		
		Class.forName(dbDriver);
		Connection conn = DriverManager.getConnection(dbURL, user, password);
		
		PreparedStatement stmt = conn.prepareStatement(
			"insert into mytable (name) values ( ? )");
		stmt.setString(1, "\u4e2d\u6587"); // two Chinese characters
		// stmt.setString(1, "abcd"); // inserting "abcd" instead causes no problem
		stmt.execute();
		stmt.close();
		
		conn.close();

	}
}

Exception in thread "main" java.lang.StringIndexOutOfBoundsException: String index out of range: 4
        at java.lang.String.charAt(String.java:460)
        at com.mysql.jdbc.StringUtils.escapeSJISByteStream(StringUtils.java:280)
        at com.mysql.jdbc.StringUtils.getBytes(StringUtils.java:105)
        at com.mysql.jdbc.PreparedStatement.setString(PreparedStatement.java:1068)
        at TestSql.main(TestSql.java:19)

Suggested fix:
Very easy.

In com.mysql.jdbc.StringUtils.java, lines 311-312:
if (((loByte >= 0x81) && (loByte <= 0x9F))
                        || ((loByte >= 0xE0) && (loByte <= 0xFC))) {

Replace it with
if (loByte >= 0x80) {....
and everything works.

For double-byte character sets such as GBK and BIG5, the high bit of loByte is always 1, which separates double-byte characters from
standard ASCII.

In addition, escaping the 0x5c does not need the original String; the rule above is enough.
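
A minimal sketch of the proposed rule, working on the encoded bytes only, might look like the following. The class name TrailByteEscape and the method escape are hypothetical and this is not the driver's actual code. It assumes every byte >= 0x80 starts a two-byte character, which matches the GBK ranges listed earlier, and it deliberately leaves single-byte 0x5c untouched, since the report does not say how that case should be handled.

```java
import java.io.ByteArrayOutputStream;

public class TrailByteEscape {
    /**
     * Doubles every 0x5C that appears as the second byte of a double-byte
     * character. Assumes any byte >= 0x80 begins a two-byte character.
     */
    static byte[] escape(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream(in.length * 2);
        for (int i = 0; i < in.length; i++) {
            int first = in[i] & 0xFF; // convert the signed byte to 0..255
            out.write(first);
            if (first >= 0x80 && i + 1 < in.length) {
                int second = in[++i] & 0xFF; // consume the second byte of the pair
                out.write(second);
                if (second == 0x5C) {
                    out.write(0x5C); // escape the backslash-valued second byte
                }
            }
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] escaped = escape(new byte[] {(byte) 0x81, 0x5C, 'a'});
        for (byte b : escaped) {
            System.out.printf("%02X ", b & 0xFF);
        }
        System.out.println();
    }
}
```

Because the decision depends only on the previous byte, the loop never needs to index into the original String, which is the point made above.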

For this reason, many Chinese users still use mm.mysql.
mm.mysql does have a problem in dealing with 0x5c:
the high byte of some characters is 0x5c, but these characters are seldom used.

Thank you on behalf of Chinese users.
[24 Apr 2004 8:51] Mark Matthews
Thank you for your bug report. This issue has been committed to our
source repository of that product and will be incorporated into the
next release.

If necessary, you can access the source repository and build the latest
available version, including the bugfix, yourself. More information 
about accessing the source trees is available at
    http://www.mysql.com/doc/en/Installing_source_tree.html