Bug #11614 Clob sent as UTF-8 gets truncated if the UTF-8 encoding is longer than the string length in chars
Submitted: 28 Jun 2005 14:16 Modified: 17 Oct 2005 19:42
Reporter: Roland Littwin
Status: Closed
Category: Connector / J    Severity: S2 (Serious)
Version: 3.1.10    OS: Any (All)
Assigned to: Mark Matthews    CPU Architecture: Any

[28 Jun 2005 14:16] Roland Littwin
Description:
If a stored Clob contains characters that take more than one byte when encoded as UTF-8, the extra bytes cause the trailing characters to be truncated.
E.g. storing the string "öööö" results in "öö" in the database, because the 4 characters need 8 bytes as a UTF-8 stream but only 4 bytes are stored.
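
For illustration, here is a small standalone snippet (not part of the driver; class and variable names are made up) showing the char/byte length mismatch and the resulting truncation:

	import java.io.UnsupportedEncodingException;

	public class Utf8TruncationDemo {
		public static void main(String[] args) throws UnsupportedEncodingException {
			char[] c = "öööö".toCharArray();                // 4 chars
			byte[] utf8 = new String(c).getBytes("UTF-8");  // 8 bytes, 2 per 'ö'

			// mimic the buggy copy: only c.length bytes are kept
			byte[] kept = new byte[c.length];
			System.arraycopy(utf8, 0, kept, 0, c.length);

			System.out.println(c.length);                   // 4
			System.out.println(utf8.length);                // 8
			System.out.println(new String(kept, "UTF-8"));  // "öö"
		}
	}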

How to repeat:
Create a table with a CLOB column using UTF-8 encoding and write a Clob containing non-ASCII characters.

Suggested fix:
In class StringUtils, use the following version, which has only one extra line to correct the length of the array:
// if the String-to-UTF-8 conversion length increases for non-ASCII chars
length += temp.length - c.length; // use the difference in array length as the correction

	public static final byte[] getBytes(char[] c,
			SingleByteCharsetConverter converter, String encoding,
			String serverEncoding, int offset, int length,
			boolean parserKnowsUnicode) throws SQLException {
		try {
			byte[] b = null;

			if (converter != null) {
				b = converter.toBytes(c, offset, length);
			} else if (encoding == null) {
				byte[] temp = new String(c).getBytes();

				b = new byte[length];
				System.arraycopy(temp, offset, b, 0, length);
			} else {
				String s = new String(c);

				byte[] temp = s.getBytes(encoding);

				// if the String-to-UTF-8 conversion length increases for non-ASCII chars
				length += temp.length - c.length; // use the difference in array length as the correction
				b = new byte[length];
				System.arraycopy(temp, offset, b, 0, length);

				if (!parserKnowsUnicode && (encoding.equalsIgnoreCase("SJIS") //$NON-NLS-1$
						|| encoding.equalsIgnoreCase("BIG5") //$NON-NLS-1$
				|| encoding.equalsIgnoreCase("GBK"))) { //$NON-NLS-1$

					if (!encoding.equalsIgnoreCase(serverEncoding)) {
						b = escapeEasternUnicodeByteStream(b, s, offset, length);
					}
				}
			}

			return b;
		} catch (UnsupportedEncodingException uee) {
			throw new SQLException(Messages.getString("StringUtils.10") //$NON-NLS-1$
					+ encoding + Messages.getString("StringUtils.11"),
					SQLError.SQL_STATE_ILLEGAL_ARGUMENT); //$NON-NLS-1$
		}
	}
[30 Jun 2005 9:29] Vasily Kishkin
Could you please post a test case and table definition here?
[30 Jun 2005 13:19] Roland Littwin
Here is a possible test case.

1. Create table and content

CREATE TABLE `kmng`.`error` (
  `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  `text` TEXT NOT NULL,
  PRIMARY KEY(`id`)
)
TYPE = MYISAM
CHARACTER SET utf8 COLLATE utf8_general_ci;

insert into error (id,text) values (1,'');

2. Call the following method and see what it stores. For MySQL 4.1.12 the result is ööö instead of öööööö as expected.

	public void writeLine(Connection con) throws SQLException {
		Statement st = con.createStatement();
		ResultSet res = st.executeQuery("SELECT text FROM error WHERE id=1");
		if(res.next()) {
			Clob c = res.getClob(1);
			c.truncate(0);
			c.setString(1,"öööööö");
			PreparedStatement stmt = con.prepareStatement ( "UPDATE error SET text = ? WHERE id=1");
			stmt.setClob ( 1, c );
			stmt.executeUpdate();
			stmt.close();
		}
		res.close();
		st.close();
	}
[4 Jul 2005 15:49] Vasily Kishkin
Tested on Win 2000 SP4, JDK 1.5, Connector/J 3.1.8. A test case is attached.
[4 Jul 2005 15:49] Vasily Kishkin
Test case in UTF-8

Attachment: test.java (text/java), 940 bytes.

[5 Jul 2005 15:06] Roland Littwin
The first fix suggestion does not work correctly for clobs with more than 8192 bytes and non-ASCII characters. This is because getBytes is called in a loop on 8192-character chunks, so in the last chunk, which holds fewer than 8192 valid characters, the unused characters left over from the previous iteration also inflate the byte array created by the encoding. The new fix converts only the first length characters, so the problem does not occur.
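
As a minimal sketch of the chunking effect described above (the buffer handling here is simplified and hypothetical, not the driver's actual streaming code):

	public class ChunkLengthDemo {
		public static void main(String[] args) throws java.io.UnsupportedEncodingException {
			char[] buffer = new char[8192];
			java.util.Arrays.fill(buffer, 'ö');     // stale non-ASCII chars from a previous chunk
			int length = 10;                        // only the first 10 chars are valid in the last chunk
			System.arraycopy("abcdefghij".toCharArray(), 0, buffer, 0, length);

			// converting the whole buffer also encodes the 8182 stale chars
			byte[] wholeBuffer = new String(buffer).getBytes("UTF-8");
			// converting only the used part gives the correct length
			byte[] usedPartOnly = new String(buffer, 0, length).getBytes("UTF-8");

			System.out.println(wholeBuffer.length);   // 16374 (10 + 2 * 8182)
			System.out.println(usedPartOnly.length);  // 10
		}
	}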

Here is a fix suggestion that also works for clobs > 8192 bytes and is faster, too :-)

	public static final byte[] getBytes(char[] c,
			SingleByteCharsetConverter converter, String encoding,
			String serverEncoding, int offset, int length,
			boolean parserKnowsUnicode) throws SQLException {
		try {
			byte[] b = null;

			if (converter != null) {
				b = converter.toBytes(c, offset, length);
			} else if (encoding == null) {
				byte[] temp = new String(c).getBytes();

				b = new byte[length];
				System.arraycopy(temp, offset, b, 0, length);
			} else {
				String s = new String(c, 0, length); // copy only the used part, so old chars from a previous iteration don't disturb the result

				byte[] temp = s.getBytes(encoding);

				// if the String-to-UTF-8 conversion length increases for non-ASCII chars
				length = temp.length; // the new length is the encoded byte length
				b = new byte[length];
				System.arraycopy(temp, offset, b, 0, length);

				if (!parserKnowsUnicode && (encoding.equalsIgnoreCase("SJIS") //$NON-NLS-1$
						|| encoding.equalsIgnoreCase("BIG5") //$NON-NLS-1$
				|| encoding.equalsIgnoreCase("GBK"))) { //$NON-NLS-1$

					if (!encoding.equalsIgnoreCase(serverEncoding)) {
						b = escapeEasternUnicodeByteStream(b, s, offset, length);
					}
				}
			}

			return b;
		} catch (UnsupportedEncodingException uee) {
			throw new SQLException(Messages.getString("StringUtils.10") //$NON-NLS-1$
					+ encoding + Messages.getString("StringUtils.11"),
					SQLError.SQL_STATE_ILLEGAL_ARGUMENT); //$NON-NLS-1$
		}
	}
[20 Jul 2005 15:02] Arnaud de Bossoreille
I encountered the same problem: a text field gets corrupted because it contains UTF-8 characters and is longer than 8192 characters.

The problem occurs with at least the 3.1.8 and 3.1.10 versions, but it does not occur with the 3.2.0-alpha version.

The last fix works fine, but I think patching the other cases of the surrounding condition would be better (I put a patch here: http://esca.idm.fr/~bozo/mysql/mysql-clob-utf8.patch).
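
For illustration only (the actual patch is the one linked above and may differ), a sketch of how the encoding == null branch could receive the same correction, assuming offset is 0 on this path as in the reporter's fix:

			} else if (encoding == null) {
				// convert only the first length chars with the platform default
				// charset and use the resulting byte length, mirroring the fix
				// for the explicit-encoding branch
				byte[] temp = new String(c, 0, length).getBytes();

				b = temp;
			}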

It would be nice to have a stable release with this correction.

Regards,

Arnaud.