Bug #11614 | Clob sent as UTF-8 gets truncated if the UTF-8 encoding is longer than the string in chars | ||
---|---|---|---|
Submitted: | 28 Jun 2005 14:16 | Modified: | 17 Oct 2005 19:42 |
Reporter: | Roland Littwin | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | Connector / J | Severity: | S2 (Serious) |
Version: | 3.1.10 | OS: | Any (All) |
Assigned to: | Mark Matthews | CPU Architecture: | Any |
[28 Jun 2005 14:16]
Roland Littwin
[30 Jun 2005 9:29]
Vasily Kishkin
Could you please post a test case and the table definition here?
[30 Jun 2005 13:19]
Roland Littwin
Here is a possible test case.

1. Create the table and content:

```sql
CREATE TABLE `kmng`.`error` (
  `id` INTEGER UNSIGNED NOT NULL AUTO_INCREMENT,
  `text` TEXT NOT NULL,
  PRIMARY KEY(`id`)
) TYPE = MYISAM CHARACTER SET utf8 COLLATE utf8_general_ci;

insert into error (id,text) values (1,'');
```

2. Call the following method and check what it stores. With MySQL 4.1.12 the stored value is ööö instead of öööööö as expected.

```java
public void writeLine(Connection con) throws SQLException {
    Statement st = con.createStatement();
    ResultSet res = st.executeQuery("SELECT text FROM error WHERE id=1");

    if (res.next()) {
        Clob c = res.getClob(1);
        c.truncate(0);
        c.setString(1, "öööööö");

        PreparedStatement stmt = con.prepareStatement(
                "UPDATE error SET text = ? WHERE id=1");
        stmt.setClob(1, c);
        stmt.executeUpdate();
        stmt.close();
    }

    res.close();
    st.close();
}
```
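The symptom above follows from a char-count/byte-count mismatch: "ö" is one Java char but two bytes in UTF-8, so a buffer sized by character count keeps only half of the encoded value. A minimal standalone sketch (illustration only, not driver code; the class name is made up) reproduces the halving:

```java
import java.nio.charset.StandardCharsets;

// Why the Clob value is cut in half: "ö" occupies one Java char but
// two bytes in UTF-8, so sizing a byte buffer by character count
// keeps only the first half of the encoded string.
public class ClobTruncationDemo {
    public static void main(String[] args) {
        String value = "öööööö";
        byte[] utf8 = value.getBytes(StandardCharsets.UTF_8);

        System.out.println("chars: " + value.length()); // 6
        System.out.println("bytes: " + utf8.length);    // 12

        // Copy only value.length() bytes, as the buggy code path does:
        byte[] truncated = new byte[value.length()];
        System.arraycopy(utf8, 0, truncated, 0, truncated.length);

        String result = new String(truncated, StandardCharsets.UTF_8);
        System.out.println(result); // ööö
    }
}
```

The first 6 of the 12 encoded bytes decode back to exactly three "ö" characters, matching the reported result.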
[4 Jul 2005 15:49]
Vasily Kishkin
Tested on Win 2000 SP4, JDK 1.5, Connector/J 3.1.8. Test case is attached.
[4 Jul 2005 15:49]
Vasily Kishkin
Test case in UTF-8
Attachment: test.java (text/java), 940 bytes.
[5 Jul 2005 15:06]
Roland Littwin
The first fix suggestion does not work correctly for Clobs with more than 8192 bytes and non-ASCII characters. This is because getBytes() is called in a loop over 8192-character chunks, so in the last chunk, which is shorter than 8192 characters, the unused characters left over from the previous iteration could also inflate the byte array created by the encoding. The new fix converts only the first `length` characters, so the problem does not occur.

Fix suggestion that also works for Clobs > 8192 bytes and is also faster :-)

```java
public static final byte[] getBytes(char[] c,
        SingleByteCharsetConverter converter, String encoding,
        String serverEncoding, int offset, int length,
        boolean parserKnowsUnicode) throws SQLException {
    try {
        byte[] b = null;

        if (converter != null) {
            b = converter.toBytes(c, offset, length);
        } else if (encoding == null) {
            byte[] temp = new String(c).getBytes();
            b = new byte[length];
            System.arraycopy(temp, offset, b, 0, length);
        } else {
            // Copy only the used part, so stale chars from the
            // previous iteration don't disturb the result
            String s = new String(c, 0, length);
            byte[] temp = s.getBytes(encoding);

            // the encoded form may be longer than the char count
            // for non-ASCII characters
            length = temp.length; // new length!
            b = new byte[length];
            System.arraycopy(temp, offset, b, 0, length);

            if (!parserKnowsUnicode
                    && (encoding.equalsIgnoreCase("SJIS") //$NON-NLS-1$
                        || encoding.equalsIgnoreCase("BIG5") //$NON-NLS-1$
                        || encoding.equalsIgnoreCase("GBK"))) { //$NON-NLS-1$
                if (!encoding.equalsIgnoreCase(serverEncoding)) {
                    b = escapeEasternUnicodeByteStream(b, s, offset, length);
                }
            }
        }

        return b;
    } catch (UnsupportedEncodingException uee) {
        throw new SQLException(Messages.getString("StringUtils.10") //$NON-NLS-1$
                + encoding + Messages.getString("StringUtils.11"), //$NON-NLS-1$
                SQLError.SQL_STATE_ILLEGAL_ARGUMENT);
    }
}
```
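The stale-chunk effect described above can be shown in isolation. In this minimal sketch (a standalone illustration with a hypothetical 8-char buffer standing in for the driver's 8192-char buffer), encoding the whole reused buffer picks up leftover characters from the previous pass, while encoding only the first `length` characters, as in the patch, does not:

```java
import java.nio.charset.StandardCharsets;

// Sketch of the chunking bug: a reused char buffer's tail still holds
// characters from the previous read, so encoding the whole buffer
// produces too many bytes for the final, shorter chunk.
public class StaleChunkDemo {
    public static void main(String[] args) {
        char[] buffer = new char[8];             // stands in for the 8192-char buffer
        "öööööööö".getChars(0, 8, buffer, 0);    // previous iteration filled it
        "ab".getChars(0, 2, buffer, 0);          // last chunk: only 2 chars are valid
        int length = 2;

        // Buggy: encodes all 8 chars, including 6 stale "ö"s (2 bytes each in UTF-8)
        byte[] buggy = new String(buffer).getBytes(StandardCharsets.UTF_8);
        // Patched: encodes only the first `length` chars
        byte[] fixed = new String(buffer, 0, length).getBytes(StandardCharsets.UTF_8);

        System.out.println("buggy bytes: " + buggy.length); // 14
        System.out.println("fixed bytes: " + fixed.length); // 2
    }
}
```

The 14-byte result (2 valid bytes plus 12 stale ones) is exactly the kind of inflated array the comment reports for the last chunk of a long UTF-8 Clob.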
[20 Jul 2005 15:02]
Arnaud de Bossoreille
I encountered the same problem: a text field gets corrupted because it contains UTF-8 characters and is longer than 8192 characters. The problem occurs with at least the 3.1.8 and 3.1.10 versions, but not with 3.2.0-alpha. The last fix works fine, but I think patching the other cases of the surrounding condition would be better (I put a patch here: http://esca.idm.fr/~bozo/mysql/mysql-clob-utf8.patch). It would be nice to have a stable release with this correction. Regards, Arnaud.