Bug #108271 String gets wrong encoding while writing to DB
Submitted: 24 Aug 2022 16:12 Modified: 9 Mar 2023 15:37
Reporter: Jan-Peter Wilhelms Email Updates:
Status: Closed Impact on me:
Category:Connector / J Severity:S1 (Critical)
Version:8.0.30 OS:Any
Assigned to: Filipe Silva CPU Architecture:Any

[24 Aug 2022 16:12] Jan-Peter Wilhelms
Writing a string to a column "LONGTEXT COLLATE latin1_general_cs" is broken depending on the client default charset.

How to repeat:
With a mysql server 5.7 I created a table with a column

LONGTEXT COLLATE latin1_general_cs

When inserting a string into this column on a system that has "file.encoding=utf8" through Connector/J the string is stored as utf8 bytes on the server (despite the collation). When retrieving this data the data is corrupted.

With the setting "file.encoding=latin1" storing and reading works fine.

Connector/J in version 8.0.28 did not show this behaviour. It startet with 8.0.29.
[18 Oct 2022 17:01] Filipe Silva
Thank you for your interest in MySQL Connector/J and for reporting this issue.

Can you please provide a test case showing up the issue?
[19 Nov 2022 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[21 Feb 2023 14:38] Jan-Peter Wilhelms
Simple main demonstrating the issue

Attachment: MyMain.java (application/octet-stream, text), 1.63 KiB.

[27 Feb 2023 17:26] Filipe Silva
Thank you Jan-Peter.
[4 Mar 2023 0:03] Filipe Silva
Hi Jean-Peter,

Note that the result of your test in Connector/J 8.0.28 wasn't correct either. You can verify that by switching to server-side prepared statements by adding `useServerPrepStmts=true` to the connection string. The result would be an error and that's the expected correct value in both cases.

This is so because by doing `setCharacterStream()` you are not really sending a true string value to the server but instead a sequence of bytes that represents the string characters encoded in the character set of the client/JVM or in the character set specified in `clobCharacterEncoding`. The server would then take those bytes and decode them in the target column charset (Latin1, in your case), ignoring that they may not have been encoded in that charset to begin with.

In order to make `setCharacterStream()` right you'd have to play with `clobCharacterEncoding` or `caracterEncoding` and make them align with the charset of the target column. There is a bug in this specific part though. I'll set this report as "verified" because of this.

[9 Mar 2023 15:37] Daniel So
Added the following entry to the Connector/J 8.0.33 changelog: 

"For Connector/J 8.0.29 or later, client-side prepared statements failed with the error "Incorrect string value" when setCharacterStream() was used, if the JVM's character encoding did not match the target column's character set. It was because, in that case, the character encoding of the JVM was used when sending character streams to the server, which was unable to decode them as they were expected to be encoded with the column's character set. With this fix, the character set specified by either the connection property clobCharacterEncoding or characterEncoding is used instead, for both client-side and server-side prepared statements, which is the expected behavior."

Also, Section 6.7, Using Character Sets and Unicode, of the MySQL Connector/J 8.0 Developer Guide (https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html) has been updated for more clarifications on the subject. The updates will be visible with the Connector/J 8.0.33 release.