Bug #105449 connectionCollation ignored if characterEncoding is set
Submitted: 3 Nov 2021 15:50 Modified: 20 Dec 2022 17:45
Reporter: Andreas Sachs Email Updates:
Status: Closed Impact on me:
None 
Category:Connector / J Severity:S3 (Non-critical)
Version:8.0.27 OS:Linux
Assigned to: Filipe Silva CPU Architecture:Any

[3 Nov 2021 15:50] Andreas Sachs
Description:
Server: 5.7.36
Column: latin1_swedish_ci
ConnectorJ 8.0.27

connectionCollation is ignored if you specify characterEncoding.

I think this is a bug because according to the documentation connectionCollation should override characterEncoding.

connectionCollation: ... This property overrides the value of 'characterEncoding' with the character set this collation belongs to.

How to repeat:
Teststring: Žćaaa

8.0.22 without characterEncoding property : Ž?aaa (= Value inserted in the DB)
8.0.22 with characterEncoding property latin1: ??aaa
8.0.22 with characterEncoding property latin1 and connectionCollation property: Ž?aaa

8.0.27 without characterEncoding property : Incorrect string value
8.0.27 with characterEncoding property latin1: ??aaa
8.0.27 with characterEncoding property latin1 and connectionCollation property: ??aaa   <====== i think this is a bug, because connectionCollation is not respected. I would expect Ž?aaa
8.0.27 without characterEncoding property and with connectionCollation property: Ž?aaa
[4 Nov 2021 13:41] Filipe Silva
Hi Andreas,

Thank you for your interest in MySQL and Connector/J.

When creating connections Connector/J has to infer the correct character set and collation to set in the connection. While doing so it also tries its best to follow the user's instructions.

Having said that, let me go through each one of your scenarios.

> 8.0.27 without characterEncoding property : Incorrect string value

Are you sure about this? Without any setting the driver uses UTF-8 by default which is able to handle all characters. Can you provide a test case for this?

> 8.0.27 with characterEncoding property latin1: ??aaa

With "characterEncoding=latin1" Connector/J infers that the connection collation is "latin1_swedish_ci", i.e., the default collation for latin1 in the server. As to the Java character set, "ISO-8859-1" will be used since "latin1" is one of its known aliases.

> 8.0.27 with characterEncoding property latin1 and connectionCollation property: ??aaa

With "characterEncoding=latin1&connectionCollation=latin1_swedish_ci" Connector/J rightfully takes the connection collation setting, which has priority over the character encoding, and sets it in the connection. The corresponding character set in the same connection is "latin1". As to the Java character set to use, it will be "ISO-8859-1" since it is a correct match for the character encoding "latin1" you have specified in the connection string. Essentially the same as before.

> 8.0.27 without characterEncoding property and with connectionCollation property: Ž?aaa

With "connectionCollation=latin1_swedish_ci" alone, Connector/J infers the character set to use in the connection. This will be "latin1", i.e., the character set that corresponds to the specified collation. There's a catch with regard to the Java character set it will use, though. Historically, Connector/J used "windows-1252" as the Java character set corresponding to MySQL's latin1, as such and in order to avoid breaking existing applications, this is still the default mapping today. So, since no characterEncoding has been explicitly specified, Connector/J has no option than to use the default mapping for MySQL's "latin1" character set - "windows-1252". Also mind that "windows-1252" and "ISO-8859-1" are similar, but not the same.

The difference between 8.0.27 and 8.0.22 is that the newer version tries to respect the configurations specified by the user in the connection string, as long as they are compatible.

Some examples:

Setting "characterEncoding=windows-1252&connectionCollation=latin1_swedish_ci" or "characterEncoding=cp1252&connectionCollation=latin1_swedish_ci" would have the same effect as the fourth case you presented. In these connection strings the values for characterEncoding are valid names/aliases for "windows-1252".

Setting "characterEncoding=utf-8&connectionCollation=latin1_swedish_ci" would have the same outcome since "utf-8" is not a valid mapping for MySQL's "latin1" (inferred from the collation "latin1_swedish_ci"). As a result Connector/J falls back to its Java character set default mapping, i.e., "windows-1252".

Setting "characterEncoding=CP819&connectionCollation=latin1_swedish_ci", on the other hand, would have the same outcome as your second or third cases because, again, "CP819" is an alias for "ISO-8859-1".

I hope this clarifies it. If you have further comments, please reopen this bug or file a new one and mention this one in the description.

Thanks,
[4 Nov 2021 14:25] Andreas Sachs
Hello, 
thanks for your explanation.

>> 8.0.27 without characterEncoding property : Incorrect string value
> Are you sure about this? Without any setting the driver uses UTF-8 by default which is able to handle all characters. Can you provide a test case for this?
I don't have a test case. But maybe the stacktrace helps:

org.hibernate.exception.GenericJDBCException: could not execute statement
	at org.hibernate.exception.internal.StandardSQLExceptionConverter.convert(StandardSQLExceptionConverter.java:42)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:113)
	at org.hibernate.engine.jdbc.spi.SqlExceptionHelper.convert(SqlExceptionHelper.java:99)
	....

Caused by: java.sql.SQLException: Incorrect string value: '\xC4\x87' for column 'name' at row 1
	at com.mysql.cj.jdbc.exceptions.SQLError.createSQLException(SQLError.java:129)
	at com.mysql.cj.jdbc.exceptions.SQLExceptionsMapping.translateException(SQLExceptionsMapping.java:122)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeInternal(ClientPreparedStatement.java:953)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1098)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdateInternal(ClientPreparedStatement.java:1046)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeLargeUpdate(ClientPreparedStatement.java:1371)
	at com.mysql.cj.jdbc.ClientPreparedStatement.executeUpdate(ClientPreparedStatement.java:1031)
	at sun.reflect.GeneratedMethodAccessor134.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.tomcat.jdbc.pool.StatementFacade$StatementProxy.invoke(StatementFacade.java:118)
	at com.sun.proxy.$Proxy182.executeUpdate(Unknown Source)
	at org.hibernate.engine.jdbc.internal.ResultSetReturnImpl.executeUpdate(ResultSetReturnImpl.java:197)

>8.0.27 with characterEncoding property latin1 and connectionCollation property: ??aaa   <====== i think this is a bug, because connectionCollation is not respected. I would expect Ž?aaa
>8.0.27 without characterEncoding property and with connectionCollation property: Ž?aaa

Maybe it's a problem of the documentation:

"connectionCollation: ... This property overrides the value of 'characterEncoding' with the character set this collation belongs to."

If connectionCollation overrides characterEncoding i would expect that characterEncoding is ignored
[4 Nov 2021 15:01] Filipe Silva
Ok. I agree the documentation could be more explicit regarding the real behavior.

Regarding the error above. Can you extract the query being executed and the table definition?
[4 Nov 2021 15:19] Filipe Silva
Actually, I don't need your query neither the table structure. Let me dig into this a little deeper.
[4 Nov 2021 15:57] Andreas Sachs
I removed some columns. 

CREATE DATABASE IF NOT EXISTS `dbt_202111021711` /*!40100 DEFAULT CHARACTER SET latin1 COLLATE latin1_german2_ci */;
USE `dbt_202111021711`;

CREATE TABLE IF NOT EXISTS `tb_user` (
  `userId` int(11) NOT NULL AUTO_INCREMENT,
  `username` varchar(50) DEFAULT NULL,
  `name` varchar(255) NOT NULL,
  `vorname` varchar(255) NOT NULL,
  `email` varchar(255) DEFAULT NULL,
  `geburtsdatum` date DEFAULT NULL,
  
  PRIMARY KEY (`userId`),
  UNIQUE KEY `username` (`username`),
  KEY `name` (`name`,`vorname`,`geburtsdatum`),
  KEY `vorname` (`vorname`,`geburtsdatum`)
) ENGINE=MyISAM AUTO_INCREMENT=107124 DEFAULT CHARSET=latin1;

   insert 
    into
        tb_user
        (  email,  geburtsdatum,  name, username,  vorname) 
    values
        (?, ?, ?, ?, ?)
[4 Nov 2021 17:12] Filipe Silva
Hi Andreas,

The problem you see there is because the character 'ć', encoded in UTF-8 as "\xC4\x87" has no representation in Latin1.

You can even see this by running the following in a mysql client using UTF-8:

mysql> select convert('ć' using latin1);
+----------------------------+
| convert('ć' using latin1)  |
+----------------------------+
| ?                          |
+----------------------------+
1 row in set, 1 warning (0,01 sec)

And this is also why you always see a question mark in place of this character in your examples above, even in the last case where the character 'Ž' was displayed correctly.

Now, the situation with 'Ž' is different because this character exists in Windows-1252 but not in ISO-8859-1. And this is why you can see it when the Java character set is set as the first but not with the second. Actually, in one case Connector/J is able to send to the server this character's code-point while in the other case it sends the code point for the question mark, obviously because there's no code-point for such character.

If you'd like to see more details about this matter these two pages explain very well the differences between the character sets we are talking about:
- https://www.i18nqa.com/debug/table-iso8859-1-vs-windows-1252.html
- https://www.i18nqa.com/debug/table-iso8859-1-vs-iso8859-15.html

As a final note, MySQL's latin1 matches cp1252 (https://dev.mysql.com/doc/refman/8.0/en/charset-we-sets.html), which is why the character 'Ž' can be encoded correctly in a latin1 table/column, as long as the client application is able to encode it too.

Conclusion:

1. When you "characterEncoding=latin1" you are instructing Connector/J to use Java's Latin1 (A.K.A. ISO-8859-1) character set.

2. The character 'ć' has no representation in any of the involved character sets so it is always replaced by ? or by an error when trying to convert from its UTF-8 encoding.

3. If you want to align the Java character set used by Connector/J with MySQL's latin1 you must: either not set a value for "characterEncoding" or set it as "characterEncoding=windows-1252", or one of its aliases, such as "characterEncoding=cp1252".

I'll set this bug as "verified" but only to get us improving the description of the property "connectionCollation".

Thank you
[4 Nov 2021 17:29] Andreas Sachs
Hi Filipe,

>I'll set this bug as "verified" but only to get us improving the description of the property "connectionCollation".

I think that the inaccurate documentation has led me on the wrong track and I therefore suspected a bug. Adapting the description is certainly helpful. I don't have any more problems either, since I now know how to set the properties.

Thanks for your investigation.
[20 Dec 2022 17:45] Daniel So
Posted by developer:
 
This has become a bug for the documentation, which has been fixed now, with the documentation changes visible with the 8.0.32 release.