Bug #81196 Connector/J Not Following Database character set
Submitted: 26 Apr 2016 1:40 Modified: 9 Aug 2018 14:50
Reporter: Jason Hunter Email Updates:
Status: Closed Impact on me:
None 
Category:Connector / J Severity:S2 (Serious)
Version:5.1.38+ OS:Any
Assigned to: CPU Architecture:Any
Tags: utf8mb4

[26 Apr 2016 1:40] Jason Hunter
Description:
When server is set to use the default latin1 character set and collation there is no way to insert 4 byte unicode characters with Connector/J. Even if the databaseis set to use character set utf8mb4.

How to repeat:
my.cnf
[mysqld]
character-set-server=latin1
collation-server=latin1_swedish_ci

CREATE DATABASE `TestDb` CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
use TestDb;

CREATE TABLE `TestTable` (
  `id`                int          AUTO_INCREMENT NOT NULL,
  `name`         varchar(50)  NULL,
  CONSTRAINT `PK_LastViewedMatch_id` PRIMARY KEY  (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 CHARSET=utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

With connector/j (in mysql workbench) try to insert 4 byte unicode character:
INSERT INTO TestTable(name) VALUES ('𠜎');
Error : Incorrect string value: '\xF0\xA0\x9C\x8E' for column 'name' at row 1

Adding ?useUnicode=true&characterEncoding=utf8&characterSetResults=utf8&connectionCollation=utf8mb4_unicode_ci has no effect.

SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';
character_set_client	utf8
character_set_connection	utf8
character_set_database	utf8mb4
character_set_filesystem	binary
character_set_results	utf8
character_set_server	latin1
character_set_system	utf8
collation_connection	utf8_general_ci
collation_database	utf8mb4_unicode_ci
collation_server	latin1_swedish_ci

Using the command line works:
mysql> INSERT INTO TestTable(name) VALUES ('𠜎');
Query OK, 1 row affected (0.00 sec)

mysql> Select * from TestTable;
+----+------+
| id | name |
+----+------+
|  1 | 𠜎     |
+----+------+
1 row in set (0.00 sec)

mysql> SHOW VARIABLES WHERE Variable_name LIKE 'character\_set\_%' OR Variable_name LIKE 'collation%';                                                                                                      +--------------------------+--------------------+
| Variable_name            | Value              |
+--------------------------+--------------------+
| character_set_client     | latin1             |
| character_set_connection | latin1             |
| character_set_database   | utf8mb4            |
| character_set_filesystem | binary             |
| character_set_results    | latin1             |
| character_set_server     | latin1             |
| character_set_system     | utf8               |
| collation_connection     | latin1_swedish_ci  |
| collation_database       | utf8mb4_unicode_ci |
| collation_server         | latin1_swedish_ci  |
+--------------------------+--------------------+
10 rows in set (0.00 sec)

Suggested fix:
In the MysqlIO.java file the serverCharsetIndex is set during the handshake process and then used again in the ConnectionImpl.java configureClientCharacterSet and since latin1 with an ID of 8 is not part of the CharsetMapping.UTF8MB4_INDEXES it doesn't let the character_set_connection or character_set_client get set to utf8mb4. 

                            boolean utf8mb4Supported = versionMeetsMinimum(5, 5, 2);
                            boolean useutf8mb4 = utf8mb4Supported && (CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex));

                            if (!getUseOldUTF8Behavior()) {
                                if (dontCheckServerMatch || !characterSetNamesMatches("utf8") || (utf8mb4Supported && !characterSetNamesMatches("utf8mb4"))) {
                                    execSQL(null, "SET NAMES " + (useutf8mb4 ? "utf8mb4" : "utf8"), -1, null, DEFAULT_RESULT_SET_TYPE,
                                            DEFAULT_RESULT_SET_CONCURRENCY, false, this.database, null, false);
                                    this.serverVariables.put("character_set_client", useutf8mb4 ? "utf8mb4" : "utf8");
                                    this.serverVariables.put("character_set_connection", useutf8mb4 ? "utf8mb4" : "utf8");
                                }
                            } else {
                                ..............
                            }

During handshake the default should be set and then either overwritten with the database character set OR the useutf8mb4 should be set using the database character set index and not the this.io.serverCharsetIndex.
[29 Apr 2016 11:50] Chiranjeevi Battula
Hello  Jason Hunter,

Thank you for the bug report.
I could not repeat the issue at our end using with "SET NAMES"  and Connector / J 5.1.38. 
Please see attached sample test case.

Thanks,
Chiranjeevi.
[29 Apr 2016 11:51] Chiranjeevi Battula
test case

Attachment: 81196.java (text/x-java), 1.01 KiB.

[29 Apr 2016 17:02] Jason Hunter
We are using the "org.apache.commons.dbcp.BasicDataSource" class for our DB connection pool which has a method to execute a statement after connection, but adding the "SET NAMES" call had no effect. The warning at the bottom of https://dev.mysql.com/doc/connector-j/5.1/en/connector-j-reference-charsets.html says not to use the SET NAMES on the connector, and adding a SET NAMES request to execute before any database calls is not a scalable solution.

The line "boolean useutf8mb4 = utf8mb4Supported && (CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex));" in ConnectionImpl.java (line ~1767) uses the default server charset to determine whether utf8mb4 should be used. So it won't matter what the column, table, or database is setup to support utf8mb4 if the server is not set to also support it. If a connection is to a particular DB then it should default to the charset supported by that DB and not the global server charset. 

If I add com.mysql.jdbc.faultInjection.serverCharsetIndex=224 to myt jdbc url then everything works fine because of this code for testing in ConnectionImpl.java (line ~1702):
if (this.props != null && this.props.getProperty("com.mysql.jdbc.faultInjection.serverCharsetIndex") != null) {
    this.io.serverCharsetIndex = Integer.parseInt(this.props.getProperty("com.mysql.jdbc.faultInjection.serverCharsetIndex"));
} 

Is there a way to get the charset on a database that could be added to the doHandshake method of MysqlIO.java that could supplement the serverCharsetIndex (like dbCharsetIndex) and then used in the ConnectionImpl to determine whether to support utf8mb4.
[29 Apr 2016 18:27] Mark Matthews
It appears more that there's some bug detecting whether UTF8MB4 can be used by the connection. It's not workable for the driver to automatically detect per-database character sets because there are many cases where there are multiple schemas referenced in the same SQL statement. The correct workaround is to do what you've stated, which is force the encoding with "characterEncoding=". If that's not working because UTF8MB4 can't be detected in your case, then that is the actual bug.
[29 Apr 2016 19:02] Jason Hunter
So the current logic assumes that even though the jdbc url is requesting utf-8 since the server isn't set to a character set that supports the utf8mb4 and since an SQL statement can reference multiple schemas then it will use utf8 instead of utf8mb4. Debugging through the driver code it correctly detects that we want utf8mb4 (given the db version and the characterEncoding=utf8) but then pushes us to non-mb4 cause the server doesn't globally support mb4 (in our case latin1). 

Hence it would seem the only workaround for the situation where the DB and server character encodings are not "compatible" would be to either add a characterEncoding type of utf8mb4 which would force the encoding no matter the server encoding or add some other parameter that would force using utf8mb4.
[4 May 2016 12:11] Alexander Soklakov
Hi Jason,

We can't change the current logic of forcing the encoding with "characterEncoding=UTF-8" to utf8mb4 because we could affect many existing applications. But your suggestion about additional parameter is correct, we need a way to set utf8bm4 explicitly and "characterEncoding" should accept only Java encoding names.
Actually, we have a good candidate, the "connectionCollation" option you already tried, so the patch could look like:
===
diff --git a/src/com/mysql/jdbc/ConnectionImpl.java b/src/com/mysql/jdbc/ConnectionImpl.java
index 9da30ea..854ae59 100644
--- a/src/com/mysql/jdbc/ConnectionImpl.java
+++ b/src/com/mysql/jdbc/ConnectionImpl.java
@@ -1762,7 +1762,8 @@
                             // charset names are case-sensitive
 
                             boolean utf8mb4Supported = versionMeetsMinimum(5, 5, 2);
-                            boolean useutf8mb4 = utf8mb4Supported && (CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex));
+                            boolean useutf8mb4 = utf8mb4Supported && (CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex)
+                                    || (getConnectionCollation() != null && StringUtils.startsWithIgnoreCase(getConnectionCollation(), "utf8mb4")));
 
                             if (!getUseOldUTF8Behavior()) {
                                 if (dontCheckServerMatch || !characterSetNamesMatches("utf8") || (utf8mb4Supported && !characterSetNamesMatches("utf8mb4"))) {
===
I.e. with this change we should specify both characterEncoding=utf8 and connectionCollation=utf8mb4_unicode_ci to have SET NAMES issued.
[4 May 2016 13:35] Chiranjeevi Battula
http://bugs.mysql.com/bug.php?id=76889 marked as duplicate of this one.
[16 May 2016 4:47] monty solomon
We experienced a similar issue when the server character set was set to utf8.

We updated the settings in my.cnf to use utf8mb4.

[client]
loose-default-character-set = utf8mb4

[mysqld]
character-set-server = utf8mb4
[20 Dec 2016 14:13] Oliver Behncke
We witness the same problem. The database charset utf8mb4 is ignored in a jdbc connection.

Is there any news on this topic?
[14 Jun 2018 13:26] Eivind Larsen
Are there any news on this?

I seem unable to give any hint to the client that it should use utf8mb4 over utf8 even when the database (scheme) and all the tables are set to utf8.

The SET NAMES workaround proposed here does not work for the jdbc driver as explained in the docs:

> Warning
> Do not issue the query set names with Connector/J, as the driver will not detect that the 
> character set has changed, and will continue to use the character set detected during the 
> initial connection setup.
[14 Jun 2018 13:27] Eivind Larsen
> even when the database (scheme) and all the tables are set to utf8.

I of course meant 'even when the database (scheme) and all the tables are set to utf8mb', not utf8.
[9 Aug 2018 14:50] Daniel So
Posted by developer:
 
Added the following entry to the Connector/J 5.1.47 and 8.0.13 changelogs:

"The value UTF-8 for the connection property characterEncoding now maps to the utf8mb4 character set on the server and, for MySQL Server 5.5.2 and later, characterEncoding=UTF-8 can now be used to set the connection character set to utf8mb4 even if character_set_server has been set to something else on the server. (Before this change, the server must have character_set_server=utf8mb4 for Connector/J to use that character set.)

Also, if the connection property connectionCollation is also set and is incompatible with the value of characterEncoding, characterEncoding will be overridden with the encoding corresponding to connectionCollation.

See Using Character Sets and Unicode for details, including how to use the utf8mb3 character set now for connection."
[27 Sep 2018 23:41] Tianshi Wang
When will 8.0.13 be released?