Bug #76889 Setting utf8mb4 character encoding
Submitted: 29 Apr 2015 14:57 Modified: 24 Jan 2022 14:49
Reporter: Vyacheslav Gerasymenko
Status: Can't repeat
Category:Connector / J Severity:S4 (Feature request)
Version:5.1.34 OS:Any
Assigned to: Filipe Silva CPU Architecture:Any
Tags: character set, utf8mb4

[29 Apr 2015 14:57] Vyacheslav Gerasymenko
Description:
As stated here:

http://dev.mysql.com/doc/connector-j/en/connector-j-reference-charsets.html

The character encoding between client and server is automatically detected by the driver upon connection, and it must be set on the server using the character_set_server system variable.

For example, to use 4-byte UTF-8 character sets with Connector/J, MySQL server must be configured with character_set_server=utf8mb4, characterEncoding must not be set and Connector/J will then autodetect the UTF-8 setting.

This is great if you have direct access to the configuration files of the MySQL server. But if you use, for example, the Amazon RDS service with the MySQL engine, you don’t have such access, and you can’t set the character_set_server system variable, whose default value is ‘latin1’.

Setting characterEncoding=utf-8 and useUnicode=yes results in:

character_set_client - utf8
character_set_connection - utf8

This is correct according to the documentation and the logic of the configureClientCharacterSet method of the ConnectionImpl class.

But what should I do if I need to use the utf8mb4 encoding, which is set on the database, its tables and columns, and I can’t change server system variables?

How to repeat:
Perform any insert/update operation on any table with utf8mb4 encoding, without setting character_set_server to utf8mb4, using, for example, some emoji characters like this (not sure if they will be processed and displayed correctly here):

Suggested fix:
Add a connection property that forces the use of the utf8mb4 encoding on the client connection regardless of the server character set (provided, of course, that the server supports this encoding, i.e. is at least version 5.5.2 according to the code in the configureClientCharacterSet method of the ConnectionImpl class).

This additional property should be checked in the following code fragment of the configureClientCharacterSet method of the ConnectionImpl class:

boolean utf8mb4Supported = versionMeetsMinimum(5, 5, 2);
boolean useutf8mb4 = utf8mb4Supported && (CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex));

if (!getUseOldUTF8Behavior()) {
    if (dontCheckServerMatch || !characterSetNamesMatches("utf8") || (utf8mb4Supported && !characterSetNamesMatches("utf8mb4"))) {
        execSQL(null, "SET NAMES " + (useutf8mb4 ? "utf8mb4" : "utf8"), -1, null, DEFAULT_RESULT_SET_TYPE,
                DEFAULT_RESULT_SET_CONCURRENCY, false, this.database, null, false);
    }
} else {
    execSQL(null, "SET NAMES latin1", -1, null, DEFAULT_RESULT_SET_TYPE, DEFAULT_RESULT_SET_CONCURRENCY, false, this.database,
            null, false);
}

Something like this:

boolean useutf8mb4 = utf8mb4Supported && (getForceUtf8mb4() || CharsetMapping.UTF8MB4_INDEXES.contains(this.io.serverCharsetIndex));

Here the getForceUtf8mb4 method checks whether a new “forceUtf8mb4” connection property is set to “true”. Of course, this is just a suggestion.

Alternatively, allow utf8mb4 to be set in the characterEncoding property. However, this seems to be impossible, since this property must contain a Java-style character encoding name, which in this case is simply UTF-8 and which maps to the utf8 MySQL character set name, not to utf8mb4.
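
The suggested change can be sketched in isolation as a small decision function. This is only an illustration of the proposal, not Connector/J code: the class name SetNamesChoice, the method setNamesStatement and its boolean parameters are hypothetical stand-ins for the driver's internal state, and the sketch models only which statement would be sent, not the condition under which the driver decides to send it.

```java
// Hypothetical, standalone model of the proposed logic (not part of Connector/J).
final class SetNamesChoice {

    /**
     * Returns the SET NAMES statement the proposed logic would choose.
     *
     * @param useOldUtf8Behavior     stands in for getUseOldUTF8Behavior()
     * @param utf8mb4Supported       stands in for versionMeetsMinimum(5, 5, 2)
     * @param forceUtf8mb4           stands in for the proposed "forceUtf8mb4" property
     * @param serverCharsetIsUtf8mb4 stands in for the UTF8MB4_INDEXES lookup
     */
    static String setNamesStatement(boolean useOldUtf8Behavior, boolean utf8mb4Supported,
                                    boolean forceUtf8mb4, boolean serverCharsetIsUtf8mb4) {
        if (useOldUtf8Behavior) {
            return "SET NAMES latin1";
        }
        // The proposed property can only widen the choice when the server supports utf8mb4.
        boolean useUtf8mb4 = utf8mb4Supported && (forceUtf8mb4 || serverCharsetIsUtf8mb4);
        return "SET NAMES " + (useUtf8mb4 ? "utf8mb4" : "utf8");
    }

    public static void main(String[] args) {
        // Server default is latin1, but the property is set and the server is new enough.
        System.out.println(setNamesStatement(false, true, true, false));
    }
}
```

With the property unset, the function returns the same statement as the existing expression, so the proposal would not change behavior for current users.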
[29 Apr 2015 15:01] Vyacheslav Gerasymenko
The "How to repeat" block was truncated, since this engine failed to process the sample emoji characters (support for the full UTF-8 character set is required to process them). I copied them here:

http://pastebin.com/pmrFPR2X
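
For readers without access to the paste, here is a minimal Java sketch of why such characters need utf8mb4. The sample character (U+1F600) is an assumption chosen for illustration and is not necessarily one of the characters in the paste; the class and method names are hypothetical. MySQL's utf8 character set stores at most 3 bytes per character, while characters outside the Basic Multilingual Plane take 4 bytes in UTF-8.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only.
final class Utf8mb4Check {

    /** True if the string contains a code point above U+FFFF, which takes 4 bytes in UTF-8. */
    static boolean needsUtf8mb4(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    /** Number of bytes the string occupies when encoded as UTF-8. */
    static int utf8Length(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00"; // U+1F600 written as a surrogate pair (assumed sample)
        System.out.println(utf8Length(emoji));   // 4
        System.out.println(needsUtf8mb4(emoji)); // true
    }
}
```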
[5 May 2015 18:53] Filipe Silva
Hi Vyacheslav,

Thank you for this feature request. We are analyzing its viability.
[26 Jun 2015 16:14] Filipe Silva
Hi,

Your request is perfectly acceptable. You can, however, achieve the same results by issuing a "SET NAMES utf8mb4" right after establishing the connection. Please let us know if this works for you at the moment.

Thank you,
[2 Jul 2015 16:57] Vyacheslav Gerasymenko
Hi!

Yes, the same result can be achieved by separately executing "SET NAMES utf8mb4" after the connection is established, and yes, it works for me, but this approach has two drawbacks:

1. Some performance degradation, since every connection to the MySQL server executes two initialization statements: "SET NAMES utf8" (in the configureClientCharacterSet method of the ConnectionImpl class) and "SET NAMES utf8mb4" (executed manually after the connection is established).

2. Manually executing the "SET NAMES utf8mb4" statement after opening a connection and before every regular statement (select, update, insert) that works with UTF-8 strings is not a very usable approach.

So it would be handier and more efficient to set the utf8mb4 character set globally, via an additional connection property that can be configured just once in code, config files or the JDBC resource of a Java EE app server, by a programmer or administrator.
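
For reference, the manual workaround discussed above can be sketched with plain JDBC as follows. The class Utf8mb4Session and its methods are hypothetical names, and the URL and credentials are supplied by the caller. This is only a sketch of the workaround, not a recommendation: the Connector/J documentation quoted elsewhere in this report warns against issuing SET NAMES, because the driver does not detect the change.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper that centralizes the extra statement in one place.
final class Utf8mb4Session {

    /** Issues the extra initialization statement on an already open connection. */
    static void applyUtf8mb4(Connection conn) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.execute("SET NAMES utf8mb4");
        }
    }

    /** Opens a connection and switches it to utf8mb4 once, right after it is established. */
    static Connection open(String url, String user, String password) throws SQLException {
        Connection conn = DriverManager.getConnection(url, user, password);
        try {
            applyUtf8mb4(conn);
            return conn;
        } catch (SQLException e) {
            conn.close(); // do not leak a half-initialized connection
            throw e;
        }
    }
}
```

Even when wrapped like this, each new connection still pays for the second round trip described in point 1, and connections obtained from an application server's pool bypass the helper unless the pool is configured to run it.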
[1 Dec 2015 11:35] Bora Erbas
Hi Filipe,

I am not sure if "SET NAMES UTF8MB4" would work.
The MySQL Connector/J documentation explicitly states the following:

https://dev.mysql.com/doc/connector-j/en/connector-j-reference-charsets.html
Warning
Do not issue the query set names with Connector/J, as the driver will not detect that the character set has changed, and will continue to use the character set detected during the initial connection setup.

So if the SET NAMES call works as you suggested, is the above documentation inaccurate?
Or am I missing something? 

I tested this and here is what I get. 
I need to be able to insert emoticons into a MySQL table, so I need utf8mb4. I am using EclipseLink (JPA), by the way.
When I run a "SET NAMES UTF8MB4" query after getting the connection, it still does not work if character_set_server is currently set to latin1 on the server.
But if character_set_server is currently set to utf8 (not utf8mb4), it works.

Any pointers are appreciated.

Regards,
Bora.
[4 May 2016 13:34] Chiranjeevi Battula
Marking as duplicate of Bug#81196
[24 Jan 2022 14:49] Alexander Soklakov
Posted by developer:
 
Connector/J 5.1 series came to EOL on Feb 9th, 2021, see https://www.mysql.com/support/eol-notice.html, so this bug will not be fixed there.

Character set support was significantly reworked in Connector/J 8.0: you can use "connectionCollation" instead of "characterEncoding" to select the utf8 or utf8mb4 connection charset. Please check the documentation: https://dev.mysql.com/doc/connector-j/8.0/en/connector-j-reference-charsets.html
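
As a rough illustration of the "connectionCollation" approach, a JDBC URL could be built along these lines. The class and method names, the host, port and database, and the choice of the utf8mb4_unicode_ci collation are all assumptions made for the example; check the linked documentation for the exact behavior in your driver version.

```java
// Hypothetical helper for illustration only.
final class Utf8mb4Url {

    /** Builds a Connector/J URL that requests the given collation for the connection. */
    static String jdbcUrl(String host, int port, String database, String collation) {
        return "jdbc:mysql://" + host + ":" + port + "/" + database
                + "?connectionCollation=" + collation;
    }

    public static void main(String[] args) {
        // Placeholder host and database; utf8mb4_unicode_ci is one of the utf8mb4 collations.
        System.out.println(jdbcUrl("db.example.com", 3306, "app", "utf8mb4_unicode_ci"));
    }
}
```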