Bug #100534 | MYSQL_FIELD length different when MySQL client and server version do not match | ||
---|---|---|---|
Submitted: | 14 Aug 2020 15:59 | Modified: | 18 Aug 2020 7:49 |
Reporter: | Sebastien FLAESCH | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: C API (client library) | Severity: | S4 (Feature request) |
Version: | 8.0.19 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[14 Aug 2020 15:59]
Sebastien FLAESCH
[17 Aug 2020 5:15]
Erlend Dahl
Are you not seeing the problem described here: https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-error-h... in the paragraph starting "The same problem occurs in a more subtle context..."? Since you use an 8.0 client, it tries to impose the character set + collation as utf8mb4 + utf8mb4_0900_ai_ci. However the latter isn't recognized by the 5.7 server (the collation doesn't exist there) and its behaviour is to fall back to the default latin1. I tried an 8.0 client against a 5.7 server with --default-character-set=utf8mb4 and I get mysql> show variables like "character_set_client"; +----------------------+--------+ | Variable_name | Value | +----------------------+--------+ | character_set_client | latin1 | +----------------------+--------+ 1 row in set (0.00 sec) as described in the manual.
[17 Aug 2020 7:16]
Sebastien FLAESCH
Hi Erlend, Thanks a lot this doc link. I wish someone showed me that in the initial bug report (https://bugs.mysql.com/bug.php?id=100401) Since the behavior is documented, I accept it... I totally understand that a server may not satisfy some client charset, especially when combining different C/S versions. However, I would have expected a connection error instead of "silently" falling back to the server charset. If a client app is requiring UTF-8, it's to use UTF-8. I suppose this behavior exists for some good reasons, but is a bit risky to me. How are client apps supposed to cope with this? Since there is no SQL error, is there some diagnostic info that the client can check, to make sure that the server has selected the required client charset? (If it exists, I could not read that in the doc section you provided...) Otherwise, each client app would have to verify that the selected charset is the one it expects, by executing a SHOW SESSION VARIABLES LIKE 'character\_set\_%' ... right? And if it's not the requested charset, is the client app supposed to adapt/limit the possible characters/languages to the selected charset? ... and warn the end user "Sorry, but you can only use latin1 characters with this database server" ...
[17 Aug 2020 17:46]
Erlend Dahl
I discussed this with Bernt Johnsen (who works on QA of new character set features). We don't think the problem can be that easily dismissed. He pointed out the following example. Put this SQL into a file, making sure it has UTF-8 encoding: create table foo(v varchar(20) character set utf8mb4); insert into foo values('語'); select hex(v) from foo; select * from foo; If you source this file in a mysql 8.0 client started with --default-character-set=utf8mb4, against an 8.0 server, the last two statements will return: +--------+ | hex(v) | +--------+ | E8AA9E | +--------+ 1 row in set (0.00 sec) +------+ | v | +------+ | 語 | +------+ 1 row in set (0.00 sec) which is fine, as E8AA9E is the utf8mb4 representation of Unicode U+8A9E or '語'. Now try exactly the same thing, but with the 8.0 client against a 5.7 server. The server will fall back to latin1, as previously said, and you will get +--------------+ | hex(v) | +--------------+ | C3A8C2AAC5BE | +--------------+ 1 row in set (0.00 sec) +------+ | v | +------+ | 語 | +------+ 1 row in set (0.00 sec) Hence we insert garbage, but the conversion back to the client "works", so the user might be fooled into believing that everything is correct. If we issue "SET NAMES utf8", the garbage comes to light, since SELECT * will return: +--------+ | v | +--------+ | 語 | +--------+ 1 row in set (0.00 sec) While the behaviour might be in line with what the manual states, it's hardly very logical or user-friendly. Rejecting the connection might be a better idea. Bernt has written a useful introduction to Unicode and Collations in 8.0: https://www.slideshare.net/BerntMariusJohnsen/unicode-and-collations-in-mysql-80 What we see in this example is basically the same as what he descibes on slide 25/26.
[18 Aug 2020 7:07]
Sebastien FLAESCH
Hello Erlend! Glad to read that you guys have considered my suggestion. Thanks for the test and the link, I will have a look. Seb
[18 Aug 2020 7:11]
Erlend Dahl
Verified as a usability improvement. Not sure if it's doable in the current releases though.
[18 Aug 2020 7:49]
Sebastien FLAESCH
Now the big question: How to configure a MySQL Client 8.0 charset for UTF-8 in .my.cnf settings, to connect to a MySQL 5.7 server using UTF-8? Today it can connect, fallback to latin1 and then you can do a "SET NAMES utf8" If the connection would now raise an error, you would have to configure the client for latin1, then connect, and do a "SET NAMES utf8"... right? Would that be the proper way to solve to charset/collation incompatibilities? Doc says: The default collation for utf8mb4 differs between MySQL 5.7 and 8.0 (utf8mb4_general_ci for 5.7, utf8mb4_0900_ai_ci for 8.0) Why does 8.0 no longer support 5.7 charset/collations? sounds like a backward incompatibility... Would there not be a smarter solution? Have some alias/mapping for collations? What does actually happen when using "SET NAMES utf8"? I have then tested with: [client] default-character-set="utf8" And that works... What is "utf8" versus "utf8mb4" versus "utf8mb4_general_ci" ? I know mb4 is max byte length of 4, general_ci is the collation, but what maxbl and collation rules do we get with "utf8" then? I need to teach myself a bit more on MySQL charsets and collations... Seb
[18 Aug 2020 8:42]
Erlend Dahl
No collations or character sets were removed in 8.0; utf8mb4_general_ci is still supported, it's just not the default any more for utf8mb4. SET NAMES <character set> is shorthand for setting three system variables: https://dev.mysql.com/doc/refman/8.0/en/set-names.html "utf8" is currently a synonym for utf8mb3 (yes, "3"). I should have said "SET NAMES utf8mb4" in my example (it doesn't make a material difference in that case though). See https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html for deprecation info concerning utf8mb3. I suppose one way of making sure it works on both 5.7 and 8.0 is to use the same character set + collation on both versions, like utf8mb4 + utf8mb4_general_ci.