MySQL Bugs: #100534: MYSQL_FIELD length different when MySQL client and server version do not match

Bug #100534	MYSQL_FIELD length different when MySQL client and server version do not match
Submitted:	14 Aug 2020 15:59	Modified:	18 Aug 2020 7:49
Reporter:	Sebastien FLAESCH	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: C API (client library)	Severity:	S4 (Feature request)
Version:	8.0.19	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
This is a new bug report, because I disagree with the duplicated state of

     https://bugs.mysql.com/bug.php?id=100511

Sorry to insist, but I have commented in 100511, but I see no feedback, so I create a new report here... I wish I could re-open tickets closed by the MySQL verification team, but have no permission for that.

SO PLEASE REOPEN 100511, MAKE TESTS, AND PROVE ME THAT I AM WRONG AND THAT YOU GET THE SAME LENGTH WITH THE CLIENT 8.0 + SERVER 5.7 COMBINATION!

The issue can be closed in 3 ways:

A) You cannot reproduce with the Client 8.0 / Server 5.7 combination, so I must have something wrong on my system, I will accept this after seeing your test results.

B) You can reproduce what I see, and this is an expected behavior: Then the documentation should clarify why we get a different length when C/S versions do not match.

C) You can reproduce, and this is an unexpected behavior: It's a MySQL bug, to be fixed, as I suggest since the beginning (https://bugs.mysql.com/bug.php?id=100401).

How to repeat:
Check https://bugs.mysql.com/bug.php?id=100511

PLEASE TRY TO REPRODUCE!

Suggested fix:
To MySQL verification team:

Properly understand, then try to reproduce what customers report, before closing bug reports...

Are you not seeing the problem described here:

https://dev.mysql.com/doc/refman/8.0/en/charset-connection.html#charset-connection-error-h...

in the paragraph starting "The same problem occurs in a more subtle context..."?

Since you use an 8.0 client, it tries to impose the character set + collation as utf8mb4 + utf8mb4_0900_ai_ci. However the latter isn't recognized by the 5.7 server (the collation doesn't exist there) and its behaviour is to fall back to the default latin1.

I tried an 8.0 client against a 5.7 server with --default-character-set=utf8mb4 
and I get

mysql> show variables like "character_set_client";
+----------------------+--------+
| Variable_name        | Value  |
+----------------------+--------+
| character_set_client | latin1 |
+----------------------+--------+
1 row in set (0.00 sec)

as described in the manual.

Hi Erlend,

Thanks a lot this doc link. I wish someone showed me that in the initial bug report (https://bugs.mysql.com/bug.php?id=100401)

Since the behavior is documented, I accept it...

I totally understand that a server may not satisfy some client charset, especially when combining different C/S versions.

However, I would have expected a connection error instead of "silently" falling back to the server charset. If a client app is requiring UTF-8, it's to use UTF-8.

I suppose this behavior exists for some good reasons, but is a bit risky to me.

How are client apps supposed to cope with this? 

Since there is no SQL error, is there some diagnostic info that the client can check, to make sure that the server has selected the required client charset?
(If it exists, I could not read that in the doc section you provided...)

Otherwise, each client app would have to verify that the selected charset is the one it expects, by executing a SHOW SESSION VARIABLES LIKE 'character\_set\_%' ... right?

And if it's not the requested charset, is the client app supposed to adapt/limit the possible characters/languages to the selected charset? ... and warn the end user "Sorry, but you can only use latin1 characters with this database server" ...

I discussed this with Bernt Johnsen (who works on QA of new character set features). We don't think the problem can be that easily dismissed.

He pointed out the following example.  Put this SQL into a file, making sure it has UTF-8 encoding:

create table foo(v varchar(20) character set utf8mb4);
insert into foo values('語');
select hex(v) from foo;
select * from foo;

If you source this file in a mysql 8.0 client started with --default-character-set=utf8mb4, against an 8.0 server, the last two statements will return:

+--------+
| hex(v) |
+--------+
| E8AA9E |
+--------+
1 row in set (0.00 sec)

+------+
| v    |
+------+
| 語   |
+------+
1 row in set (0.00 sec)

which is fine, as E8AA9E is the utf8mb4 representation of Unicode U+8A9E or '語'.

Now try exactly the same thing, but with the 8.0 client against a 5.7 server. The server will fall back to latin1, as previously said, and you will get

+--------------+
| hex(v)       |
+--------------+
| C3A8C2AAC5BE |
+--------------+
1 row in set (0.00 sec)

+------+
| v    |
+------+
| 語   |
+------+
1 row in set (0.00 sec)

Hence we insert garbage, but the conversion back to the client "works", so the user might be fooled into believing that everything is correct.

If we issue "SET NAMES utf8", the garbage comes to light, since SELECT * will return:

+--------+
| v      |
+--------+
| èªž    |
+--------+
1 row in set (0.00 sec)

While the behaviour might be in line with what the manual states, it's hardly very logical or user-friendly. Rejecting the connection might be a better idea.

Bernt has written a useful introduction to Unicode and Collations in 8.0:

https://www.slideshare.net/BerntMariusJohnsen/unicode-and-collations-in-mysql-80

What we see in this example is basically the same as what he descibes on slide 25/26.

Hello Erlend!

Glad to read that you guys have considered my suggestion.

Thanks for the test and the link, I will have a look.

Seb

Verified as a usability improvement. Not sure if it's doable in the current releases though.

Now the big question:

How to configure a MySQL Client 8.0 charset for UTF-8 in .my.cnf settings, to connect to a MySQL 5.7 server using UTF-8?

Today it can connect, fallback to latin1 and then you can do a "SET NAMES utf8"

If the connection would now raise an error, you would have to configure the client for latin1, then connect, and do a "SET NAMES utf8"... right?

Would that be the proper way to solve to charset/collation incompatibilities?

Doc says: The default collation for utf8mb4 differs between MySQL 5.7 and 8.0 (utf8mb4_general_ci for 5.7, utf8mb4_0900_ai_ci for 8.0)

Why does 8.0 no longer support 5.7 charset/collations? sounds like a backward incompatibility...

Would there not be a smarter solution? Have some alias/mapping for collations?

What does actually happen when using "SET NAMES utf8"?

I have then tested with:

[client]
default-character-set="utf8"

And that works...

What is "utf8" versus "utf8mb4" versus "utf8mb4_general_ci" ?

I know mb4 is max byte length of 4, general_ci is the collation, but what maxbl and collation rules do we get with "utf8" then?

I need to teach myself a bit more on MySQL charsets and collations...

Seb

No collations or character sets were removed in 8.0; utf8mb4_general_ci is still supported, it's just not the default any more for utf8mb4.

SET NAMES <character set> is shorthand for setting three system variables:

https://dev.mysql.com/doc/refman/8.0/en/set-names.html

"utf8" is currently a synonym for utf8mb3 (yes, "3"). I should have said "SET NAMES utf8mb4" in my example (it doesn't make a material difference in that case though).

See 

https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

for deprecation info concerning utf8mb3.

I suppose one way of making sure it works on both 5.7 and 8.0 is to use the same character set + collation on both versions, like utf8mb4 + utf8mb4_general_ci.