Bug #111818 An improved idea to avoid writing garbled code in MySQL
Submitted: 19 Jul 2023 15:12 Modified: 21 Jul 2023 2:51
Reporter: tao jiang Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:all OS:CentOS (CentOS Linux release 7.9.2009 (AltArch))
Assigned to: CPU Architecture:Any
Tags: charset, garbled code

[19 Jul 2023 15:12] tao jiang
Description:
When I was doing the following test:

mysql> show variables like '%character%';
+--------------------------+-----------------------+
| Variable_name            | Value                 |
+--------------------------+-----------------------+
| character_set_client     | gbk                   |
| character_set_connection | gbk                   |
| character_set_database   | utf8mb4               |
| character_set_filesystem | binary                |
| character_set_results    | utf8mb4               |
| character_set_server     | utf8mb4               |
| character_set_system     | utf8                  |
| character_sets_dir       | /data/share/charsets/ |
+--------------------------+-----------------------+
8 rows in set (0.01 sec)

mysql> show create table tt1\G
*************************** 1. row ***************************
       Table: tt1
Create Table: CREATE TABLE `tt1` (
  `id` int(11) DEFAULT NULL,
  `name` varchar(10) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8
1 row in set (0.00 sec)

mysql> insert into tt1 values(1,'天');
ERROR 1406 (22001): Data too long for column 'name' at row 1

The '天' is a Chinese character. That column of name is varchar(10),and I just insert one character,it say data too long.

Then I looked at the source code,I found that when character_set_client=gbk and character_set_connection=gbk but the client is a utf8 character set, byte loss occurred while writing data. MySQL believes that there has been an important data loss, so the data is too long.

Then I just set character_set_connection to utf8mb4. Then execute the above insert command and find that the command is successfully executed(Obviously, the data stored in the name column is garbled at this point). The presentation is as follows:

mysql> set character_set_connection=utf8mb4;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into tt1 values(2,'天');
Query OK, 1 row affected (0.00 sec)

mysql> select * from tt1;
+------+------+
| id   | name |
+------+------+
|    2 | 澶   |
+------+------+
1 row in set (0.00 sec)

The insert command is successful because the byte loss does not occur when storing data in the table, but when it is converted to character_set_connection (utf8mb4) encoded byte sequences.

Therefore, I feel that the MySQL code logic can be optimized: determine that it is in the premise of writing data, as long as byte loss occurs to return an error, so as to avoid writing garbled data when the character set is not set correctly?

Looking forward to your reply, thank you.

How to repeat:
set character_set_client = gbk;

set character_set_connection = gbk;

CREATE TABLE `tt1` (  `id` int(11) DEFAULT NULL,  `name` varchar(10) DEFAULT NULL) character set utf8;

insert into tt1 values(2,'天');

tips:
The client character set should be utf8, so you can access the database through mysql client in Linux OS.

Suggested fix:
When writing data, whenever byte loss occurs, the error message that the write failed is returned. In this way, to avoid the occurrence of data storage garbled characters
[19 Jul 2023 15:16] tao jiang
I think this will help improve MySQL's ease of use.
[20 Jul 2023 7:17] tao jiang
Is anybody there?
[20 Jul 2023 11:44] MySQL Verification Team
Hi Mr. jiang,

Thank you for your bug report.

However, we can not accept your idea.

Our entire character set interface is designed according to SQL standard. All the character sets and their place holders, like the one that you mention:

| character_set_client     | gbk                   |
| character_set_connection | gbk                   |
| character_set_database   | utf8mb4               |
| character_set_filesystem | binary                |
| character_set_results    | utf8mb4               |
| character_set_server     | utf8mb4               |
| character_set_system     | utf8                  |

are defined by a standard that was followed to the last letter. Hence, this can not be changed.

Not a bug.
[21 Jul 2023 2:51] tao jiang
Hi, team. Thank you for your reply.
It is not a bug exactly. Write at bugs.mysql.com is a little inappropriate. It should be said that this is actually a suggestion for improvement.

I still think that as long as byte loss occurs during a write operation, you should return an error, which largely avoids garbled codes.
So the inability to return an error is also due to compliance with SQL standards?
Looking forward to your reply!
[21 Jul 2023 12:20] MySQL Verification Team
Hi Mr. jiang,

Yes, number of bytes for each character set is also a part of the standard.

That is why there are graphenes out there that exist in utfmb4, but not in utfmb3, for example.

Each character set is defined by a number of attributes, one of which is the maximum length of each character.