Bug #115824 | The same LOAD DATA 5.7 and 8.0 result in different | ||
---|---|---|---|
Submitted: | 13 Aug 2024 7:36 | Modified: | 19 Aug 2024 2:04 |
Reporter: | chong zhang | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[13 Aug 2024 7:36]
chong zhang
[13 Aug 2024 7:39]
chong zhang
LOAD DATA LOCAL INFILE
Attachment: load data.zip (application/x-zip-compressed, text), 626 bytes.
[14 Aug 2024 9:32]
MySQL Verification Team
Hi Mr. zhang, This is actually not a bug. You have simply used a wrong collation. utfmb4_general_ci is not recommended for this kind of operations. Use instead utf8mb4_0900_ai_ci. How to set it is described in our Reference Manual. Also, if that does not work, try 8.0.39.
[15 Aug 2024 1:53]
chong zhang
mysql> select version(); +-----------+ | version() | +-----------+ | 8.0.33 | +-----------+ 1 row in set (0.00 sec) mysql> show variables like '%cha%'; +--------------------------------------+--------------------------------+ | Variable_name | Value | +--------------------------------------+--------------------------------+ | character_set_client | utf8mb4 | | character_set_connection | utf8mb4 | | character_set_database | utf8mb4 | | character_set_filesystem | binary | | character_set_results | utf8mb4 | | character_set_server | utf8mb4 | | character_set_system | utf8mb3 | | character_sets_dir | /opt/mysql3307/share/charsets/ | | innodb_change_buffer_max_size | 25 | | innodb_change_buffering | all | | session_track_state_change | OFF | | validate_password_special_char_count | 1 | +--------------------------------------+--------------------------------+ 12 rows in set (0.00 sec) mysql> show variables like '%coll%'; +-------------------------------+--------------------+ | Variable_name | Value | +-------------------------------+--------------------+ | collation_connection | utf8mb4_0900_ai_ci | | collation_database | utf8mb4_0900_ai_ci | | collation_server | utf8mb4_0900_ai_ci | | default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci | +-------------------------------+--------------------+ 4 rows in set (0.00 sec) mysql> show create table assp_sis_payres_imp_bak \G *************************** 1. row *************************** Table: assp_sis_payres_imp_bak Create Table: CREATE TABLE `assp_sis_payres_imp_bak` ( `AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列', `D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL, PRIMARY KEY (`AUTO_INC`) ) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec) mysql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),''); Query OK, 2 rows affected (27.38 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> mysql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),''); Query OK, 2 rows affected (0.00 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> mysql> select * from ASSP_SIS_PAYRES_IMP_BAK; +----------+--------------------------------------------------------------------------+ | AUTO_INC | D_NAME | +----------+--------------------------------------------------------------------------+ | 7 | 台州市路桥区社会保险事业管理中心 | | 8 | 台州市路桥区社会保险事业管理中心 | | 10 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績 | | 11 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績 | +----------+--------------------------------------------------------------------------+ 4 rows in set (0.00 sec) I don't think we should limit the collation to utf8mb4_0900_ai_ci ,Because utfmb4_gener_ci also supports its use and has not been abandoned. now, in 8.0.33 . i try the collation to utf8mb4_0900_ai_ci , There are also garbled errors. Isn't this a bug ?
[15 Aug 2024 6:40]
chong zhang
hi, I tried 8.0.39 but also encountered garbled text.Expected response! mysql> select version(); +-----------+ | version() | +-----------+ | 8.0.39 | +-----------+ 1 row in set (0.00 sec) mysql> show variables like '%char%'; +--------------------------------------+--------------------------------+ | Variable_name | Value | +--------------------------------------+--------------------------------+ | character_set_client | utf8mb4 | | character_set_connection | utf8mb4 | | character_set_database | utf8mb4 | | character_set_filesystem | binary | | character_set_results | utf8mb4 | | character_set_server | utf8mb4 | | character_set_system | utf8mb3 | | character_sets_dir | /opt/mysql3309/share/charsets/ | | validate_password_special_char_count | 1 | +--------------------------------------+--------------------------------+ 9 rows in set (0.01 sec) mysql> mysql> show variables like '%coll%'; +-------------------------------+--------------------+ | Variable_name | Value | +-------------------------------+--------------------+ | collation_connection | utf8mb4_0900_ai_ci | | collation_database | utf8mb4_0900_ai_ci | | collation_server | utf8mb4_0900_ai_ci | | default_collation_for_utf8mb4 | utf8mb4_0900_ai_ci | +-------------------------------+--------------------+ 4 rows in set (0.00 sec) mysql> show create table assp_sis_payres_imp_bak\G *************************** 1. row *************************** Table: assp_sis_payres_imp_bak Create Table: CREATE TABLE `assp_sis_payres_imp_bak` ( `AUTO_INC` bigint unsigned NOT NULL AUTO_INCREMENT COMMENT '自增列', `D_NAME` varchar(210) CHARACTER SET utf8mb4 COLLATE utf8mb4_0900_ai_ci DEFAULT NULL, PRIMARY KEY (`AUTO_INC`) ) ENGINE=InnoDB AUTO_INCREMENT=13 DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_0900_ai_ci 1 row in set (0.00 sec) mysql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),''); Query OK, 2 rows affected (0.00 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> LOAD DATA LOCAL INFILE '/root/dba_zc/load/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),''); Query OK, 2 rows affected (0.00 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> select * from ASSP_SIS_PAYRES_IMP_BAK; +----------+--------------------------------------------------------------------------+ | AUTO_INC | D_NAME | +----------+--------------------------------------------------------------------------+ | 7 | 台州市路桥区社会保险事业管理中心 | | 8 | 台州市路桥区社会保险事业管理中心 | | 10 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績 | | 11 | 鍙板窞甯傝矾妗ュ尯绀句細淇濋櫓浜嬩笟绠$悊涓績 | +----------+--------------------------------------------------------------------------+ 4 rows in set (0.00 sec)
[16 Aug 2024 11:39]
MySQL Verification Team
Hi, Sorry for the previous comment, it was caused by an innocent mistake.
[16 Aug 2024 12:13]
MySQL Verification Team
HI Mr. zhang, You have used some conversions of your own, which are not supported by LOAD DATA in 8.0. Without your conversions we get the following results: +----------+-------------------------------------------------------------------------------------------------------------------------------------------+ | AUTO_INC | D_NAME | +----------+-------------------------------------------------------------------------------------------------------------------------------------------+ | 路桥区社会保险事业管理中心 | 2 | 台州市路桥区社会保险事业管理中心 | | 路桥区社会保险事业管理中心 | 5 | 台州市路桥区社会保险事业管理中心 | +----------+-------------------------------------------------------------------------------------------------------------------------------------------+ We just used: load data local infile '/tmp/test_gbk.txt' into table t1 CHARACTER SET GB18030 (d_name); load data local infile '/tmp/test_utf8.txt' into table t1 CHARACTER SET UTF8MB4 (d_name); However, it does not work correctly ....... while it should. Furthermore, auto-increment column values are not displayed at all. Hence, this is a low priority bug. We are verifying your bug for the version 8.0 and higher.
[19 Aug 2024 2:04]
chong zhang
I found a temporary way to avoid it . Before loading DATA, set @row=_binary''; From a practical perspective, there is no problem. May I do that ? select version(); +-----------+ | version() | +-----------+ | 8.0.25 | +-----------+ 1 row in set (0.00 sec) mysql> set @row=_binary''; Query OK, 0 rows affected (0.00 sec) mysql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_UTF8_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET UTF8MB4 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING UTF8MB4)),''); Query OK, 2 rows affected (0.01 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> set @row=_binary''; Query OK, 0 rows affected (0.00 sec) mysql> LOAD DATA LOCAL INFILE '/home/greatdb/TEST_GB18030_bak.txt' IGNORE INTO TABLE ASSP_SIS_PAYRES_IMP_BAK CHARACTER SET GB18030 IGNORE 0 LINES (@row) SET `D_NAME` = NULLIF(TRIM(CONVERT(UNHEX(SUBSTR(HEX(@row), 1,240)) USING GB18030)),''); Query OK, 2 rows affected (0.01 sec) Records: 2 Deleted: 0 Skipped: 0 Warnings: 0 mysql> select * from assp_sis_payres_imp_bak; +----------+--------------------------------------------------+ | AUTO_INC | D_NAME | +----------+--------------------------------------------------+ | 1 | 台州市路桥区社会保险事业管理中心 | | 2 | 台州市路桥区社会保险事业管理中心 | | 4 | 台州市路桥区社会保险事业管理中心 | | 5 | 台州市路桥区社会保险事业管理中心 | +----------+--------------------------------------------------+ 4 rows in set (0.00 sec)
[19 Aug 2024 10:02]
MySQL Verification Team
Hi Mr. zhang, This is a workaround that will work in this set of conditions, collations and type of data contained in the files that are loaded. It might not work under other conditions ......
[23 Jul 8:06]
ximin liang
Hello All: This commit in 8.0.17 introduces problem. https://github.com/mysql/mysql-server/commit/5cc4330#diff-c12a237f468be9329e07e1e7a3ff345c.... After utf8-text was loaded, charset of user var `row` is set to utf8mb4. Then load gb18030-text, mysqld do next things: 1. first parser set utf8mb4 for Item_func_get_user_var, see Item_func_get_user_var::resolve_type 2. load text and set gb18030 info for user var row 3. finaly, in Item_func_get_user_var::val_str, for the reason collation of item is different with user var, gb18030 text is converted to uft8mb4 so garbled characters are generated.