Bug #112950 | mysql 5.5 Upgrade 8.0 exception caused by 0900 collate | ||
---|---|---|---|
Submitted: | 3 Nov 2023 8:20 | Modified: | 4 Nov 2023 11:15 |
Reporter: | xichao li (OCA) | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 8.0 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[3 Nov 2023 8:20]
xichao li
[3 Nov 2023 8:23]
xichao li
attached sql file
Attachment: d2.sql (application/octet-stream, text), 51 bytes.
[3 Nov 2023 12:14]
MySQL Verification Team
HI Mr. li, Thank you for your bug report. Our strict policy is that you should upgrade from 5.5 to 5.6, then from 5.6 to 5.7 and lastly 5.7 to 8.0. However, we found that your analysis is correct. Verified as reported.
[3 Nov 2023 13:51]
Roy Lyseng
Posted by developer: This is not a bug. First, this is not an upgrade bug, since the table is created in 8.0 with a collation that was introduced in 8.0. Second, the data set includes code points that are indeed ignorable according to utf8mb4_0900_ai_ci. Thus, the strings '62284830187725972' and '<FEFF>62284830187725972' will be considered equal and cause a unique index violation. U+FEFF is code point ZERO WIDTH NO-BREAK SPACE, but if this from a UTF-16 it is a BOM (Byte Order Mark) for big-endian UTF-16. The workaround here will be to sanitize the data. It may be possible to replace the index with a binary collation.
[4 Nov 2023 11:15]
xichao li
Thank you for your feedback. When we combine the code, we see the implementation mechanism for string comparison at utf8mb4_0900. We also feel that this may be the logic of the utf8mb4_0900_ai_ci collation rule itself. But since no instructions were found, the bug was submitted for confirmation. The solution is also, as you said, by cleaning up abnormal data. For future databases involved in the upgrade, it is planned to change the collation rule back to the utf8mb4_general_ci before 8.0 to avoid the recurrence of this problem. After all, deleting data is not something DBAs are good at. In addition, since 8.0.1, utf8mb4_0900_ai_ci has been the default collation rule for the utf8mb4 character set. We can only see that the utf8mb4_0900_ai_ci is promoted relative to the utf8mb4_general_ci, but there is no complete indication of the difference. This may cause the user to fail to recognize the difference, causing the upgrade to fail. Finally, judging from our recent upgrades, two out of five database upgrades have this problem, and we suspect that more systems may have this problem. We are guessing: since the end user of the application may paste data from anywhere, then the application system inserts into the mysql database. If characters like <feff> are present, they are largely inserted into tables in the database. Since mysql 5.7 is nearing the end of its life cycle, this error will inevitably occur when it is upgraded to 8.0.