Bug #79977 utf8mb4_unicode_520_ci don't make sense for Japanese FTS
Submitted: 14 Jan 2016 5:46 Modified: 24 Jan 8:45
Reporter: tsubasa tanaka (OCA) Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S4 (Feature request)
Version:Any OS:Any
Assigned to: CPU Architecture:Any
Tags: 8.0
Triage: Needs Triage: D5 (Feature request)

[14 Jan 2016 5:46] tsubasa tanaka
Description:
This issue relates Bug #76553.

[MySQL Bugs: #76553: Sushi-Beer issue of MySQL with utf8mb4](https://bugs.mysql.com/bug.php?id=76553)

This explains that Bug #76553's another side problem for Japanese FullText Search.

* Japanese user use some types for expressing word.
* For example, the word means "mother", is expressed "はは" (Hiragana Case) and "ハハ" (Wide-size Katakana Case) and "ハハ" (Narrow-size Katakana Case).
  * They have same sound and same mean. We would like to take these as same character, in most of all case.
* utf8mb4_bin, utf8mb4_general_ci are Hiragana-Katakana Case Sensitive, utf8mb4_unicode_ci is Hiragana-Katakana Case Insensitive.

* The other hand, we had Youon Case.
* For example, we write "hospital" as "びょういん", and we write "heir salon" as "びよういん".
  * They have only one difference of character "ょ" and "よ", and these two characters are very similar form.
  * But, they have different sound and different mean in most of all cases. We'd like to distinct them.
* utf8mb4_bin and uft8mb4_general_ci are Youon Case Sensitive, utf8mb4_unicode_ci is Youon Case Insensitive.

* There's more expression in Japanese, we have Dakuten and Handakuten.
* As explaining in Bug #76553, "ハハ" (Plain Case) means "mother", "ババ" (Dakuten Case) means "grand mother", "パパ" (Handakuten Case) means "daddy".
  * These have to be distincted.
* utf8mb4_bin and utf8mb4_general_ci are Dakuten-Handakuten Case Sensitive, utf8mb4_unicode_ci is Dakuten-Handakuten Case Insensitive.

* Last one is Wide Case and Narrow Case.
* For example, "MySQL" (Narrow Case) and "MySQL" (Wide Case), of course they have same sound and same mean.
  * So we'd like to take them same character.
* utf8mb4_bin and utf8mb4_general_ci are Wide-Narrow Case Sensitive, utf8mb4_unicode_ci is Wide-Narrow Case Insensitive.

* Hiragana-Katakana cs and Wide-Narrow cs are incombinient but we can be patient.
* But Youon ci and Dakuten-Handakuten ci are critical issue on FTS query(simple query using LIKE operator too)
  * Is anyone who expects search results of "hair salon"(びよういん) when they type "hospital"(びょういん) ?
* And these ci includes potential of breaking UNIQUE KEY constraint.
  * When `UNIQUE KEY(relation_from_me)`, I have a mother and INSERT INTO "mother"(ハハ), and I have a father and INSERT INTO "daddy"(パパ), second query will fail because of duplicate key error.

Discussion in [Planning the defaults for MySQL 5.8 | MySQL Server Blog](http://mysqlserverteam.com/planning-the-defaults-for-mysql-5-8/) says "server default collation will be utf8mb4_unicode_520_ci".
But it will make Japanese(and other countries?) users to be confusing.

Honestly, we really need utf8mb4_japanese_ci.
But at least, utf8mb4_unicode*_ci are critical for Japanese.

|                    | utf8mb4_bin | utf8mb4_general_ci | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci|
|--------------------|-------------|--------------------|--------------------|-----------------------|
| Hiragana-Katakana  | cs (unkind) | cs (unkind)        | ci (good)          | ci(good)              |
| Youon              | cs (good)   | cs (good)          | ci (critical)      | ci(critical)          |
| Dakuten-Handakuten | cs (good)   | cs (good)          | ci (critical)      | ci(critical)          |
| Wide-Narrow        | cs (unkind) | cs (unkind)        | ci (good)          | ci(good)              |
| Sushi-Beer         | cs          | ci                 | ci                 | cs                    |

How to repeat:
See description.

Suggested fix:
If you change default value, that should not be utf8mb4_unicode*_ci, would be utf8mb4_general_ci or utf8mb4_bin.
[19 Jan 2016 8:19] Umesh Shastry
Hello Tanaka-San,

Thank you for the report and feedback!

Thanks,
Umesh
[3 Apr 2017 5:51] tsubasa tanaka
I seem this FR has been realized by utf8mb4_ja_0900_as_cs in 8.0.1!!!

|                    | utf8mb4_bin | utf8mb4_general_ci | utf8mb4_unicode_ci | utf8mb4_unicode_520_ci | utf8mb4_ja_0900_as_cs |
|--------------------|-------------|--------------------|--------------------|------------------------|-----------------------|
| Hiragana-Katakana  | cs (unkind) | cs (unkind)        | ci (good)          | ci(good)               | ci(good)              |
| Youon              | cs (good)   | cs (good)          | ci (critical)      | ci(critical)           | cs(good)              |
| Dakuten-Handakuten | cs (good)   | cs (good)          | ci (critical)      | ci(critical)           | cs(good)              |
| Wide-Narrow        | cs (unkind) | cs (unkind)        | ci (good)          | ci(good)               | ci(good)              |
| Sushi-Beer         | cs          | ci                 | ci                 | cs                     | cs                    |
[24 Jan 8:45] Xing Zhang
Posted by developer:
 
Fixed by adding utf8mb4_ja_0900_as_cs and utf8mb4_ja_0900_as_cs_ks.