Bug #85546 strnxfrmlen is too conservative for unicode 9.0.0 collations
Submitted: 20 Mar 2017 14:33
Modified: 4 May 2017 12:44
Reporter: Steinar Gunderson
Status: Closed
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version:
OS: Any
Assigned to:
CPU Architecture: Any

[20 Mar 2017 14:33] Steinar Gunderson
Description:
Under utf8mb4_0900_ai_ci, strnxfrmlen() for a VARCHAR(1) (which takes up to four bytes) returns 32. The rationale for this choice (strnxfrm_multiply=8) is not documented. However, no character has more than eight weights, which amounts to 16 bytes, so we are creating sort keys twice as long as we need.
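
The arithmetic above can be sketched as follows. This is only an illustration of the numbers quoted in this report, not the server's code: the function names are hypothetical, and the two-bytes-per-weight figure is inferred from "eight weights, giving 16 bytes".

```python
# Constants taken from the figures quoted in this report (assumptions,
# not values read from the server source).
MAX_BYTES_PER_CHAR = 4    # a utf8mb4 VARCHAR(1) takes up to four bytes
STRNXFRM_MULTIPLY = 8     # multiplier reported for utf8mb4_0900_ai_ci
MAX_WEIGHTS_PER_CHAR = 8  # no character has more than eight weights
BYTES_PER_WEIGHT = 2      # eight weights give 16 bytes


def current_bound(num_chars):
    """Hypothetical model of the current bound: byte length times multiplier."""
    return num_chars * MAX_BYTES_PER_CHAR * STRNXFRM_MULTIPLY


def tight_bound(num_chars):
    """Hypothetical tight bound for ai_ci: maximum weights times weight size."""
    return num_chars * MAX_WEIGHTS_PER_CHAR * BYTES_PER_WEIGHT


if __name__ == "__main__":
    for n in (1, 10, 255):
        print(n, current_bound(n), tight_bound(n))
```

For VARCHAR(1) this gives 32 bytes against 16, and the ratio stays at 2x for any column length.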

For as_cs, this is more complicated; we again have a 2x bloat factor, but we need four static bytes for the weight separators. Thankfully, these two effects work against each other, so the bloat absorbs the weight separators in all cases. Similarly, for utf8mb4_ja_0900_as_cs, we add extra weights for some characters on the primary level, but again, the extra bloat happens to save us.
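
A rough check of the as_cs claim, under stated assumptions: the per-character requirement is modelled as half of the current bound (the "2x bloat factor"), plus four static separator bytes. The function names are hypothetical, and the extra primary-level weights of utf8mb4_ja_0900_as_cs are not modelled because the report gives no figures for them.

```python
# All figures are assumptions taken from the text of this report.
CURRENT_BYTES_PER_CHAR = 32  # 4 bytes per character times strnxfrm_multiply=8
NEEDED_BYTES_PER_CHAR = 16   # half of the above, per the "2x bloat factor"
SEPARATOR_BYTES = 4          # static bytes for the weight separators


def current_bound(num_chars):
    """Hypothetical model of what strnxfrmlen() returns today."""
    return num_chars * CURRENT_BYTES_PER_CHAR


def needed_as_cs(num_chars):
    """Hypothetical tight requirement for an as_cs collation."""
    return num_chars * NEEDED_BYTES_PER_CHAR + SEPARATOR_BYTES


def slack(num_chars):
    """Bytes left over after the separators have been absorbed."""
    return current_bound(num_chars) - needed_as_cs(num_chars)


if __name__ == "__main__":
    # The bloat covers the separators for every column of at least one character.
    assert all(slack(n) >= 0 for n in range(1, 1001))
    print(slack(1), slack(255))
```

Under this model the slack is already 12 bytes for a single character and grows by 16 bytes per additional character, which is why the separators never overflow the current bound.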

How to repeat:
N/A

Suggested fix:
Change from strnxfrmlen_simple to a custom-built function that takes into account weight separators, reordering and the like, and sets tight bounds for these collations.
[4 May 2017 12:44] Paul DuBois
Posted by developer:
 
Noted in 8.0.2 changelog.