MySQL Bugs: #85546: strnxfrmlen is too conservative for unicode 9.0.0 collations

Bug #85546	strnxfrmlen is too conservative for unicode 9.0.0 collations
Submitted:	20 Mar 2017 14:33	Modified:	4 May 2017 12:44
Reporter:	Steinar Gunderson
Status:	Closed
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:		OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
Under utf8mb4_0900_ai_ci, strnxfrmlen() returns 32 for a VARCHAR(1) (which takes up to four bytes). The rationale for this choice (strnxfrm_multiply=8) is not documented. However, no character has more than eight weights, which take 16 bytes, so the sort keys we create are twice as long as they need to be.
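
To make the arithmetic concrete, here is a minimal Python sketch (not MySQL code). It assumes the current bound is simply the maximum byte length times strnxfrm_multiply, and that each weight occupies two bytes, which is what "eight weights, giving 16 bytes" implies. The function names current_bound and tight_bound are hypothetical.

```python
# Sketch of the two bounds for utf8mb4_0900_ai_ci; names are illustrative only.
BYTES_PER_WEIGHT = 2  # assumption: 8 weights -> 16 bytes, so 2 bytes per weight


def current_bound(max_bytes: int, strnxfrm_multiply: int) -> int:
    """Assumed shape of the existing bound: input bytes times a fixed multiplier."""
    return max_bytes * strnxfrm_multiply


def tight_bound(max_chars: int, max_weights_per_char: int) -> int:
    """Bound derived from the largest number of weights any one character can produce."""
    return max_chars * max_weights_per_char * BYTES_PER_WEIGHT


# VARCHAR(1) in utf8mb4: one character, up to four bytes.
varchar1_current = current_bound(max_bytes=4, strnxfrm_multiply=8)
varchar1_tight = tight_bound(max_chars=1, max_weights_per_char=8)
print(varchar1_current, varchar1_tight)  # prints "32 16"
```

Under these assumptions the reserved sort key is exactly twice the size that the worst-case character needs.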

For as_cs, this is more complicated: we again have a 2x bloat factor, but we also need four static bytes for the weight separators. Thankfully, these two effects offset each other, so the bloat absorbs the weight separators in all cases. Similarly, for utf8mb4_ja_0900_as_cs, we add extra weights for some characters on the primary level, but again, the bloat happens to cover them.

How to repeat:
N/A

Suggested fix:
Change from strnxfrmlen_simple to a custom-built function that takes into account weight separators, reordering and the like, and set tight bounds for these collations.
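
As a rough illustration of what such a function might compute, here is a Python sketch (not the actual fix). It makes several assumptions: each character expands to at most eight collation elements, each element contributes one two-byte weight per level, ai_ci uses one level and as_cs uses three, and one two-byte separator sits between consecutive levels (which gives the four static bytes mentioned above). The name tight_strnxfrmlen and the parameter extra_primary_weights_per_char are hypothetical.

```python
# Sketch of a tighter sort-key length bound; all parameters are assumptions.
BYTES_PER_WEIGHT = 2  # assumption: weights and separators are 16-bit values


def tight_strnxfrmlen(max_chars: int,
                      max_ces_per_char: int,
                      levels: int,
                      extra_primary_weights_per_char: int = 0) -> int:
    """Upper bound on sort-key bytes for a string of at most max_chars characters.

    extra_primary_weights_per_char models collations that add weights on the
    primary level for some characters (the description names utf8mb4_ja_0900_as_cs).
    """
    weights_per_char = max_ces_per_char * levels + extra_primary_weights_per_char
    separators = levels - 1  # one separator between consecutive levels
    return (max_chars * weights_per_char + separators) * BYTES_PER_WEIGHT


# One level (ai_ci-like): no separators, eight weights.
ai_ci_varchar1 = tight_strnxfrmlen(max_chars=1, max_ces_per_char=8, levels=1)
# Three levels (as_cs-like): 24 weights plus two separators.
as_cs_varchar1 = tight_strnxfrmlen(max_chars=1, max_ces_per_char=8, levels=3)
print(ai_ci_varchar1, as_cs_varchar1)  # prints "16 52"
```

The separator cost is a per-string constant rather than a per-character one, which is why a pure multiplier cannot express it and a dedicated function is suggested instead.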

Posted by developer:
 
Noted in 8.0.2 changelog.