Bug #78227 Characters versus bytes in index prefix
Submitted: 26 Aug 2015 14:18 Modified: 19 Feb 2016 16:30
Reporter: monty solomon Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Documentation Severity:S3 (Non-critical)
Version:5.6 OS:Any
Assigned to: Paul DuBois CPU Architecture:Any

[26 Aug 2015 14:18] monty solomon
Description:
The 13.7.5.23 SHOW INDEX Syntax page states

Sub_part

The number of indexed characters if the column is only partly indexed, NULL if the entire column is indexed.

The 8.3.4 Column Indexes page states

With col_name(N) syntax in an index specification, you can create an index that uses only the first N characters of a string column.

and

Note
Prefix limits are measured in bytes, while the prefix length in CREATE TABLE statements is interpreted as number of characters. Take this into account when specifying a prefix length for a column that uses a multibyte character set.

How to repeat:
The Sub_part section of the SHOW INDEX Syntax page does not contain the note about bytes vs characters.

Suggested fix:
Since the (N) always means bytes and sometimes characters the wording on both pages should be changed from using characters to using bytes.

The note from the Column Indexes page should be added to the SHOW INDEX Syntax page.

The number of indexed *bytes* if the column is only partly indexed, NULL if the entire column is indexed.

With col_name(N) syntax in an index specification, you can create an index that uses only the first N *bytes* of a string column.
[1 Sep 2015 18:17] MySQL Verification Team
I do not see where is the documentation wrong. You just have to follow the  instructions closely. 

Specifically, when our manual describe limits, it describes it in bytes and not characters, simply because limits are defined in the number of bytes. Those have to be bytes, because these are hard limits used in our code for indexing. Or in other words:

"Prefix limits are measured in bytes, while the prefix length in CREATE TABLE statements is interpreted as number of characters. Take this into account when specifying a prefix length for a column that uses a multibyte character set."

Hence, it is user's obligation to calculate how many characters can fit in order not to surpass limits. And number of bytes per character are specific for each different character set.
[2 Sep 2015 2:47] monty solomon
The documentation is wrong/incomplete/inconsistent on the "SHOW INDEX Syntax" page where it states that the sub_part is the number of indexed characters if the column is only partly indexed.

On that page it does not mention anything about characters versus bytes and therefore the user would not necessarily know that the number of indexed characters is actually less than the displayed sub_part for any character sets that use more than one byte per character.

For clarity, the note on the "Column Indexes" page should be added to the sub_part section of the "SHOW INDEX Syntax" page.

The terminology used on those two pages differ. On the "Column Indexes" page it is referred to as a prefix and on the "SHOW INDEX Syntax" page it is referred to as sub_part.
[2 Sep 2015 13:24] MySQL Verification Team
These are two insignificant changes.

First one would involve reference to the chapter on the character sets and second one would change one term for another.

These are very, very small changes, but they would lead to some hardly noticeable improvements, but still improvements.

Verified.
[19 Feb 2016 16:30] Paul DuBois
Thank you for your bug report. This issue has been addressed in the documentation. The updated documentation will appear on our website shortly.