MySQL Bugs: #41632: Document how to create a CHAR / VARCHAR like type with fixed storage space.

Bug #41632	Document how to create a CHAR / VARCHAR like type with fixed storage space.
Submitted:	19 Dec 2008 11:14	Modified:	19 Feb 2009 7:12
Reporter:	Michael Evans	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server: Documentation	Severity:	S4 (Feature request)
Version:	5.0.67-0ubuntu6	OS:	Linux (Ubuntu 8.10 fully updated)
Assigned to:		CPU Architecture:	Any
Tags:	char, fixed, size, store, storeage, utf8, varchar, width

Description:
I am trying to create a -very- small table so that it will more easily fit in to memory and read more quickly off the disk. A few of various sizes, however the smallest should be the state/province table. It will literally have two integers (likely unsigned tiny-ints), and a 2 character field. Sure, I could just use ASCII for it, but what if support is later added for a code which is 1 UTF-8 character (and 2 bytes of storage) wide?

A better example; various name related fields which I want to store in UTF-8 and limit to a fixed number of characters on disk. I don't care if I can only store 13 of some complex foreign characters in to a field with length 40, if it's that complicated or rare I probably don't need 40 characters to store a unique entry.

At first glance I would -guess- that the BINARY / VARBINARY are what is intended to serve this function, possibly with a specification of a collation set to use with it.

How to repeat:
Read the manual: 9.5. Data Type Storage Requirements in my case.

Unable to find a storage type/specification method which occupies exactly N bytes per row that is similar to CHAR(N).

Suggested fix:
Update the manual ( 9.5. Data Type Storage Requirements from the local copy ) to note if BINARY can be used for this; AND/OR add syntax to specify size of objects either in bytes or assured characters.

Possible example: CHAR(BYTES(N)) would create an underlying buffer of N bytes, and store UP TO N characters in it, truncating at the last whole character for multi-character or variable length characters which do not fit in the buffer.
CHAR(N) would be short for CHAR(ASSURE(N)) or some similar syntax.

Many thanks for writing a feature request. We will discuss this.

Michael,

I got feedback from our development. Unfortunately, nobody of us really understands what you are meaning.

Do you have an example for us?

Do you just mean we should document:

CHAR(N)
BINARY(N)

N == byte length

What I want is either clarification from the documentation on -how- to use the existing features to do this (if possible); OR for features to do such a basic thing to be added.

Currently if I want to store a UTF8 stream in to a given column the column's width seems to be determined by (MaxCharWidth) * NormalLengthConsiderations.  So if I say CHAR(255) I end up with not 255 bytes of storage, but 765 bytes of storage; which most of the data I end up storing will only use 1/3rd of at -maximum-.

Lets say I were storing names, either login account names, or individual names.  In the case of VARCHAR storage (which is what I am using for individual names at the moment) the number of bytes stored should be exactly whatever it takes to store up to my limit of characters, plus the overhead for encoding that number; preferably they would be stored as the minimal length byte stream for that sequence of UTF-8 characters (say I had 200 normal characters and a single strange currency symbol or denotation marker or some kind which used 3 bytes; that would take up 203 bytes for the data portion of the stream, and then whatever the database had as overhead.)

However, in the case of the account-name most of my prospective users would use more or less normal ASCII plus maybe a few special marks.  Some however might use entirely different symbol sets; like for example Japanese.  It would be nice, to be able to say something like CHAR(BYTES(40)) to have a 40 byte wide field in the data-row, which would then be used to store the account name.  Most European language users would find that length quite comfortable.  Various escaped UTF-8 characters would limit it to, 13 whole characters.  Also not a number that seems too bad as a potential user name  Using a field width of 48 bytes would also work.

The advantage to building from byte storage up is that a database designer can make the table's data structure repeat at regular intervals, take less memory, and be less complex to search.

Sorry, but we still don't understand what exactly is your problem.

Are you looking for:
CHAR(n) with n usually means numbers of characters independent how much byte the single sign has

VARCHAR(m) with m usually means the bytes

I understood you are looking for CHAR(m) is that right?

Or do you look for CHAR(n) CHARACTER SET ascii?

The way that CHAR(n) currently works with variable width character sets like UTF-8 is that it creates a field n*(max sizeof(char)) in the table, at least that is what the documentation explains.

I want a way of setting that to -just- the n I specify.  That way if I am using utf-8 I can set it to a value like 40 or 48, and have 48 chars of space, which would then store floor(n/(max sizeof(char))) to floor(n/(min sizeof(char)) chars, depending on the complexity of the characters in the character set.  In the case of n=40 and utf-8 the values should be 13 to 40.  Such a data-type would make a lot of sense for information mostly, but not necessarily completely within, the standard 7 bit ascii table, like email addresses, usernames, etc.

The setting doesn't make quite as much sense for varchar, but it would be useful for setting an enforced buffer limit for arguments to procedures and etc.

Michael,

we still not really have an idea what you want.
Usually, when this happens, there is a thinking error.

We really can't follow your thoughts.

From all what we understood we think one of the following is the workaround for you:

- CHAR(n) CHARACTER SET ascii
- use BINARY(n) instead of CHAR(n)

For "I want the size in bytes not characters" our development already started discussing about an implementation for specifying character length in octet.

Maybe it will get clearer when you let us know:

What exactly should our documentation team add/change at the documentation?

Show us the documentation link and give us the text how you would change the documentation when you would be responsible for it.

Because our documentation team has non clue what it should change/add here and our data type/character set developers plua me also don't get it I am thinking about closing the report.

When the documentation mentions BINARY it sounds as if BINARY literally means there is no character-set in use.  That it would just store whatever octet-stream you give it until it's full.  That's very close to what I want, but the documentation isn't clear about the following.

Does BINARY(40) set a field of 40 octets that you may then attach an arbitrary collation to, which will then be used for processing on that field?

So if for a column definition I had...

name BINARY(48) CHARACTER SET utf8

(NOTE: file:///usr/share/doc/mysql-doc-5.0/refman-5.0-en.html-chapter/data-types.html#string-type-overview shows BINARY having syntax: BINARY(M) -only- )

Would that behave like

name CHAR(48) CHARACTER SET utf8

EXCEPT with the following behavior differences:

CHAR(48): Always storing 48 'characters', regardless of underlying storage requirements; in fact, probably reserving room for the worst-case update: 48*3 octets per row.

BINARY(48): Always reserving 48 octets, storing UP TO 48 characters if none are multi-byte, and storing at minimum 16 (Floor(48/3) == 16 for this example) characters if they are all the worst case allowed by the character-set in use.

I would expect a BINARY without a defined character-set to behave as if everything were compared with the BINARY cast operator; and otherwise to behave as a normal char/varchar type would given the same character-set.

The difference in storage seems like it would be 2/3rds, at least based on my current (mis?)understanding of how it operates.

BINARY does indeed have no character set; Octets are compared/sorted using numeric byte values.

However, when you want to treat the contents of a BINARY value using a particular character set/collation, you can use CONVERT() to convert the value to a non-binary string with the correct character set and optionally add a COLLATE clause to specify a collation (if the character set's default collation is not what you want). See:

http://dev.mysql.com/doc/refman/5.1/en/cast-functions.html
http://dev.mysql.com/doc/refman/5.1/en/charset-binary-collations.html

Note: A problem with BINARY is that you'll have the trailing 0x00 bytes included in the value. You might want to use VARBINARY instead.

Why can't I tell the database that's how I always want to access the data while creating/updating the table with the syntax I mentioned in the previous message?

It could interpret that as a CHAR datatype which is limited by underlying storage and not by outward character requirements.

It is given in SQL 2003 standard that length of CHARACTER VARYING (short VARCHAR) has to be given in characters and not in byte.

MySQL is just staying SQL standard conform here and so this is not a bug.

Does the standard forbid the other method I described?  Modifying BINARY to work as a storage container specification with an attached default interpretation (character set/collation type) so that it would then otherwise behave like a CHAR or VARCHAR?