Bug #57709 c-api-data-structures information now incomplete
Submitted: 25 Oct 2010 10:59 Modified: 9 Aug 2013 13:53
Reporter: Peter Laursen (Basic Quality Contributor) Email Updates:
Status: No Feedback Impact on me:
None 
Category:MySQL Server: C API (client library) Severity:S3 (Non-critical)
Version:4.1+ OS:Any
Assigned to: Georgi Kodinov
Tags: qc
Triage: Triaged: D5 (Feature request)

[25 Oct 2010 10:59] Peter Laursen
Description:
This report is about http://dev.mysql.com/doc/refman/5.0/en/c-api-data-structures.html 

'enum_field_types' and 'unsigned long length' evaluated together makes it possible to construct (client side) a TABLE DEFINITION from a result set. It is understandable that such table definition will be omitting some information (like indexes, engine, charset) but still datatypes are recoverable from the result set alone. Or rather they *were* recoverable.  Since introduction of unicode charsets in MySQL 4.1 it is not quite the case any more. 

The problem is "The server determines the length value before it generates the result set, so this is the minimum length required for a data type capable of holding the largest possible value from the result column, without knowing in advance the actual values that will be produced by the query for the result set.".  The length here is byte-lenght - not character-lenght. It is not possible reliably to identify the character width for char/varchar columns (from the result set alone - additional info from a SHOW statement is requried.

Several GUI clients (including phpMyAdmin and SQLyog - but probably more) has an option to 'export a result set as SQL'. It will often reult in 'wider' char/varchar columns than the original storage.  I only find it safe to create the column with the same character-width as the 'unsigned long length', but 'unsigned long length' may be 3 or 4 times (depending on server version) too big for this (as compared to the original storage) with utf8 data.

How to repeat:
see above

Suggested fix:
at least: 
clarify how is this "the largest possible value from the result column" evaluated? Using charset used for storage, character_set_connection or character_set_client (or a combination of those)? In my understanding it should be character_set_client.

at best: 
send information to client of width-per-character (or simply the charset) for string columns. There is not enough information now.
[25 Oct 2010 10:59] Peter Laursen
Fixed typo in synopsis.
[25 Oct 2010 11:04] Peter Laursen
BTW: Please clarify ASAP:

clarify how is this "the largest possible value from the result column" evaluated? Using charset used for storage, character_set_connection or character_set_client (or a combination of those)?
[5 Nov 2010 19:30] Shane Bester
this reminded me of not a bug 42488 I filed for prepared statements...
[5 Nov 2010 20:32] Peter Laursen
I am mystified!

I got a notification "Updated by: Shane Bester" but I do not see any new content added. What happened?  

If Shane added some internal consideration I cannot see, I should not have notification (but I think it would be more correct to let the original reporter see all communication) - but this is a hypothesis only of course!
[6 Nov 2010 13:33] Peter Laursen
I think simply one 'piece of information' is missing for every (var)char column in the result set, namely 'number of bytes used to represent a character'.  It is non-trivial (and may be slow) to send additional queries to the server when the result set is populated from many tables. 

(ie.  "SET NAMES latin1;" and read data stored as utf8.  Will a char(10) report 10 or 30 as length for the coulumn (answer is '30')? Next read data stored as latin1.  Same question (answer is '10').  Result set and client has no information about the "max. possible byte-width of storage" - only the server has.  Client will have to assume (var)char(30) utf8-case - what is wrong.)

I suggest you simply add this 'missing piece of information' to the data structure. It will be '1' then always for non-(var)char columns and for all single-byte charset with (var)char as well.

But more important right now: could somebody please tell the logic that the server uses now? Does it simply set byte_length = 3*character_length for data stored as utf8? It looks like it - but character_set_connection and/or character_set client may be 'narrower' - so the client cannot handle it properly without information from the server about 'number of bytes used to represent a character' (at least I cannot see any way to do).
[12 Nov 2010 19:25] Sveta Smirnova
Thank you for the report.

"clarify how is this "the largest possible value from the result column" evaluated? Using
charset used for storage, character_set_connection or character_set_client (or a
combination of those)? In my understanding it should be character_set_client." verified as documentation issue.
[12 Nov 2010 19:29] Peter Laursen
This is more than a documentation issue.  The problem is that *c-api-data-structures information now incomplete* (from 4.1).  Documentation cannot change that!
[16 Nov 2010 8:21] Stefan Hinz
As the bug reporter says, this is not a documentation bug. Please fix and/or indicate what exactly we should document (as a clarification/limitation/workaround).
[8 Sep 2012 9:24] Peter Laursen
Any chance that this could be fixed soon? This report is now 2 years old. :-(
[1 Jul 2013 13:40] Peter Laursen
there is a related discussion here:
http://bugs.mysql.com/bug.php?id=69627
[1 Jul 2013 15:57] Roland Bouman
Ok - I filed a separate bug for my findings:

http://bugs.mysql.com/bug.php?id=69630&thanks=2&notify=67

I'm not entirely sure I have full understanding of Peter's issue, but the metadata seem messed up to me.
[9 Jul 2013 13:53] Georgi Kodinov
Peter,

Provided the discussion in Ronald's bug (69630) do you see anything that isn't clear with how the field is filled ? 
In general, when it calculates the possible number of bytes it assumes the worst case : all characters taking the longest byte sequences as defined in the collation, e.g. for UTF8 that would be 3 bytes for each character (http://dev.mysql.com/doc/refman/5.6/en/charset-unicode-utf8.html).  
The precise formula used to calculate the max length is in Protocol::send_result_set_metadata : 

        /*
          For TEXT/BLOB columns, field_length describes the maximum data
          length in bytes. There is no limit to the number of characters
          that a TEXT column can store, as long as the data fits into
          the designated space.
          For the rest of textual columns, field_length is evaluated as
          char_count * mbmaxlen, where character count is taken from the
          definition of the column. In other words, the maximum number
          of characters here is limited by the column definition.

          When one has a LONG TEXT column with a single-byte
          character set, and the connection character set is multi-byte, the
          client may get fields longer than UINT_MAX32, due to
          <character set column> -> <character set connection> conversion.
          In that case column max length does not fit into the 4 bytes
          reserved for it in the protocol.
        */
        max_length= (field.type >= MYSQL_TYPE_TINY_BLOB &&
                     field.type <= MYSQL_TYPE_BLOB) ?
                     field.length / item->collation.collation->mbminlen :
                     field.length / item->collation.collation->mbmaxlen;
        field_length= char_to_byte_length_safe(max_length,
                                               thd_charset->mbmaxlen);
[10 Aug 2013 1:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".