Bug #44968 Buffer length Vs column size for wide character resultset columns
Submitted: 19 May 2009 22:04 Modified: 23 Dec 2009 23:37
Reporter: Farid Zidan (Candidate Quality Contributor) Email Updates:
Status: Verified Impact on me:
Category: Connector / ODBC Severity: S5 (Performance)
Version: OS: Microsoft Windows (XP SP3)
Assigned to: Bogdan Degtyariov CPU Architecture:Any
Triage: D2 (Serious)

[19 May 2009 22:04] Farid Zidan
The buffer length for wide-character result set columns is larger than expected for the reported column size. For wide-character columns, the buffer length and the column size normally satisfy this assert, but here it fails:
ASSERT( nBufferLength == pRow->column_size * sizeof(wchar_t) );

        // column size
        rc = SQLColAttribute( hstmt, nCol, SQL_DESC_LENGTH, NULL, 0, NULL,
                              (SQLLEN*) &pRow->column_size );

        // buffer length
        rc = SQLColAttribute( hstmt, nCol, SQL_DESC_OCTET_LENGTH, NULL, 0, NULL,
                              (SQLLEN*) &pRow->buffer_length );

How to repeat:

Describing the result set using the above two API calls, I get column sizes of 64 and 6, while the buffer sizes are reported as 193 and 19 respectively.
trigger_name	varchar(64)	SQL_WVARCHAR
action_timing	varchar(6)	SQL_WVARCHAR

Is this by design? It seems that the reported buffer size is (column_size*3 + 1) instead of the usual column_size*2 for wide characters.
[17 Dec 2009 7:43] Bogdan Degtyariov
Hi Farid,

Sorry it took so long to get to this bug report.

You are right that the SQL_DESC_OCTET_LENGTH attribute is returned incorrectly. However, the correct buffer size is not column_size*2. Please note that the default character set for the connection is UTF-8, so the driver should calculate the buffer size as column_size*3.

UTF-8 is a variable-length character set whose characters can take from 1 to 3 bytes here (MySQL's utf8 character set uses at most 3 bytes per character). According to MSDN, we need to consider the maximum length in bytes. Nevertheless, this value does not include the null terminator, which is obviously added in the formula column_size*3 + 1.

Marked as verified.
The patch is coming soon.
[17 Dec 2009 10:05] Bogdan Degtyariov
Patch to discuss

Attachment: patch44968.diff (text/x-diff), 588 bytes.

[17 Dec 2009 10:15] Bogdan Degtyariov
MSDN says that the octet length should not include the null terminator.

[17 Dec 2009 14:14] Farid Zidan
Hi Bogdan,

Since Windows clients do not use UTF-8, the driver needs to return either single-byte (ANSI) or double-byte (UCS-2) buffers to the Windows client. So for ANSI character data the octet buffer size is column length * 1, whereas for Unicode strings it needs to be string length * 2. I am not sure about other OSes that the driver may target, but on Windows the driver has to convert from UTF-8 to either ANSI or UCS-2 as noted above. Thanks.
[18 Dec 2009 3:45] Bogdan Degtyariov

I think that Windows clients are not limited to only two possible encodings, ANSI or UCS-2 (16-bit). However, I agree that the SQLWCHAR data type is 16-bit on Windows.

On the other hand, the octet length of field_size*3 comes from the server metadata, as the original field is UTF-8; thus it requires the corresponding multiplier (3). To be completely correct, SQL_DESC_OCTET_LENGTH does not represent the buffer size on the client side. As MSDN says (http://msdn.microsoft.com/en-us/library/ms713558%28VS.85%29.aspx):

- quote -------------
For fixed-length character or binary types, this is the actual length in bytes. For variable-length character or binary types, this is the maximum length in bytes. This value does not include the null terminator.
- quote -------------

Please note that they mention variable-length character types. Neither ANSI nor UCS-2 is variable-length, but UTF-8 and UTF-16 are variable-length encodings.

I consulted with the developers and they agree with this interpretation of the SQL_DESC_OCTET_LENGTH attribute. If you need to calculate the buffer size for your application, it is better to use SQL_DESC_LENGTH * sizeof(SQLWCHAR).
[18 Dec 2009 4:03] Bogdan Degtyariov
Another small piece of information from Wikipedia about UCS-2 (http://en.wikipedia.org/wiki/UTF-16/UCS-2):

UCS-2 (2-byte Universal Character Set) is a similar yet older character encoding that was superseded by UTF-16 in Unicode version 2.0, though it still remains in use. The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value.
[18 Dec 2009 19:50] Farid Zidan
Hi Bogdan,

Sorry for any confusion. What I meant was that Windows applications do not use UTF-8 when calling ODBC API functions (in some parts of my app I use UTF-8, but never for calling the ODBC API). The ODBC API handles character result set data as either single-byte (ANSI) or double-byte (Unicode), depending on how the result set target column is bound (SQL_C_CHAR or SQL_C_WCHAR). So the MySQL ODBC driver converts from UTF-8 to char or wchar as requested by the application.

Per the ODBC documentation, SQL_DESC_OCTET_LENGTH is the actual length of the data returned by the driver to the application (this should always be divisible by 2: single bytes for SQL_C_CHAR or double bytes for SQL_C_WCHAR). So if a trigger name in MySQL is 64 UTF-8 characters (which can require up to 64*3 bytes of storage in the MySQL database), the driver can return at most 64 bytes (for SQL_C_CHAR) or 64*2 bytes (for SQL_C_WCHAR) for the trigger name when it converts from UTF-8 to the client ANSI codepage or UCS-2 on Windows.

Here is the pertinent ODBC doc about SQL_DESC_OCTET_LENGTH:
Transfer Octet Length
The transfer octet length of a column is the maximum number of bytes returned to the application when data is transferred to its default C data type. For character data, the transfer octet length does not include space for the null-termination character. The transfer octet length of a column may be different than the number of bytes required to store the data on the data source.

I hope this explained this issue a little better. Thanks.
[18 Dec 2009 20:47] Farid Zidan
Please ignore the part about the transfer octet length being always divisible by 2 (that applies to character data only when the result set column target type is SQL_C_WCHAR).
[23 Dec 2009 12:57] Bogdan Degtyariov

Thanks for clarification. That makes sense.
[23 Dec 2009 15:06] Farid Zidan
Hi Bogdan,

My analysis of this issue assumes that the MySQL ODBC driver converts from UTF-8 to the client ANSI codepage when transferring the data of a UTF-8 column that is bound as SQL_C_CHAR, and to UTF-16 (no surrogates, or rather UCS-2) when transferring the data of a UTF-8 column that is bound as SQL_C_WCHAR. I haven't looked at the MySQL ODBC driver code, but if that's the case, then yes, the driver would transfer at most x or x*2 bytes for a UTF-8 column that is defined as x characters wide.

[23 Dec 2009 23:37] Farid Zidan
Hi Bogdan,

One reservation I had was about UTF-16 surrogates. After doing more research on the web (http://unicode.org/faq/utf_bom.html#utf8-2), I am now more confident about my earlier assumption about the maximum number of bytes required to convert an x-byte-long UTF-8 string to an ANSI codepage (max x bytes) or to a UTF-16 string (max x*2 bytes).

As the reference above notes, a UTF-16 surrogate pair (4 bytes) requires 4 bytes to store in UTF-8.

Here is the quoted text from the reference above:
Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one four-byte sequence, or as two separate 3-byte sequences?

A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four byte sequence.

So to summarize: when a UTF-8 column defined as x characters wide is bound as SQL_C_CHAR, the driver will transfer at most x bytes to the client, containing the result of converting the UTF-8 data to the client codepage. If the column is bound as SQL_C_WCHAR, the driver will transfer at most x*2 bytes to the client, where the data is the result of converting the UTF-8 data to UTF-16 (or UCS-2). For this example, the server may store the column in up to 4*x bytes, since UTF-8 encoding may use from 1 to 4 bytes to encode a single Unicode character.