Bug #44968 | Buffer length vs. column size for wide character resultset columns | | |
---|---|---|---|
Submitted: | 19 May 2009 22:04 | Modified: | 23 Dec 2009 23:37 |
Reporter: | Farid Zidan (Candidate Quality Contributor) | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | Connector / ODBC | Severity: | S5 (Performance) |
Version: | 5.01.05.00 | OS: | Windows (XP SP3) |
Assigned to: | Bogdan Degtyariov | CPU Architecture: | Any |
[19 May 2009 22:04]
Farid Zidan
[17 Dec 2009 7:43]
Bogdan Degtyariov
Hi Farid,

Sorry it took so long to get to this bug report. You are right that the SQL_DESC_OCTET_LENGTH attribute is returned incorrectly. However, the correct buffer size is not column_size*2. Please note that the default character set for the connection is UTF-8, so the driver should calculate the buffer size as column_size*3. UTF-8 is a variable-length character set whose characters can take from 1 to 3 bytes, and according to MSDN we need to consider the maximum length in bytes. This value does not include the null terminator, which is added in the formula column_size*3 + 1.

Marked as verified. The patch is coming soon.
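A minimal sketch in C of the buffer arithmetic described above, assuming a utf8 column of `column_size` characters; the helper names are hypothetical and not taken from the driver sources:

```c
#include <stddef.h>

/* Octet length reported for a utf8 column of column_size characters:
 * each character may need up to 3 bytes, and the value excludes the
 * null terminator (SQL_DESC_OCTET_LENGTH per MSDN). */
static size_t utf8_octet_length(size_t column_size)
{
    return column_size * 3;
}

/* Buffer size a client would allocate to also hold the terminator. */
static size_t utf8_buffer_size(size_t column_size)
{
    return column_size * 3 + 1;
}
```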
[17 Dec 2009 10:05]
Bogdan Degtyariov
Patch to discuss
Attachment: patch44968.diff (text/x-diff), 588 bytes.
[17 Dec 2009 10:15]
Bogdan Degtyariov
MSDN says that the octet length should not include the null terminator: http://msdn.microsoft.com/en-us/library/ms713558%28VS.85%29.aspx
[17 Dec 2009 14:14]
Farid Zidan
Hi Bogdan,

Since Windows clients do not use UTF-8, the driver needs to return either single-byte (ANSI) or double-byte (UTF-16) buffers to the Windows client. So for ANSI char data the octet buffer size is column length * 1, whereas for Unicode strings it needs to be string length * 2. I am not sure about other OSes the driver may also target, but for Windows the driver has to convert from UTF-8 to either ANSI or UTF-16 as noted above. Thanks.
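For illustration, a hedged C sketch of the client-side sizing described here: a 64-character value bound once as SQL_C_CHAR (one byte per character) and once as SQL_C_WCHAR (one SQLWCHAR per character). The statement handle and column numbers are placeholders, not taken from the report:

```c
#ifdef _WIN32
#include <windows.h>
#endif
#include <sql.h>
#include <sqlext.h>

/* Bind a 64-character column both ways; hstmt is assumed to be a
 * statement handle with an open result set. */
void bind_both_ways(SQLHSTMT hstmt)
{
    SQLCHAR  narrow_buf[64 + 1];   /* SQL_C_CHAR: 64 bytes of data + terminator */
    SQLWCHAR wide_buf[64 + 1];     /* SQL_C_WCHAR: 64 SQLWCHARs + terminator    */
    SQLLEN   narrow_ind, wide_ind;

    SQLBindCol(hstmt, 1, SQL_C_CHAR,  narrow_buf, sizeof(narrow_buf), &narrow_ind);
    SQLBindCol(hstmt, 2, SQL_C_WCHAR, wide_buf,   sizeof(wide_buf),   &wide_ind);
}
```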
[18 Dec 2009 3:45]
Bogdan Degtyariov
Farid,

I think Windows clients are not limited to only two possible encodings, ANSI or UCS-2 (16-bit). However, I agree that the SQLWCHAR data type is 16-bit on Windows. On the other hand, the octet length of field_size*3 comes from the server metadata, as the original field is UTF-8, so it requires the corresponding multiplier (3).

To be completely correct, SQL_DESC_OCTET_LENGTH does not represent the buffer size on the client side. As MSDN says (http://msdn.microsoft.com/en-us/library/ms713558%28VS.85%29.aspx):

> For fixed-length character or binary types, this is the actual length in bytes. For variable-length character or binary types, this is the maximum length in bytes. This value does not include the null terminator.

Please note that they mention variable-length character types. Neither ANSI nor UCS-2 is variable length, but UTF-8 and UTF-16 are variable-length encodings. I consulted with the developers and they agree with this interpretation of the SQL_DESC_OCTET_LENGTH attribute. If you need to calculate the buffer size for your application, it is better to use SQL_DESC_LENGTH * sizeof(SQLWCHAR).
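A sketch of the sizing rule suggested here, assuming the application wants a wide (SQL_C_WCHAR) buffer: read SQL_DESC_LENGTH with SQLColAttribute and allocate SQL_DESC_LENGTH * sizeof(SQLWCHAR) bytes plus room for the terminator. The helper name and minimal error handling are illustrative:

```c
#ifdef _WIN32
#include <windows.h>
#endif
#include <stdlib.h>
#include <sql.h>
#include <sqlext.h>

/* Allocate a wide buffer for column `col` of an open result set on hstmt,
 * sized from SQL_DESC_LENGTH rather than SQL_DESC_OCTET_LENGTH. */
SQLWCHAR *alloc_wide_buffer(SQLHSTMT hstmt, SQLUSMALLINT col)
{
    SQLLEN length = 0;

    if (!SQL_SUCCEEDED(SQLColAttribute(hstmt, col, SQL_DESC_LENGTH,
                                       NULL, 0, NULL, &length)))
        return NULL;

    /* +1 SQLWCHAR for the null terminator the driver appends. */
    return (SQLWCHAR *)malloc((size_t)(length + 1) * sizeof(SQLWCHAR));
}
```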
[18 Dec 2009 4:03]
Bogdan Degtyariov
Another small piece of information from Wikipedia about UCS-2 (http://en.wikipedia.org/wiki/UTF-16/UCS-2):

> UCS-2 (2-byte Universal Character Set) is a similar yet older character encoding that was superseded by UTF-16 in Unicode version 2.0, though it still remains in use. The UCS-2 encoding form is identical to that of UTF-16, except that it does not support surrogate pairs and therefore can only encode characters in the BMP range U+0000 through U+FFFF. As a consequence it is a fixed-length encoding that always encodes characters into a single 16-bit value.
[18 Dec 2009 19:50]
Farid Zidan
Hi Bogdan,

Sorry for any confusion. What I meant was that Windows applications do not use UTF-8 when calling ODBC API functions (in some parts of my app I use UTF-8, but never for calling the ODBC API). The ODBC API handles character resultset data as either single-byte (ANSI) or double-byte (Unicode), depending on how the resultset target column is bound (SQL_C_CHAR or SQL_C_WCHAR). So the MySQL ODBC driver converts from UTF-8 to char or wchar as requested by the application.

Per the ODBC documentation, SQL_DESC_OCTET_LENGTH is the actual length of the data returned by the driver as it is passed to the application (this should always be divisible by 2: single bytes for SQL_C_CHAR or double bytes for SQL_C_WCHAR). So if a trigger name in MySQL is 64 UTF-8 characters (which can require up to 64*3 bytes of storage in the MySQL database), the driver can at most return 64 bytes (for SQL_C_CHAR) or 64*2 bytes (for SQL_C_WCHAR) for the trigger name when it converts from UTF-8 to the client ANSI codepage or to UTF-16 on Windows.

Here is the pertinent ODBC documentation about SQL_DESC_OCTET_LENGTH: http://msdn.microsoft.com/en-us/library/ms713979(VS.85).aspx

> Transfer Octet Length
> The transfer octet length of a column is the maximum number of bytes returned to the application when data is transferred to its default C data type. For character data, the transfer octet length does not include space for the null-termination character. The transfer octet length of a column may be different than the number of bytes required to store the data on the data source.

I hope this explains the issue a little better. Thanks.
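A worked version of the trigger-name example above, under the assumption stated in this comment that the driver converts utf8 to the client codepage (SQL_C_CHAR) or to UTF-16 (SQL_C_WCHAR); this is plain arithmetic, not actual driver behaviour:

```c
#include <assert.h>

int main(void)
{
    const unsigned name_chars     = 64;              /* trigger name length in characters   */
    const unsigned server_storage = name_chars * 3;  /* up to 3 bytes/char stored as utf8   */
    const unsigned ansi_transfer  = name_chars * 1;  /* max bytes when bound as SQL_C_CHAR  */
    const unsigned wide_transfer  = name_chars * 2;  /* max bytes when bound as SQL_C_WCHAR */

    assert(server_storage == 192);
    assert(ansi_transfer == 64 && wide_transfer == 128);
    return 0;
}
```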
[18 Dec 2009 20:47]
Farid Zidan
Please ignore the part about the transfer octet length always being divisible by 2 (it applies to character data only when the resultset column target type is SQL_C_WCHAR).
[23 Dec 2009 12:57]
Bogdan Degtyariov
Farid, Thanks for the clarification. That makes sense.
[23 Dec 2009 15:06]
Farid Zidan
Hi Bogdan,

My analysis of this issue assumes that the MySQL ODBC driver converts from UTF-8 to the client ANSI codepage when transferring the data of a UTF-8 column that is bound as SQL_C_CHAR, and to UTF-16 (no surrogates, or rather UCS-2) when transferring the data of a UTF-8 column that is bound as SQL_C_WCHAR. I haven't looked at the MySQL ODBC driver code, but if that is the case, then yes, the driver would transfer at most x or x*2 bytes for a UTF-8 column that is defined as x characters wide. Thanks.
[23 Dec 2009 23:37]
Farid Zidan
Hi Bogdan,

One reservation I had was about UTF-16 surrogates. After doing more research on the web (http://unicode.org/faq/utf_bom.html#utf8-2), I am now more confident about my earlier assumption about the maximum number of bytes required to convert an x-byte UTF-8 string to an ANSI codepage (at most x bytes) or to a UTF-16 string (at most x*2 bytes). As the reference above notes, a UTF-16 surrogate pair (4 bytes) also requires 4 bytes to store in UTF-8. Here is the quoted text from that reference:

> Q: How do I convert a UTF-16 surrogate pair such as <D800 DC00> to UTF-8? As one four-byte sequence or as two separate 3-byte sequences?
> A: The definition of UTF-8 requires that supplementary characters (those using surrogate pairs in UTF-16) be encoded with a single four-byte sequence.

To summarize: when a UTF-8 column defined as x characters is bound as SQL_C_CHAR, the driver will transfer at most x bytes to the client, containing the result of converting the UTF-8 data to the client codepage. If the column is bound as SQL_C_WCHAR, the driver will transfer at most x*2 bytes to the client, where the data is the result of converting the UTF-8 data to UTF-16 (or UCS-2). In this example, the server may store the column in up to 4*x bytes, since UTF-8 encoding may use from 1 to 4 bytes to encode a single Unicode character.
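To round off the summary, a small hypothetical helper that encodes the bounds discussed in this thread for a utf8 column declared as x characters; the 4-byte server-side figure follows the assumption in the comment above:

```c
#ifdef _WIN32
#include <windows.h>
#endif
#include <stddef.h>
#include <sql.h>
#include <sqlext.h>

/* Maximum bytes transferred to the client for a utf8 column of x_chars
 * characters: x for SQL_C_CHAR (client codepage), x*2 for SQL_C_WCHAR
 * (UTF-16/UCS-2). */
static size_t max_transfer_octets(size_t x_chars, SQLSMALLINT c_type)
{
    return (c_type == SQL_C_WCHAR) ? x_chars * 2 : x_chars;
}

/* Maximum bytes the server itself may need to store the column if
 * 4-byte UTF-8 sequences are involved. */
static size_t max_server_storage(size_t x_chars)
{
    return x_chars * 4;
}
```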