Bug #14052 No UTF-16 charset encoding available (surrogate pairs)
Submitted: 15 Oct 2005 22:50 Modified: 1 Dec 2008 13:27
Reporter: Brion Vibber Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S4 (Feature request)
Version:5.0.13 OS:Linux (Linux)
Assigned to: CPU Architecture:Any

[15 Oct 2005 22:50] Brion Vibber
Description:
At Wikipedia we have some data which contains high Unicode characters beyond the Basic Multilingual Plane (>65536). In UTF-8 encoding these take up 4 bytes; in UTF-16 they would be stored as a "surrogate pair" of two 16-bit pseudocharacters.

Currently MySQL's Unicode charset support doesn't seem to allow storing these characters in text-encoded fields when using UTF-8 to communicate to the server:
* ucs2  stores four question marks "????" in place of the char
* utf8 truncates the string at the point the char appears

Some quick testing indicates that I can store surrogate pseudocharacters if I explicitly code for them in pseudo-UTF-8, but this complicates communicating with the server.

How to repeat:
Tested with PHP 5.1.0RC1:
<?php
mysql_connect("localhost", "unitest");
mysql_select_db("unitest");
mysql_query("SET NAMES utf8");
mysql_query("CREATE TABLE demo(
    wide VARCHAR(50) CHARACTER SET ucs2,
    utf VARCHAR(50) CHARACTER SET utf8,
    raw VARBINARY(50)
)");
$char = "\xf0\xa8\xa7\x80";
mysql_query("INSERT INTO demo(wide,utf,raw) VALUES ('$char','$char','$char')");
$result = mysql_query("SELECT * FROM demo");
$row = mysql_fetch_array($result, MYSQL_ASSOC);
foreach($row as $field => $val) {
    $match = ($val == $char) ? "OK" : "FAILED";
    print "$field: $match ($val)\n";
}
?>

Outputs:
wide: FAILED (????)
utf: FAILED ()
raw: OK (

Suggested fix:
A sufficient compromise for our purposes would be for the ucs2 charset to be enhanced (or a second utf16 charset made available) to do encoding and decoding of UTF-16 surrogate pairs when communicating with the server in UTF-8.

So if we send the UTF-8 string: "\xf0\xa8\xa7\x80"
It should interpret it as the UTF-16 string: "\ud862\uddc0"
And on select we should get back UTF-8: "\xf0\xa8\xa7\x80"

Continuing to use UCS-2 collation semantics would be good enough for what we need; this would allow us to use collatable Unicode text fields for page titles and usernames without rare but existing data getting corrupted.
[19 Jan 2007 11:21] [ name withheld ]
See also Bug 25666.
[7 Feb 2008 18:58] Peter Gulutzan
This is part of MySQL 6.0 alpha.
[1 Dec 2008 13:27] Alexander Barkov
Support for utf16 was added into mysql-6.0.
Closing the feature request.