MySQL Bugs: #14052: No UTF-16 charset encoding available (surrogate pairs)

Bug #14052	No UTF-16 charset encoding available (surrogate pairs)
Submitted:	15 Oct 2005 22:50	Modified:	1 Dec 2008 13:27
Reporter:	Brion Vibber	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server	Severity:	S4 (Feature request)
Version:	5.0.13	OS:	Linux (Linux)
Assigned to:		CPU Architecture:	Any

Description:
At Wikipedia we have some data which contains high Unicode characters beyond the Basic Multilingual Plane (>65536). In UTF-8 encoding these take up 4 bytes; in UTF-16 they would be stored as a "surrogate pair" of two 16-bit pseudocharacters.

Currently MySQL's Unicode charset support doesn't seem to allow storing these characters in text-encoded fields when using UTF-8 to communicate to the server:
* ucs2  stores four question marks "????" in place of the char
* utf8 truncates the string at the point the char appears

Some quick testing indicates that I can store surrogate pseudocharacters if I explicitly code for them in pseudo-UTF-8, but this complicates communicating with the server.

How to repeat:
Tested with PHP 5.1.0RC1:
<?php
mysql_connect("localhost", "unitest");
mysql_select_db("unitest");
mysql_query("SET NAMES utf8");
mysql_query("CREATE TABLE demo(
    wide VARCHAR(50) CHARACTER SET ucs2,
    utf VARCHAR(50) CHARACTER SET utf8,
    raw VARBINARY(50)
)");
$char = "\xf0\xa8\xa7\x80";
mysql_query("INSERT INTO demo(wide,utf,raw) VALUES ('$char','$char','$char')");
$result = mysql_query("SELECT * FROM demo");
$row = mysql_fetch_array($result, MYSQL_ASSOC);
foreach($row as $field => $val) {
    $match = ($val == $char) ? "OK" : "FAILED";
    print "$field: $match ($val)\n";
}
?>

Outputs:
wide: FAILED (????)
utf: FAILED ()
raw: OK (

Suggested fix:
A sufficient compromise for our purposes would be for the ucs2 charset to be enhanced (or a second utf16 charset made available) to do encoding and decoding of UTF-16 surrogate pairs when communicating with the server in UTF-8.

So if we send the UTF-8 string: "\xf0\xa8\xa7\x80"
It should interpret it as the UTF-16 string: "\ud862\uddc0"
And on select we should get back UTF-8: "\xf0\xa8\xa7\x80"

Continuing to use UCS-2 collation semantics would be good enough for what we need; this would allow us to use collatable Unicode text fields for page titles and usernames without rare but existing data getting corrupted.

See also Bug 25666.

This is part of MySQL 6.0 alpha.

Support for utf16 was added into mysql-6.0.
Closing the feature request.