Description:
At Wikipedia we have some data which contains high Unicode characters beyond the Basic Multilingual Plane (>65536). In UTF-8 encoding these take up 4 bytes; in UTF-16 they would be stored as a "surrogate pair" of two 16-bit pseudocharacters.
Currently MySQL's Unicode charset support doesn't seem to allow storing these characters in text-encoded fields when using UTF-8 to communicate to the server:
* ucs2 stores four question marks "????" in place of the char
* utf8 truncates the string at the point the char appears
Some quick testing indicates that I can store surrogate pseudocharacters if I explicitly code for them in pseudo-UTF-8, but this complicates communicating with the server.
How to repeat:
Tested with PHP 5.1.0RC1:
<?php
mysql_connect("localhost", "unitest");
mysql_select_db("unitest");
mysql_query("SET NAMES utf8");
mysql_query("CREATE TABLE demo(
wide VARCHAR(50) CHARACTER SET ucs2,
utf VARCHAR(50) CHARACTER SET utf8,
raw VARBINARY(50)
)");
$char = "\xf0\xa8\xa7\x80";
mysql_query("INSERT INTO demo(wide,utf,raw) VALUES ('$char','$char','$char')");
$result = mysql_query("SELECT * FROM demo");
$row = mysql_fetch_array($result, MYSQL_ASSOC);
foreach($row as $field => $val) {
$match = ($val == $char) ? "OK" : "FAILED";
print "$field: $match ($val)\n";
}
?>
Outputs:
wide: FAILED (????)
utf: FAILED ()
raw: OK (
Suggested fix:
A sufficient compromise for our purposes would be for the ucs2 charset to be enhanced (or a second utf16 charset made available) to do encoding and decoding of UTF-16 surrogate pairs when communicating with the server in UTF-8.
So if we send the UTF-8 string: "\xf0\xa8\xa7\x80"
It should interpret it as the UTF-16 string: "\ud862\uddc0"
And on select we should get back UTF-8: "\xf0\xa8\xa7\x80"
Continuing to use UCS-2 collation semantics would be good enough for what we need; this would allow us to use collatable Unicode text fields for page titles and usernames without rare but existing data getting corrupted.