| Bug #77713 | dash '-' is not recognized in charset armscii8 on select where query | ||
|---|---|---|---|
| Submitted: | 14 Jul 2015 8:39 | Modified: | 26 Oct 2015 16:29 |
| Reporter: | Alexander Barkov | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
| Version: | 5.x, 5.5.46, 5.6.27 | OS: | Any |
| Assigned to: | | CPU Architecture: | Any |
[14 Jul 2015 9:18]
MySQL Verification Team
Hello! Thank you for the report and test case. Thanks, Umesh
[14 Jul 2015 9:21]
MySQL Verification Team
In order to submit contributions you must first sign the Oracle Contribution Agreement (OCA). For additional information please check http://www.oracle.com/technetwork/community/oca-486395.html.
[26 Oct 2015 16:29]
Paul DuBois
Noted in 5.7.10, 5.8.0 changelogs. Some punctuation characters in the armscii8 character set are represented by two encodings, with the result that a character stored using one encoding would not be found using a search with the other encoding. For such characters, MySQL now selects the encoding with the lowest value to consistently map instances onto the same encoding.
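The "lowest value" rule in the changelog note can be illustrated with a small sketch. This is not the server's code: the `enc_pair` type, the `double_encoded` array, and `lowest_encoding()` are hypothetical names, and the data covers only the six doubly-encoded characters listed in this report.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical pair type: one byte of the 8-bit character set and the
   Unicode code point that byte decodes to. */
typedef struct {
  uint8_t byte;
  uint16_t unicode;
} enc_pair;

/* Only the six doubly-encoded characters named in the report. */
static const enc_pair double_encoded[] = {
  {0x27, 0x0027}, {0xFF, 0x0027}, /* APOSTROPHE */
  {0x28, 0x0028}, {0xA5, 0x0028}, /* LEFT PARENTHESIS */
  {0x29, 0x0029}, {0xA4, 0x0029}, /* RIGHT PARENTHESIS */
  {0x2C, 0x002C}, {0xAB, 0x002C}, /* COMMA */
  {0x2D, 0x002D}, {0xAC, 0x002D}, /* HYPHEN-MINUS */
  {0x2E, 0x002E}, {0xA9, 0x002E}  /* FULL STOP */
};
static const size_t double_encoded_count =
    sizeof(double_encoded) / sizeof(double_encoded[0]);

/* Return the lowest byte that decodes to wc, or 0 if no entry matches. */
uint8_t lowest_encoding(const enc_pair *pairs, size_t n, uint16_t wc)
{
  int best = -1; /* -1 means "no candidate seen yet" */
  assert(pairs != NULL);
  for (size_t i = 0; i < n; i++) {
    if (pairs[i].unicode == wc && (best < 0 || pairs[i].byte < best))
      best = pairs[i].byte;
  }
  return best < 0 ? 0 : (uint8_t)best;
}
```

With this data, `lowest_encoding(double_encoded, double_encoded_count, 0x002D)` picks 0x2D over 0xAC, so every HYPHEN-MINUS maps onto the same byte regardless of which encoding was seen first.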

Description:
A few punctuation characters cannot be found in an ARMSCII8 column. The affected characters are those that have double encoding.

    Encoding#1 Encoding#2 Unicode Character Name
    ---------- ---------- ------- --------------
    0x27       0xFF       U+0027  APOSTROPHE
    0x28       0xA5       U+0028  LEFT PARENTHESIS
    0x29       0xA4       U+0029  RIGHT PARENTHESIS
    0x2C       0xAB       U+002C  COMMA
    0x2D       0xAC       U+002D  HYPHEN-MINUS
    0x2E       0xA9       U+002E  FULL STOP

How to repeat:

    set names utf8;
    drop table if exists t1;
    create table t1(a varchar(64) CHARACTER SET armscii8);
    insert into t1 values ('abc-def');
    select * from t1 where a = 'abc-def';
    -> empty set

Suggested fix:

    --- a/strings/ctype-simple.c
    +++ b/strings/ctype-simple.c
    @@ -1303,7 +1303,28 @@ create_fromuni(struct charset_info_st *cs,
             if (wc >= idx[i].uidx.from && wc <= idx[i].uidx.to && wc)
             {
               int ofs= wc - idx[i].uidx.from;
    -          tab[ofs]= ch;
    +          if (!tab[ofs] || tab[ofs] > 0x7F) /* Prefer ASCII*/
    +          {
    +            /*
    +              Some character sets can have double encoding. For example,
    +              in ARMSCII8, the following characters are encoded twice:
    +
    +              Encoding#1 Encoding#2 Unicode Character Name
    +              ---------- ---------- ------- --------------
    +              0x27       0xFF       U+0027  APOSTROPHE
    +              0x28       0xA5       U+0028  LEFT PARENTHESIS
    +              0x29       0xA4       U+0029  RIGHT PARENTHESIS
    +              0x2C       0xAB       U+002C  COMMA
    +              0x2D       0xAC       U+002D  HYPHEN-MINUS
    +              0x2E       0xA9       U+002E  FULL STOP
    +
    +              That is, both 0x27 and 0xFF convert to Unicode U+0027.
    +              When converting back from Unicode to ARMSCII,
    +              we prefer the ASCII range, that is we want U+0027
    +              to convert to 0x27 rather than to 0xFF.
    +            */
    +            tab[ofs]= ch;
    +          }
             }
           }
         }
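To see what the changed condition does, the loop can be reduced to a standalone sketch. This is a simplification, not the real `create_fromuni()`: `toy_to_uni()` and `build_from_uni()` are hypothetical helpers, the decode table holds only the twelve byte values from the table in the description, and a single Unicode range U+0020..U+007F is assumed in place of the server's index ranges.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define RANGE_FROM 0x0020u /* assumed single range, for illustration only */
#define RANGE_TO   0x007Fu
#define TAB_SIZE   (RANGE_TO - RANGE_FROM + 1u)

/* Fill a byte -> Unicode table with just the twelve entries from the report;
   every other byte is left unmapped (0). */
void toy_to_uni(uint16_t to_uni[256])
{
  memset(to_uni, 0, 256 * sizeof to_uni[0]);
  to_uni[0x27] = 0x0027; to_uni[0xFF] = 0x0027; /* APOSTROPHE */
  to_uni[0x28] = 0x0028; to_uni[0xA5] = 0x0028; /* LEFT PARENTHESIS */
  to_uni[0x29] = 0x0029; to_uni[0xA4] = 0x0029; /* RIGHT PARENTHESIS */
  to_uni[0x2C] = 0x002C; to_uni[0xAB] = 0x002C; /* COMMA */
  to_uni[0x2D] = 0x002D; to_uni[0xAC] = 0x002D; /* HYPHEN-MINUS */
  to_uni[0x2E] = 0x002E; to_uni[0xA9] = 0x002E; /* FULL STOP */
}

/* Build the Unicode -> byte table for the assumed range.
   prefer_ascii == 0: every matching byte overwrites the slot (old behaviour).
   prefer_ascii != 0: overwrite only an empty or non-ASCII slot (patched). */
void build_from_uni(const uint16_t to_uni[256], uint8_t tab[TAB_SIZE],
                    int prefer_ascii)
{
  assert(to_uni != NULL && tab != NULL);
  memset(tab, 0, TAB_SIZE * sizeof tab[0]);
  for (unsigned ch = 0; ch < 256; ch++) {
    uint16_t wc = to_uni[ch];
    if (wc >= RANGE_FROM && wc <= RANGE_TO && wc) {
      unsigned ofs = wc - RANGE_FROM;
      if (!prefer_ascii || !tab[ofs] || tab[ofs] > 0x7F)
        tab[ofs] = (uint8_t)ch;
    }
  }
}
```

Because bytes are visited in ascending order, the unconditional assignment leaves the slot for U+002D holding 0xAC, the last byte seen. With the extra condition the slot keeps 0x2D. Note that the condition prefers ASCII; it does not by itself pick the lower of two encodings that are both above 0x7F.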