MySQL Bugs: #77713: dash '-' is not recognized in charset armscii8 on select where query

Bug #77713	dash '-' is not recognized in charset armscii8 on select where query
Submitted:	14 Jul 2015 8:39	Modified:	26 Oct 2015 16:29
Reporter:	Alexander Barkov
Status:	Closed
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.x, 5.5.46, 5.6.27	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
A few punctuation characters cannot be found in an ARMSCII8 column.
The affected characters are those that have two encodings in ARMSCII8:

Encoding#1 Encoding#2 Unicode Character Name
---------- ---------- ------- --------------
0x27       0xFF       U+0027  APOSTROPHE
0x28       0xA5       U+0028  LEFT PARENTHESIS
0x29       0xA4       U+0029  RIGHT PARENTHESIS
0x2C       0xAB       U+002C  COMMA
0x2D       0xAC       U+002D  HYPHEN-MINUS
0x2E       0xA9       U+002E  FULL STOP
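
To make the duplication concrete, here is a small sketch in C that records the six byte pairs from the table and checks whether two bytes name the same character. The struct and function names are hypothetical and are not part of the MySQL source; only the byte values and code points are taken from the table above.

```c
#include <stddef.h>

/* Hypothetical illustration; values copied from the table above. */
struct dup_encoding
{
  unsigned char ascii_byte;   /* Encoding#1 (ASCII range) */
  unsigned char high_byte;    /* Encoding#2 (high half)   */
  unsigned short unicode;     /* code point both bytes map to */
};

static const struct dup_encoding armscii8_dups[]=
{
  {0x27, 0xFF, 0x0027},  /* APOSTROPHE */
  {0x28, 0xA5, 0x0028},  /* LEFT PARENTHESIS */
  {0x29, 0xA4, 0x0029},  /* RIGHT PARENTHESIS */
  {0x2C, 0xAB, 0x002C},  /* COMMA */
  {0x2D, 0xAC, 0x002D},  /* HYPHEN-MINUS */
  {0x2E, 0xA9, 0x002E}   /* FULL STOP */
};

/* Return the code point if byte is one of the twelve listed bytes, else 0. */
unsigned short dup_to_unicode(unsigned char byte)
{
  size_t i;
  for (i= 0; i < sizeof(armscii8_dups) / sizeof(armscii8_dups[0]); i++)
  {
    if (byte == armscii8_dups[i].ascii_byte ||
        byte == armscii8_dups[i].high_byte)
      return armscii8_dups[i].unicode;
  }
  return 0;
}

/* Return 1 if a and b both encode the same doubly-encoded character. */
int same_character(unsigned char a, unsigned char b)
{
  unsigned short ua= dup_to_unicode(a);
  return ua != 0 && ua == dup_to_unicode(b);
}
```

For example, same_character(0x2D, 0xAC) is true, because both bytes stand for HYPHEN-MINUS, even though a byte-wise comparison of the two values fails.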

How to repeat:
set names utf8;
drop table if exists t1;
create table t1(a varchar(64) CHARACTER SET armscii8);
insert into t1 values ('abc-def');
select * from t1 where a = 'abc-def';
-> empty set (one row expected)

Suggested fix:
--- a/strings/ctype-simple.c
+++ b/strings/ctype-simple.c
@@ -1303,7 +1303,28 @@ create_fromuni(struct charset_info_st *cs,
       if (wc >= idx[i].uidx.from && wc <= idx[i].uidx.to && wc)
       {
         int ofs= wc - idx[i].uidx.from;
-        tab[ofs]= ch;
+        if (!tab[ofs] || tab[ofs] > 0x7F) /* Prefer ASCII*/
+        {
+          /*
+            Some character sets can have double encoding. For example,
+            in ARMSCII8, the following characters are encoded twice:
+
+            Encoding#1 Encoding#2 Unicode Character Name
+            ---------- ---------- ------- --------------
+            0x27       0xFF       U+0027  APOSTROPHE
+            0x28       0xA5       U+0028  LEFT PARENTHESIS
+            0x29       0xA4       U+0029  RIGHT PARENTHESIS
+            0x2C       0xAB       U+002C  COMMA
+            0x2D       0xAC       U+002D  HYPHEN-MINUS
+            0x2E       0xA9       U+002E  FULL STOP
+
+            That is, both 0x27 and 0xFF convert to Unicode U+0027.
+            When converting back from Unicode to ARMSCII,
+            we prefer the ASCII range, that is we want U+0027
+            to convert to 0x27 rather than to 0xFF.
+          */
+          tab[ofs]= ch;
+        }
       }
     }
   }
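
The effect of the patched condition can be seen in isolation with the following sketch. It is a simplified stand-in for the table-building loop, not the actual create_fromuni code: the function name build_reverse_table, the flat 256-entry table covering only U+0001..U+00FF, the prefer_ascii switch, and the assumption that bytes are visited in ascending order are all illustrative.

```c
#include <string.h>

#define REVERSE_TABLE_SIZE 256

/*
  Hypothetical helper: fill tab[] so that tab[wc] is the byte chosen
  for code point wc. to_uni[ch] is the code point for byte ch
  (0 means unmapped). Code points >= 256 are skipped to keep the
  sketch small.
*/
void build_reverse_table(const unsigned short to_uni[256],
                         unsigned char tab[REVERSE_TABLE_SIZE],
                         int prefer_ascii)
{
  int ch;
  memset(tab, 0, REVERSE_TABLE_SIZE);
  for (ch= 1; ch < 256; ch++)
  {
    unsigned short wc= to_uni[ch];
    if (!wc || wc >= REVERSE_TABLE_SIZE)
      continue;
    if (!prefer_ascii)
      tab[wc]= (unsigned char) ch;         /* unconditional: last byte wins */
    else if (!tab[wc] || tab[wc] > 0x7F)   /* condition from the patch */
      tab[wc]= (unsigned char) ch;
  }
}
```

With to_uni[0x2D] and to_uni[0xAC] both set to 0x002D, the unconditional assignment leaves tab[0x2D] at 0xAC, while the patched condition keeps 0x2D.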

Hello!

Thank you for the report and test case.

Thanks,
Umesh


Noted in 5.7.10, 5.8.0 changelogs.

Some punctuation characters in the armscii8 character set are
represented by two encodings, with the result that a character stored
using one encoding would not be found using a search with the other
encoding. For such characters, MySQL now selects the encoding with
the lowest value to consistently map instances onto the same
encoding.