Description:
UPPER() and LOWER() functions should work the same in all character sets unless
(a) the appropriate letter is not in the character set all
(b) the letter is dotted or dotless 'i' in character set latin5 collate latin5_turkish_ci
(c) the character set is BINARY.
But in several character sets that is not the case.
dec8, cp850, latin1, koi8u, keybcs2, macroman, cp1256.
In total there are 98 affected characters.
For this bug report I only looked at non-CJK character sets, since I expect
Chinese/Japanese/Korean errors will be fixed separately by WL#4583.
There follows a detailed list with these columns:
ucs2 The displayed value (it doesn't matter if you can't read this)
v_character_set_name The name of the character set which has the problem character
HEX(ucs2) The ucs2 hexadecimal value, that is, Unicode code point
HEX(c) The same character, in v_character_set
HEX(c_upper) What UPPER(c) value should be according to UCS2 rules
HEX(c_lower) What LOWER(c) value should be according to UCS2 rules
HEX(UPPER(c)) What UPPER(c) actually returns for v_character_set
HEX(LOWER(c)) What LOWER(c) actually returns for v_character_set
For example, for the first item in the list (OE ligature in character set dec8),
the correct lower case value according to UCS2 rules is F7 (oe ligature), but in
fact LOWER(0xd7) returns 0xd7. You can confirm this with a statement like
SELECT HEX(LOWER(_dec8 0xd7));
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| Œ | dec8 | 0152 | D7 | D7 | F7 | D7 | D7 |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (2.34 sec)
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| œ | dec8 | 0153 | F7 | D7 | F7 | F7 | F7 |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (2.34 sec)
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| à | cp850 | 00E0 | 85 | B7 | 85 | 41 | 85 |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (5 min 58.44 sec)
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| á | cp850 | 00E1 | A0 | B5 | A0 | 41 | A0 |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (5 min 58.44 sec)
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| â | cp850 | 00E2 | 83 | B6 | 83 | 41 | 83 |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (5 min 58.45 sec)
... and here I truncate. I will attach the complete list as a file.
How to repeat:
SELECT HEX(LOWER(_dec8 0xd7)); etc.
But I don't think it's necessary to verify for each character.
I think the only disputable item is 'i' / 'I' in latin5, which I
class as an error because MySQL follows the Turkish rules even
if collation is latin5_bin.