Bug #43037 Character sets: upper() fails: dec8 cp850 latin1 koi8u keybcs2 macroman cp1256
Submitted: 19 Feb 2009 21:56 Modified: 20 Feb 2009 11:20
Reporter: Peter Gulutzan Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.1, 5.4, 5.5, 6.1.0-alpha OS:Linux (SUSE 10.0 / 32-bit)
Assigned to: Assigned Account CPU Architecture:Any

[19 Feb 2009 21:56] Peter Gulutzan
Description:
UPPER() and LOWER() functions should work the same in all character sets unless
(a) the appropriate letter is not in the character set all
(b) the letter is dotted or dotless 'i' in character set latin5 collate latin5_turkish_ci
(c) the character set is BINARY.
But in several character sets that is not the case.
dec8, cp850, latin1, koi8u, keybcs2, macroman, cp1256.
In total there are 98 affected characters.

For this bug report I only looked at non-CJK character sets, since I expect
Chinese/Japanese/Korean errors will be fixed separately by WL#4583.

There follows a detailed list with these columns:
ucs2                  The displayed value (it doesn't matter if you can't read this)
v_character_set_name  The name of the character set which has the problem character
HEX(ucs2)             The ucs2 hexadecimal value, that is, Unicode code point
HEX(c)                The same character, in v_character_set
HEX(c_upper)          What UPPER(c) value should be according to UCS2 rules
HEX(c_lower)          What LOWER(c) value should be according to UCS2 rules
HEX(UPPER(c))         What UPPER(c) actually returns for v_character_set
HEX(LOWER(c))         What LOWER(c) actually returns for v_character_set
For example, for the first item in the list (OE ligature in character set dec8),
the correct lower case value according to UCS2 rules is F7 (oe ligature), but in
fact LOWER(0xd7) returns 0xd7. You can confirm this with a statement like
SELECT HEX(LOWER(_dec8 0xd7));

+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| Œ   | dec8                 | 0152      | D7     | D7           | F7           | D7            | D7            |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (2.34 sec)

+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| œ   | dec8                 | 0153      | F7     | D7           | F7           | F7            | F7            |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (2.34 sec)

+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| à   | cp850                | 00E0      | 85     | B7           | 85           | 41            | 85            |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (5 min 58.44 sec)

+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| á   | cp850                | 00E1      | A0     | B5           | A0           | 41            | A0            |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (5 min 58.44 sec)

+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| ucs2 | v_character_set_name | HEX(ucs2) | HEX(c) | HEX(c_upper) | HEX(c_lower) | HEX(UPPER(c)) | HEX(LOWER(c)) |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
| â   | cp850                | 00E2      | 83     | B6           | 83           | 41            | 83            |
+------+----------------------+-----------+--------+--------------+--------------+---------------+---------------+
1 row in set (5 min 58.45 sec)

... and here I truncate. I will attach the complete list as a file.

How to repeat:
SELECT HEX(LOWER(_dec8 0xd7)); etc.
But I don't think it's necessary to verify for each character.
I think the only disputable item is 'i' / 'I' in latin5, which I
class as an error because MySQL follows the Turkish rules even
if collation is latin5_bin.
[19 Feb 2009 21:58] Peter Gulutzan
List of characters for Bug#43037

Attachment: bug43037.txt (text/plain), 55.74 KiB.

[20 Feb 2009 11:20] Susanne Ebrecht
Set to verified