Bug #32914 Character sets: illegal characters in utf8 and utf32 columns
Submitted: 2 Dec 2007 22:05
Modified: 29 Jul 2008 17:31
Reporter: Peter Gulutzan
Status: Closed
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version: 6.0.4-alpha-debug
OS: Linux (SUSE 10 64-bit)
Assigned to: Alexander Barkov
CPU Architecture: Any

[2 Dec 2007 22:05] Peter Gulutzan
Description:
I'm using the new Unicode character sets with MySQL 6.0.

The specification was (see WL#1213):
"Validity checks are: ... in any encoding
the code point values must not be greater than 0x10ffff".

But I can insert characters >= code point value 10ffff
in utf8 and utf32 encodings.

How to repeat:
mysql> create table t (utf32 char(1) character set utf32, utf8 char(1) character set utf8);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t values (0x10ffff,0xf48fbfbf);
Query OK, 1 row affected (0.00 sec)

mysql> select hex(utf32),hex(utf8) from t;
+------------+-----------+
| hex(utf32) | hex(utf8) |
+------------+-----------+
| 0010FFFF   | F48FBFBF  |
+------------+-----------+
1 row in set (0.00 sec)
[2 Dec 2007 22:46] MySQL Verification Team
Thank you for the bug report. Verified as described.
[6 Dec 2007 9:35] Alexander Barkov
Peter, you say:

> I can insert characters >= code point value 10ffff
> in utf8 and utf32 encodings.

But then you insert U+10FFFF, which *IS* a valid character:

>
> How to repeat:
> mysql> create table t (utf32 char(1) character set utf32, utf8 char(1) character set utf8);
> Query OK, 0 rows affected (0.01 sec)

Please clarify what the problem is.
[7 Dec 2007 18:42] Peter Gulutzan
I'm sorry, the 'how to repeat' indeed showed insertion
of maximum legal value, rather than insertion of
minimum illegal value. Here is a new 'how to repeat':

mysql> create table t (utf32 char(1) character set utf32, utf8 char(1) character set utf8);
Query OK, 0 rows affected (0.01 sec)

mysql> insert into t values (0x110000,0xf4908080);
Query OK, 1 row affected (0.00 sec)

mysql> select hex(utf32),hex(utf8) from t;
+------------+-----------+
| hex(utf32) | hex(utf8) |
+------------+-----------+
| 00110000   | F4908080  |
+------------+-----------+
1 row in set (0.00 sec)
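
For reference, the arithmetic behind the two pairs of test values can be sketched in Python. This is only an illustration, not MySQL code: the helper names decode_utf8_4byte and is_valid_code_point are made up for the example, and the check covers only the upper bound quoted from WL#1213 (it ignores other rules such as surrogates).

```python
MAX_CODE_POINT = 0x10FFFF  # upper bound quoted from WL#1213


def decode_utf8_4byte(raw: bytes) -> int:
    """Extract the code point from a 4-byte UTF-8 sequence.

    Hypothetical helper: it checks only the bit patterns of the lead and
    continuation bytes and deliberately applies no range check.
    """
    if len(raw) != 4 or raw[0] & 0xF8 != 0xF0:
        raise ValueError("not a 4-byte UTF-8 sequence")
    if any(b & 0xC0 != 0x80 for b in raw[1:]):
        raise ValueError("bad continuation byte")
    return (
        ((raw[0] & 0x07) << 18)
        | ((raw[1] & 0x3F) << 12)
        | ((raw[2] & 0x3F) << 6)
        | (raw[3] & 0x3F)
    )


def is_valid_code_point(cp: int) -> bool:
    """Apply only the upper-bound rule from the bug report."""
    return 0 <= cp <= MAX_CODE_POINT


for hex_bytes in ("F48FBFBF", "F4908080"):
    cp = decode_utf8_4byte(bytes.fromhex(hex_bytes))
    print(f"{hex_bytes} -> U+{cp:06X} valid={is_valid_code_point(cp)}")
# F48FBFBF -> U+10FFFF valid=True
# F4908080 -> U+110000 valid=False
```

The first value is the maximum legal code point from the original 'how to repeat'; the second is the minimum illegal one used here.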
[1 Apr 2008 15:10] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/44738

ChangeSet@1.2614, 2008-04-01 20:03:44+05:00, bar@mysql.com +6 -0
  Bug#32914 Character sets: illegal characters in utf8 and utf32 columns
  Problem: inserting of Unicode values higher than U+10FFFF was possible
  into utf32 and utf8 columns
  Fix:
  - my_mb_wc_utf8mb4() was not strict enough. Adding more strict rules.
  - well_formed_copy_nchars() didn't check if left ZERO PADDING
    generated a wrong character. Adding extra checking for the leftmost
    (padded) character.
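
The two checks named in the fix can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the actual patch: is_strict_utf8_4byte and pad_and_check_utf32 are hypothetical names, the byte ranges are the standard well-formedness ranges for 4-byte UTF-8, and the padding step assumes a short binary value is left-padded with zero bytes to a multiple of 4, as the 0x110000 -> 00110000 result in this report suggests.

```python
MAX_CODE_POINT = 0x10FFFF


def is_strict_utf8_4byte(raw: bytes) -> bool:
    """Return True only for a well-formed 4-byte UTF-8 sequence.

    Hypothetical helper. Lead byte F0 needs a second byte of 90..BF
    (rules out overlong forms); F4 needs 80..8F (rules out values above
    U+10FFFF); lead bytes above F4 are never valid.
    """
    if len(raw) != 4:
        return False
    lead, second, third, fourth = raw
    if not 0xF0 <= lead <= 0xF4:
        return False
    low = 0x90 if lead == 0xF0 else 0x80
    high = 0x8F if lead == 0xF4 else 0xBF
    if not low <= second <= high:
        return False
    return 0x80 <= third <= 0xBF and 0x80 <= fourth <= 0xBF


def pad_and_check_utf32(raw: bytes) -> bytes:
    """Left-pad to a multiple of 4 bytes, then validate every 4-byte unit.

    Hypothetical helper: checking all units also covers the leftmost
    (padded) one, which is the case the fix description calls out.
    """
    padded = b"\x00" * ((-len(raw)) % 4) + raw
    for i in range(0, len(padded), 4):
        cp = int.from_bytes(padded[i:i + 4], "big")
        if cp > MAX_CODE_POINT:
            raise ValueError(f"invalid UTF-32 code point 0x{cp:X}")
    return padded
```

With these helpers, the values from the first 'how to repeat' (0x10ffff, 0xf48fbfbf) pass, while the values from the second one (0x110000, 0xf4908080) are rejected.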
[4 Apr 2008 5:51] Alexander Barkov
Pushed into 6.0.5-engines
[29 Jul 2008 3:09] Alexander Barkov
Appeared in bzr mysql-6.0.7-alpha.
[29 Jul 2008 17:31] Paul DuBois
Noted in 6.0.7 changelog.

It was possible to insert invalid Unicode characters (with code point
values greater than U+10FFFF) into utf8 and utf32 columns.
[13 Sep 2008 22:41] Bugs System
Pushed into 6.0.6-alpha  (revid:bar@mysql.com-20080715105907-h7yaof18afggvs7a) (version source revid:hakan@mysql.com-20080716105246-eg0utbybp122n2w9) (pib:3)