Bug #28925 GROUP_CONCAT inserts wrong separators for a ucs2 column
Submitted: 6 Jun 2007 12:22 Modified: 3 Jul 2007 19:47
Reporter: Alexander Barkov Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S3 (Non-critical)
Version:5.0, 5.1 OS:Any
Assigned to: Alexander Barkov CPU Architecture:Any

[6 Jun 2007 12:22] Alexander Barkov
Description:
When processing an ucs2 column,
GROUP_CONCAT uses 0x2C as a separator.
The correct separator should be 0x002C instead.

How to repeat:
mysql> drop table if exists t1;
Query OK, 0 rows affected (0.00 sec)

mysql> create table t1 (a char(1) character set ucs2);
Query OK, 0 rows affected (0.00 sec)

mysql> insert into t1 values ('a'),('b'),('c');
Query OK, 3 rows affected (0.00 sec)
Records: 3  Duplicates: 0  Warnings: 0

mysql> select hex(group_concat(a)) from t1;
+----------------------+
| hex(group_concat(a)) |
+----------------------+
| 00612C00622C0063     |
+----------------------+
1 row in set (0.00 sec)

Suggested fix:
The above result is wrong, because it is a mixture of
two-byte characters and one-byte separators.

The expected result is: 0061 002C 0062 002C 0063

I.e. all characters, including separators (comma), must occupy two bytes.

0061 - letter a
002C - comma
0062 - letter b
002C - comma
0063 - letter c
[6 Jun 2007 12:33] MySQL Verification Team
Thank you for the bug report.
[7 Jun 2007 0:30] Martin Friebe
patch + test

Attachment: gconcat_ucs.patch (text/x-patch), 2.34 KiB.

[7 Jun 2007 0:35] Martin Friebe
This should also account for the opposite case.

From my understanding of the code:
- with charset_connection=ucs2
- and an explicit specified separator
the separator it will be in ucs2.
If the result string is not ucs2, this would have caused the same issue.
[7 Jun 2007 4:47] Alexander Barkov
Martin, you're absolutely right about the opposite case.
This test demonstrates the problem:

set names latin1;
drop table if exists t1;
create table t1 (a char(1) character set latin1);
insert into t1 values ('a'),('b'),('c');
set character_set_connection=ucs2;
select hex(group_concat(a separator ',')) from t1;
drop table t1;

And its result is:

+------------------------------------+
| hex(group_concat(a separator ',')) |
+------------------------------------+
| 61002C62002C63                     |
+------------------------------------+
1 row in set (0.00 sec)

which is:
61   - letter 'a' in latin1
002C - comma in ucs2
62   - letter 'b' in latin1
00C2 - comma in ucs2
63   - comma in latin1

We got a mixture latin1 and ucs2 again, as you expected.

Can you please verify that your fix covers this case as well,
and cover it in the test.

Many thanks!

Also, there is another related mistake in sql_yacc.yy, in this piece of code:

opt_gconcat_separator:
    /* empty */        { $$ = new (YYTHD->mem_root) String(",",1,default_charset_info); }
    |SEPARATOR_SYM text_string  { $$ = $2; };

If mysqld is started with --default-character-set=ucs2, then the above
code will initialize the string with ascii value, but will assign ucs2
character set to it, which is wrong.
It should be &my_charset_latin1 instead of default_charset_info:

opt_gconcat_separator:
    /* empty */        { $$ = new (YYTHD->mem_root) String(",", 1, &my_charset_latin1); }
    |SEPARATOR_SYM text_string  { $$ = $2; };

This change should be included in the final patch as well.
[7 Jun 2007 5:23] Alexander Barkov
Another possible way to fix this bug would be:

- remove the "String *separator" member from the Item_func_group_concat class,
- add "uint arg_count_separator" member
- change the type of separator from *String to *Item:
  in sql_yacc.yy  and in arguments to constructor 
  Item_func_group_concat::Item_func_group_concat()
- if separator is given explicitly in the query, pass it to constructor and
  initialize arg_count_separator to 1.
- if separator is not given explicitly, pass NULL to constructor 
  and set arg_count_separator to 0.
- In constructor: put the separator (if it exists) into the "args" array,
  between the "fields" and the "order" arguments.

After this change, the bug would be fixed automatically:

The character set aggregation code in Item_func_group_concat::fix_fields would
convert all necessary arguments (including the separator - if it presents)
to the result character set:

 if (agg_item_charsets(collation, func_name(),
                        args,
                        /* skip charset aggregation for order columns */
                        arg_count - arg_count_order,
                        MY_COLL_ALLOW_CONV, 1))
    return 1;

... like it is done in all other string functions, e.g. CONCAT().

If separator is not given explicitly, then it could be initialized
to comma *after* the above call, when we already know the result character set.

This change would also give more benifits:

- allow using character set introducers together with HEX,
  which would be very convenient for UCS2:

  GROUP_CONCAT(ucs2_column SEPARATOR _ucs2 0x002C)

  currently this is not possible, and produce syntax error :(

- allow expressions in separator - not only just literals.

- allow binding of separator in prepared statements

Unfortunately, this is a lot of changes.
Of course, we should avoid doing that in 5.0 or 5.1, for stability reasons.

This is just an idea how it can be changed in long terms.
[7 Jun 2007 9:43] Martin Friebe
Except for
 GROUP_CONCAT(ucs2_column SEPARATOR _ucs2 0x002C)
is currently not a valid syntax.
The Parser doesn't accept charset modificators for the SEPARATOR.

Therefore the Separator would always have less priority than a column, or a constant with an explicit charset.

But I agree, it would be a good thing to add this feature to the SEPARATOR.
[7 Jun 2007 16:25] Martin Friebe
using item for separator

Attachment: gconcat_ucs.patch (text/x-patch), 6.93 KiB.

[7 Jun 2007 16:33] Martin Friebe
As suggested by Alexander Barkov.

The followig could be discussed:
  Item *separator;

could also be 
  Item_string *separator;

but I see no benefit in that? Except the string could be extracted once in fix_fields, instead of calling separator->str_val(tmp) each time.
On the other hand, this is more generic, and could also allow to have the seperator coming from a table-field?

Or could be
  Item **separator;
and point to the pointer in args. That would save the need of copying the (potentialy) changed args pointer back to the field (after agg_item_charsets)
[7 Jun 2007 18:16] Martin Friebe
the initial patch enhanced (patch for mysql GA version)

Attachment: gconcat_ucs_string.patch (text/x-patch), 3.54 KiB.

[7 Jun 2007 18:23] Martin Friebe
ok an overview of all recent posts

* gconcat_ucs_string.patch
  [7 Jun 20:16]  (the last patch submited, before this text)
is the original patch with the additions suggested by Alexander:
- added tests for connection_charset=ucs2
- fix of charset in parser

* gconcat_ucs.patch 
  [7 Jun 18:25]
is the extended patch for later mysql versions. Implementing an Item as suggest by Alexander's 2nd post
[8 Jun 2007 5:41] Alexander Barkov
The test case demonstrating problem when
"mysqld --default-character-set=ucs2":

create table t1 (a char(1) character set latin1);
insert into t1 values ('a'),('b'),('c');
select hex(group_concat(a)) from t1;
drop table t1;

Result is:

hex(group_concat(a))
616263

I.e. there's no separator at all.

This is well fixed by changing default_charset_info
to &my_charset_bin here:

opt_gconcat_separator:
    /* empty */        { $$ = new (YYTHD->mem_root) String(",",1,default_charset_info); }
    |SEPARATOR_SYM text_string  { $$ = $2; };
[8 Jun 2007 6:06] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28375

ChangeSet@1.2516, 2007-06-08 11:06:02+05:00, bar@mysql.com +8 -0
  Bug#28925 GROUP_CONCAT inserts wrong separators for a ucs2 column
  Problem: separator was not converted to the result character set,
  so the result was a mixture of two different character sets,
  which was especially bad for UCS2.
  Fix: convert separator to the result character set.
  
  Many thanks for Martin Friebe for helping us again.
[8 Jun 2007 6:07] Alexander Barkov
Sorry, there's a mistake in my previous post.
The correct way is:

This is well fixed by changing default_charset_info to &my_charset_latin1.
[8 Jun 2007 6:11] Alexander Barkov
Dear Martin,

Many thanks for your contribution!

I have changed the patch a little bit:

1. Moved the test case into ctype_ucs,
  to avoid func_gconcat failures when no UCS2 is compiled.

2. Modified the internals of the new method String::convert_charset(),
  not to duplicate code which already exists.

3. Added a new test case into ctype_ucs2_def, to check
  that after the fix it works fine with --default-character-set=ucs2

Do you think that these changes are fine?
Thanks!
[8 Jun 2007 6:33] Alexander Barkov
Martin, 

many thanks for providing the patch for the "item" version.
I added a new task for that, which should appear soon on http://fourge.mysql.com/
It refers to your patch.
[8 Jun 2007 8:05] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/28380

ChangeSet@1.2516, 2007-06-08 13:04:51+05:00, bar@mysql.com +9 -0
  Bug#28925 GROUP_CONCAT inserts wrong separators for a ucs2 column
  Problem: separator was not converted to the result character set,
  so the result was a mixture of two different character sets,
  which was especially bad for UCS2.
  Fix: convert separator to the result character set.
  
  Many thanks for Martin Friebe for helping us again.
[8 Jun 2007 10:50] Martin Friebe
Great. Had a look at the final version.

An idea which is not related to the bug:
After follow up the Strings internal mem handling. Wouldn't it be convenient to give it *optional* access to some of the root-memory pools (alloc_root)?
[22 Jun 2007 12:20] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/29385

ChangeSet@1.2516, 2007-06-22 17:18:40+05:00, bar@mysql.com +6 -0
  Bug#28925 GROUP_CONCAT inserts wrong separators for a ucs2 column
  Problem: separator was not converted to the result character set,
  so the result was a mixture of two different character sets,
  which was especially bad for UCS2.
  Fix: convert separator to the result character set.
[22 Jun 2007 13:33] Alexander Barkov
Pushed into 5.0.46-rpl
Pushed into 5.1.20-rpl
[3 Jul 2007 18:57] Bugs System
Pushed into 5.0.46
[3 Jul 2007 18:57] Bugs System
Pushed into 5.1.21-beta
[3 Jul 2007 19:47] Paul DuBois
Noted in 5.0.46, 5.1.21 changelogs.

For a ucs2 column, GROUP_CONCAT() did not convert separators to the
result character set before inserting them, producing a result
containing a mixture of two different character sets.