Bug #28925 | GROUP_CONCAT inserts wrong separators for a ucs2 column | ||
---|---|---|---|
Submitted: | 6 Jun 2007 12:22 | Modified: | 3 Jul 2007 19:47 |
Reporter: | Alexander Barkov | Email Updates: | |
Status: | Closed | Impact on me: | |
Category: | MySQL Server: Charsets | Severity: | S3 (Non-critical) |
Version: | 5.0, 5.1 | OS: | Any |
Assigned to: | Alexander Barkov | CPU Architecture: | Any |
[6 Jun 2007 12:22]
Alexander Barkov
[6 Jun 2007 12:33]
MySQL Verification Team
Thank you for the bug report.
[7 Jun 2007 0:30]
Martin Friebe
patch + test
Attachment: gconcat_ucs.patch (text/x-patch), 2.34 KiB.
[7 Jun 2007 0:35]
Martin Friebe
This should also account for the opposite case. From my understanding of the code: - with charset_connection=ucs2 - and an explicit specified separator the separator it will be in ucs2. If the result string is not ucs2, this would have caused the same issue.
[7 Jun 2007 4:47]
Alexander Barkov
Martin, you're absolutely right about the opposite case. This test demonstrates the problem: set names latin1; drop table if exists t1; create table t1 (a char(1) character set latin1); insert into t1 values ('a'),('b'),('c'); set character_set_connection=ucs2; select hex(group_concat(a separator ',')) from t1; drop table t1; And its result is: +------------------------------------+ | hex(group_concat(a separator ',')) | +------------------------------------+ | 61002C62002C63 | +------------------------------------+ 1 row in set (0.00 sec) which is: 61 - letter 'a' in latin1 002C - comma in ucs2 62 - letter 'b' in latin1 00C2 - comma in ucs2 63 - comma in latin1 We got a mixture latin1 and ucs2 again, as you expected. Can you please verify that your fix covers this case as well, and cover it in the test. Many thanks! Also, there is another related mistake in sql_yacc.yy, in this piece of code: opt_gconcat_separator: /* empty */ { $$ = new (YYTHD->mem_root) String(",",1,default_charset_info); } |SEPARATOR_SYM text_string { $$ = $2; }; If mysqld is started with --default-character-set=ucs2, then the above code will initialize the string with ascii value, but will assign ucs2 character set to it, which is wrong. It should be &my_charset_latin1 instead of default_charset_info: opt_gconcat_separator: /* empty */ { $$ = new (YYTHD->mem_root) String(",", 1, &my_charset_latin1); } |SEPARATOR_SYM text_string { $$ = $2; }; This change should be included in the final patch as well.
[7 Jun 2007 5:23]
Alexander Barkov
Another possible way to fix this bug would be: - remove the "String *separator" member from the Item_func_group_concat class, - add "uint arg_count_separator" member - change the type of separator from *String to *Item: in sql_yacc.yy and in arguments to constructor Item_func_group_concat::Item_func_group_concat() - if separator is given explicitly in the query, pass it to constructor and initialize arg_count_separator to 1. - if separator is not given explicitly, pass NULL to constructor and set arg_count_separator to 0. - In constructor: put the separator (if it exists) into the "args" array, between the "fields" and the "order" arguments. After this change, the bug would be fixed automatically: The character set aggregation code in Item_func_group_concat::fix_fields would convert all necessary arguments (including the separator - if it presents) to the result character set: if (agg_item_charsets(collation, func_name(), args, /* skip charset aggregation for order columns */ arg_count - arg_count_order, MY_COLL_ALLOW_CONV, 1)) return 1; ... like it is done in all other string functions, e.g. CONCAT(). If separator is not given explicitly, then it could be initialized to comma *after* the above call, when we already know the result character set. This change would also give more benifits: - allow using character set introducers together with HEX, which would be very convenient for UCS2: GROUP_CONCAT(ucs2_column SEPARATOR _ucs2 0x002C) currently this is not possible, and produce syntax error :( - allow expressions in separator - not only just literals. - allow binding of separator in prepared statements Unfortunately, this is a lot of changes. Of course, we should avoid doing that in 5.0 or 5.1, for stability reasons. This is just an idea how it can be changed in long terms.
[7 Jun 2007 9:43]
Martin Friebe
Except for GROUP_CONCAT(ucs2_column SEPARATOR _ucs2 0x002C) is currently not a valid syntax. The Parser doesn't accept charset modificators for the SEPARATOR. Therefore the Separator would always have less priority than a column, or a constant with an explicit charset. But I agree, it would be a good thing to add this feature to the SEPARATOR.
[7 Jun 2007 16:25]
Martin Friebe
using item for separator
Attachment: gconcat_ucs.patch (text/x-patch), 6.93 KiB.
[7 Jun 2007 16:33]
Martin Friebe
As suggested by Alexander Barkov. The followig could be discussed: Item *separator; could also be Item_string *separator; but I see no benefit in that? Except the string could be extracted once in fix_fields, instead of calling separator->str_val(tmp) each time. On the other hand, this is more generic, and could also allow to have the seperator coming from a table-field? Or could be Item **separator; and point to the pointer in args. That would save the need of copying the (potentialy) changed args pointer back to the field (after agg_item_charsets)
[7 Jun 2007 18:16]
Martin Friebe
the initial patch enhanced (patch for mysql GA version)
Attachment: gconcat_ucs_string.patch (text/x-patch), 3.54 KiB.
[7 Jun 2007 18:23]
Martin Friebe
ok an overview of all recent posts * gconcat_ucs_string.patch [7 Jun 20:16] (the last patch submited, before this text) is the original patch with the additions suggested by Alexander: - added tests for connection_charset=ucs2 - fix of charset in parser * gconcat_ucs.patch [7 Jun 18:25] is the extended patch for later mysql versions. Implementing an Item as suggest by Alexander's 2nd post
[8 Jun 2007 5:41]
Alexander Barkov
The test case demonstrating problem when "mysqld --default-character-set=ucs2": create table t1 (a char(1) character set latin1); insert into t1 values ('a'),('b'),('c'); select hex(group_concat(a)) from t1; drop table t1; Result is: hex(group_concat(a)) 616263 I.e. there's no separator at all. This is well fixed by changing default_charset_info to &my_charset_bin here: opt_gconcat_separator: /* empty */ { $$ = new (YYTHD->mem_root) String(",",1,default_charset_info); } |SEPARATOR_SYM text_string { $$ = $2; };
[8 Jun 2007 6:06]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/28375 ChangeSet@1.2516, 2007-06-08 11:06:02+05:00, bar@mysql.com +8 -0 Bug#28925 GROUP_CONCAT inserts wrong separators for a ucs2 column Problem: separator was not converted to the result character set, so the result was a mixture of two different character sets, which was especially bad for UCS2. Fix: convert separator to the result character set. Many thanks for Martin Friebe for helping us again.
[8 Jun 2007 6:07]
Alexander Barkov
Sorry, there's a mistake in my previous post. The correct way is: This is well fixed by changing default_charset_info to &my_charset_latin1.
[8 Jun 2007 6:11]
Alexander Barkov
Dear Martin, Many thanks for your contribution! I have changed the patch a little bit: 1. Moved the test case into ctype_ucs, to avoid func_gconcat failures when no UCS2 is compiled. 2. Modified the internals of the new method String::convert_charset(), not to duplicate code which already exists. 3. Added a new test case into ctype_ucs2_def, to check that after the fix it works fine with --default-character-set=ucs2 Do you think that these changes are fine? Thanks!
[8 Jun 2007 6:33]
Alexander Barkov
Martin, many thanks for providing the patch for the "item" version. I added a new task for that, which should appear soon on http://fourge.mysql.com/ It refers to your patch.
[8 Jun 2007 8:05]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/28380 ChangeSet@1.2516, 2007-06-08 13:04:51+05:00, bar@mysql.com +9 -0 Bug#28925 GROUP_CONCAT inserts wrong separators for a ucs2 column Problem: separator was not converted to the result character set, so the result was a mixture of two different character sets, which was especially bad for UCS2. Fix: convert separator to the result character set. Many thanks for Martin Friebe for helping us again.
[8 Jun 2007 10:50]
Martin Friebe
Great. Had a look at the final version. An idea which is not related to the bug: After follow up the Strings internal mem handling. Wouldn't it be convenient to give it *optional* access to some of the root-memory pools (alloc_root)?
[22 Jun 2007 12:20]
Bugs System
A patch for this bug has been committed. After review, it may be pushed to the relevant source trees for release in the next version. You can access the patch from: http://lists.mysql.com/commits/29385 ChangeSet@1.2516, 2007-06-22 17:18:40+05:00, bar@mysql.com +6 -0 Bug#28925 GROUP_CONCAT inserts wrong separators for a ucs2 column Problem: separator was not converted to the result character set, so the result was a mixture of two different character sets, which was especially bad for UCS2. Fix: convert separator to the result character set.
[22 Jun 2007 13:33]
Alexander Barkov
Pushed into 5.0.46-rpl Pushed into 5.1.20-rpl
[3 Jul 2007 18:57]
Bugs System
Pushed into 5.0.46
[3 Jul 2007 18:57]
Bugs System
Pushed into 5.1.21-beta
[3 Jul 2007 19:47]
Paul DuBois
Noted in 5.0.46, 5.1.21 changelogs. For a ucs2 column, GROUP_CONCAT() did not convert separators to the result character set before inserting them, producing a result containing a mixture of two different character sets.