Bug #65593 parse errors in loadable UCA / LDML collations are silently ignored
Submitted: 12 Jun 2012 21:22
Modified: 22 Jan 2013 15:20
Reporter: Hartmut Holzgraefe
Status: Closed
Category: MySQL Server: Charsets
Severity: S3 (Non-critical)
Version: 5.5.21, 5.5.23
OS: Any
Assigned to:
CPU Architecture: Any

[12 Jun 2012 21:22] Hartmut Holzgraefe
Description:
When defining a UCA collation using LDML syntax in share/charsets/Index.xml, any syntax error in the collation definition makes the collation unavailable after a mysqld restart, and no startup error message about the parse failure is written anywhere.

How to repeat:
* Add the utf8_phone_ci example from 
http://dev.mysql.com/doc/refman/5.5/en/ldml-collation-example.html

* restart mysqld and verify that columns using utf8_phone_ci can be used

* now add a parse error, e.g. by simply removing the 'u' after the backslash in one of the Unicode code point definitions, replacing

  <reset>\u0000</reset>

with

  <reset>\0000</reset>

* restart the server once more

* verify that utf8_phone_ci can't be used anymore

* check the mysqld error log for any collation related error message

=> there is none

Suggested fix:
Report errors found while parsing the loadable collations during startup in the mysqld error log.
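
Until the server reports such errors itself, a definition can at least be checked for XML well-formedness before mysqld is restarted. The following is only a minimal sketch in Python and not how the server parses the file: the function name `check_well_formed` and the sample fragments are made up for illustration, and it catches malformed XML only (such as a mismatched closing tag), not LDML-level mistakes.

```python
import xml.etree.ElementTree as ET


def check_well_formed(xml_text):
    """Return None if xml_text is well-formed XML, else an error string."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError as exc:
        # ParseError.position is a (line, column) tuple.
        line, column = exc.position
        return "XML parse error at line %d, column %d: %s" % (line, column, exc)
    return None


# Hypothetical fragment; a raw string keeps the backslash literal.
GOOD = r"""<collation name="utf8_test" id="253">
  <rules>
    <reset>\u0000</reset>
  </rules>
</collation>"""

# Same fragment with a misspelled closing tag.
BAD = GOOD.replace("</reset>", "</rest>")

if __name__ == "__main__":
    for label, text in (("good", GOOD), ("bad", BAD)):
        print(label, "->", check_well_formed(text) or "ok")
```

Running such a check against a copy of the edited definition before the restart would point at the offending line, which is the kind of information missing from the error log here.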
[13 Jun 2012 6:07] Valeriy Kravchuk
Thank you for the bug report. Verified with 5.5.23 on Windows also:

...
mysql> SELECT * FROM phonebook ORDER BY phone;
+-------+--------------------+
| name  | phone              |
+-------+--------------------+
| Sanja | +380 (912) 8008005 |
| Bar   | +7-912-800-80-01   |
| Svoj  | +7 912 800 80 02   |
| Ramil | (7912) 800 80 03   |
| Hf    | +7 (912) 800 80 04 |
+-------+--------------------+
5 rows in set (0.03 sec)

mysql> exit
Bye

C:\Program Files\MySQL\MySQL Server 5.5\bin>net stop mysql55
The MySQL55 service is stopping.
The MySQL55 service was stopped successfully.

C:\Program Files\MySQL\MySQL Server 5.5\bin>net start mysql55
The MySQL55 service is starting.
The MySQL55 service was started successfully.

C:\Program Files\MySQL\MySQL Server 5.5\bin>mysql -uroot -proot -P3312 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 1
Server version: 5.5.23 MySQL Community Server (GPL)

Copyright (c) 2000, 2011, Oracle and/or its affiliates. All rights reserved.

Oracle is a registered trademark of Oracle Corporation and/or its
affiliates. Other names may be trademarks of their respective
owners.

Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.

mysql> SELECT * FROM phonebook ORDER BY phone;
ERROR 1273 (HY000): Unknown collation 'utf8_phone_ci' in table 'phonebook' definition

But no errors in the error log:

120613  9:04:55 [Note] Event Scheduler: Loaded 0 events
120613  9:04:55 [Note] C:\Program Files\MySQL\MySQL Server 5.5\bin\mysqld: ready for connections.
Version: '5.5.23'  socket: ''  port: 3312  MySQL Community Server (GPL)
[29 Jul 2012 23:17] Paul DuBois
Noted in 5.6.6 changelog.

Parse errors that occurred while loading UCA or LDML collation
descriptions were not written to the error log.
[15 Jan 2013 12:58] Hartmut Holzgraefe
While this is fixed for actual XML parsing problems (e.g. a misspelled tag), it still isn't for the given "How to repeat" example:

* \0000 (with the 'u' missing) is accepted even though http://dev.mysql.com/doc/refman/5.6/en/ldml-rules.html says that characters can be written literally or in \u#### format only ... I'm not sure how \0 is going to be interpreted here, maybe it is somehow valid but not mentioned?

* Anyway, the collation using just that single <reset>\0000</reset> rule is accepted, it shows in SHOW COLLATION just fine, but when actually trying to use it:

> SHOW COLLATION LIKE 'utf8_test';
+-----------+---------+-----+---------+----------+---------+
| Collation | Charset | Id  | Default | Compiled | Sortlen |
+-----------+---------+-----+---------+----------+---------+
| utf8_test | utf8    | 253 |         |          |       8 |
+-----------+---------+-----+---------+----------+---------+

> create table t1(id int primary key,d char collate utf8_revdig_ci);
ERROR 1273 (HY000): Unknown collation: 'utf8_revdig_ci'

The only extra error message now is a very obscure 

  Shift expected at ''

both in the output of SHOW WARNINGS and in the error log ...
[15 Jan 2013 13:01] Hartmut Holzgraefe
The utf8_test / utf8_revdig name mismatch was a copy/paste error on my side ... in the actual test cases the name was either utf8_test or utf8_revdig consistently ...
[15 Jan 2013 13:13] Hartmut Holzgraefe
Ok, the actual error is that no shift rules (<p>, <s>, <t>) are given after the reset rule, regardless of its content. The same effect can be seen when using

  <collation name="utf8_test" id="253">
    <rules>
       <reset>A</reset>
    </rules>
  </collation>

So things come down to these distinct problems:

* it is not clear whether a backslash in front of anything other than a 'u' is valid at all, and how it is interpreted if it is indeed valid syntax ...

* a <reset> not followed by a shift rule is not supported, but is reported in a very obscure way at best (mentioning neither the name of the collation nor the content of the <reset> rule, so effectively just saying "something's wrong somewhere ... or so ...")

* validity of collations is not checked at load time but only later at use time
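
The first two points can be sketched as a standalone check that runs before the server is restarted and names the collation in every message. This is a rough Python sketch under stated assumptions, not the server's logic: the name `lint_collation` is made up, a backslash is treated as valid only when it starts a \uXXXX escape, and a <reset> counts as "followed by a shift rule" when the next rule element is anything other than another <reset>.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a backslash is only valid as the start of a \uXXXX escape.
BAD_ESCAPE = re.compile(r"\\(?!u[0-9A-Fa-f]{4})")


def lint_collation(xml_text):
    """Return a list of problem descriptions for one <collation> element."""
    collation = ET.fromstring(xml_text)
    name = collation.get("name", "?")
    problems = []
    for rules in collation.iter("rules"):
        children = list(rules)
        for index, rule in enumerate(children):
            text = rule.text or ""
            if BAD_ESCAPE.search(text):
                problems.append(
                    "%s: <%s>%s</%s> has a backslash not followed by uXXXX"
                    % (name, rule.tag, text, rule.tag))
            if rule.tag == "reset":
                is_last = index + 1 >= len(children)
                # Assumption: any non-reset rule element counts as a shift.
                if is_last or children[index + 1].tag == "reset":
                    problems.append(
                        "%s: <reset>%s</reset> is not followed by a shift rule"
                        % (name, text))
    return problems


# Hypothetical definitions; raw strings keep the backslashes literal.
RESET_ONLY = r"""<collation name="utf8_test" id="253">
  <rules><reset>A</reset></rules>
</collation>"""

BAD_BACKSLASH = r"""<collation name="utf8_test" id="253">
  <rules><reset>\0000</reset><p>\u0020</p></rules>
</collation>"""

if __name__ == "__main__":
    for definition in (RESET_ONLY, BAD_BACKSLASH):
        for problem in lint_collation(definition):
            print(problem)
```

Each message carries the collation name and the content of the offending rule, which is what the "Shift expected at ''" message lacks.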
[16 Jan 2013 13:18] Erlend Dahl
Hartmut, if you still have concerns, please file a new bug. Continuing the discussion here will just make us lose track of the issue.
[22 Jan 2013 15:20] Hartmut Holzgraefe
Ok, refiled as

* bug #68142 "UCA / LDML parser does not complain about invalid/unsupported backslash sequence"

* bug #68143 "Validity of LDML collations is checked too late"

* bug #68144 "Collation name missing from log messages about LDML definition problems"