| Bug #71995 | LDML collation size is limited | ||
|---|---|---|---|
| Submitted: | 10 Mar 2014 15:56 | Modified: | 31 Mar 2014 13:36 |
| Reporter: | Александр Евстигнеев | Email Updates: | |
| Status: | Closed | Impact on me: | |
| Category: | MySQL Server: Charsets | Severity: | S2 (Serious) |
| Version: | 5.5.36, 5.5.37 | OS: | Windows (Win7, 64 bit, FreeBSD 9.2) |
| Assigned to: | Georgi Kodinov | CPU Architecture: | Any |
| Tags: | charset, collation, LDML, utf8 | ||
[11 Mar 2014 8:52]
Hartmut Holzgraefe
It worked fine for me when I created "Unicode distinct case-insensitive" which is a pretty big one: http://www.skysql.com/blogs/hartmut/adding-case-insensitive-distinct-unicode-collation ftp://ftp.skysql.com/downloads/hartmut/utf8_distinct_ci.xml You may be running into one of these though: * Bug #65593 "parse errors in loadable UCA / LDML collations are silently ignored" http://bugs.mysql.com/bug.php?id=65593 * bug #68142 "UCA / LDML parser does not complain about invalid/unsupported backslash sequence" http://bugs.mysql.com/bug.php?id=68142 * bug #68143 "Validity of LDML collations is checked too late" http://bugs.mysql.com/bug.php?id=68143 * bug #68144 "Collation name missing from log messages about LDML definition problems" http://bugs.mysql.com/bug.php?id=68144
[11 Mar 2014 11:18]
Александр Евстигнеев
Looks like you are right, but need to do some additional testing. In case you are right, there is a mistake in my LDML data, which prevents it from parsing from some position. Is there any way to find and fix it? I've checked manually and found nothing...
[15 Mar 2014 11:49]
Александр Евстигнеев
UTF text file to define characters ordering
Attachment: utfsource.txt (text/plain), 475 bytes.
[15 Mar 2014 11:50]
Александр Евстигнеев
Ruleset, generated from UTF source file
Attachment: rules.txt (text/plain), 3.33 KiB.
[15 Mar 2014 12:00]
Александр Евстигнеев
Collation testing query
Attachment: test.sql (application/octet-stream, text), 19.99 KiB.
[15 Mar 2014 12:01]
Александр Евстигнеев
Ok, i've done some additional testings and research. 1) Your huge collation file from ftp://ftp.skysql.com/downloads/hartmut/utf8_distinct_ci.xml not working in version i've specified. Tested on Windows 7 64-bits. Symptoms are the same. 2) I wrote a little script in Perl that converts file like: http://bugs.mysql.com/file.php?id=21172&bug_id=71995 To the rules for collation files. Here is what i've got: http://bugs.mysql.com/file.php?id=21173&bug_id=71995 (in my case collation named utf_test_ci and have id = 253) So it's pretty correct and there are no human mistakes. Also i've wrote a script, that generates a test query from the same file: http://bugs.mysql.com/file.php?id=21174&bug_id=71995 What we have: It seems that only first 96 rules are working fine. Others are ignored. If you comment few 20 rules without breaking format, you'll see that next 20 rules will start working. So now i'm really sure about kinda buffering problem or rules number limit.
[15 Mar 2014 12:32]
Александр Евстигнеев
Tested this collation section with mySQL 5.6.16 - Works great, so it's defenitely a problem of 5.5.36
[15 Mar 2014 12:34]
Александр Евстигнеев
Ocassionaly changed version and OSes of original ticket. Sorry.
[17 Mar 2014 20:14]
Sveta Smirnova
Thank you for the report. Verified as described. Version 5.6 is not affected.

Description: It seems to me, that there is a very tight buffer for LDML collation instructions from Index.xml When I defining just few rules, they works fine, but, if i'll add a lot of rules before, the lates stops working. How to repeat: My task was to force mysql to treat different variations of latin letters as different letters. 1) Insert collation into Index.xml (in utf8 section): <collation name="utf8_test_ci" id="253"> <rules> <reset>O</reset> <p>\u00D6</p><t>\u00F6</t> </rules> </collation> 2) Restart mySQL server 3) Execute query: select "O" = "Ö" as n ,"O" collate utf8_test_ci = "Ö" as c; +---+---+ | n | c | +---+---+ | 1 | 0 | +---+---+ Works fine! 4) Now replace collation part with a large block: <collation name="utf8_test_ci" id="253"> <rules> <reset>\u0415</reset> <p>\u0401</p><t>\u0451</t> <reset>A</reset> <p>\u00C0</p><t>\u00E0</t> <p>\u00C1</p><t>\u00E1</t> <p>\u00C2</p><t>\u00E2</t> <p>\u00C3</p><t>\u00E3</t> <p>\u00C4</p><t>\u00E4</t> <p>\u00C5</p><t>\u00E5</t> <p>\u00C6</p><t>\u00E6</t> <p>\u0100</p><t>\u0101</t> <p>\u0102</p><t>\u0103</t> <p>\u0104</p><t>\u0105</t> <reset>C</reset> <p>\u00C7</p><t>\u00E7</t> <p>\u0106</p><t>\u0107</t> <p>\u0108</p><t>\u0109</t> <p>\u010a</p><t>\u010b</t> <p>\u010c</p><t>\u010d</t> <reset>D</reset> <p>\u00D0</p><t>\u00F0</t> <p>\u010E</p><t>\u010F</t> <p>\u0110</p><t>\u0111</t> <reset>E</reset> <p>\u00C8</p><t>\u00E8</t> <p>\u00C9</p><t>\u00E9</t> <p>\u00CA</p><t>\u00EA</t> <p>\u00CB</p><t>\u00EB</t> <p>\u0112</p><t>\u0113</t> <p>\u0114</p><t>\u0115</t> <p>\u0116</p><t>\u0117</t> <p>\u0118</p><t>\u0119</t> <p>\u011A</p><t>\u011B</t> <reset>G</reset> <p>\u011C</p><t>\u011D</t> <p>\u011E</p><t>\u011F</t> <p>\u0120</p><t>\u0121</t> <p>\u0122</p><t>\u0123</t> <reset>H</reset> <p>\u0124</p><t>\u0125</t> <p>\u0126</p><t>\u0127</t> <reset>I</reset> <p>\u00CC</p><t>\u00EC</t> <p>\u00CD</p><t>\u00ED</t> <p>\u00CE</p><t>\u00EE</t> <p>\u00CF</p><t>\u00EF</t> <p>\u0128</p><t>\u0129</t> <p>\u012A</p><t>\u012B</t> <p>\u012C</p><t>\u012D</t> <p>\u012E</p><t>\u012F</t> <p>\u0130</p><t>\u0131</t> <reset>J</reset> <p>\u0132</p><t>\u0133</t> <p>\u0134</p><t>\u0135</t> <reset>K</reset> <p>\u0136</p><t>\u0137</t> <p>\u0138</p> <reset>L</reset> <p>\u0139</p><t>\u013A</t> <p>\u013B</p><t>\u013C</t> <p>\u013D</p><t>\u013E</t> <p>\u013F</p><t>\u0140</t> <p>\u0141</p><t>\u0142</t> <reset>N</reset> <p>\u00D1</p><t>\u00F1</t> <p>\u0143</p><t>\u0144</t> <p>\u0145</p><t>\u0146</t> <p>\u0147</p><t>\u0148</t> <p>\u0149</p> <p>\u014A</p><t>\u014B</t> <reset>O</reset> <p>\u00D2</p><t>\u00F2</t> <p>\u00D3</p><t>\u00F3</t> <p>\u00D4</p><t>\u00F4</t> <p>\u00D5</p><t>\u00F5</t> <p>\u00D6</p><t>\u00F6</t> <p>\u00D8</p><t>\u00F8</t> <p>\u014C</p><t>\u014D</t> <p>\u014E</p><t>\u014F</t> <p>\u0150</p><t>\u0151</t> <p>\u0152</p><t>\u0153</t> <reset>R</reset> <p>\u0154</p><t>\u0155</t> <p>\u0156</p><t>\u0157</t> <p>\u0158</p><t>\u0159</t> <reset>S</reset> <p>\u1e9e</p><t>\u00DF</t> <p>\u015A</p><t>\u015B</t> <p>\u015C</p><t>\u015D</t> <p>\u015E</p><t>\u015F</t> <p>\u0160</p><t>\u0161</t> <reset>T</reset> <p>\u0162</p><t>\u0163</t> <p>\u0164</p><t>\u0165</t> <p>\u0166</p><t>\u0167</t> <reset>U</reset> <p>\u00D9</p><t>\u00F9</t> <p>\u00DA</p><t>\u00FA</t> <p>\u00DB</p><t>\u00FB</t> <p>\u00DC</p><t>\u00FC</t> <p>\u0168</p><t>\u0169</t> <p>\u016A</p><t>\u016B</t> <p>\u016C</p><t>\u016D</t> <p>\u016E</p><t>\u016F</t> <p>\u0170</p><t>\u0171</t> <p>\u0172</p><t>\u0173</t> <reset>W</reset> <p>\u0174</p><t>\u0175</t> <reset>Y</reset> <p>\u00DD</p><t>\u00FD</t> <p>\u0176</p><t>\u0177</t> <p>\u0178</p> <reset>Z</reset> <p>\u0179</p><t>\u017A</t> <p>\u017B</p><t>\u017C</t> <p>\u017D</p><t>\u017E</t> </rules> </collation> Restart mySQL server and try query again. Collation won't work. Also no error messages appears in server log or during query. Some tests shows, that first rules works fine, but from some point their stops. It seems that mySQL reads some instructions and skips other. Codes for testing can be checked here: http://www.fileformat.info/info/unicode/char/0136/index.htm http://www.fileformat.info/info/unicode/char/0136/index.htm Suggested fix: Not sure how it works internally, but if there is a buffer, it should be increased. Drastically.