Bug #71995 LDML collation size is limited
Submitted: 10 Mar 2014 15:56 Modified: 31 Mar 2014 13:36
Reporter: Александр Евстигнеев Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Charsets Severity:S2 (Serious)
Version:5.5.36, 5.5.37 OS:Windows (Win7, 64 bit, FreeBSD 9.2)
Assigned to: Georgi Kodinov CPU Architecture:Any
Tags: charset, collation, LDML, utf8

[10 Mar 2014 15:56] Александр Евстигнеев
Description:
It seems to me, that there is a very tight buffer for LDML collation instructions from Index.xml

When I defining just few rules, they works fine, but, if i'll add a lot of rules before, the lates stops working.

How to repeat:
My task was to force mysql to treat different variations of latin letters as different letters.

1) Insert collation into Index.xml (in utf8 section):

<collation name="utf8_test_ci" id="253">
    <rules>
        <reset>O</reset>
                <p>\u00D6</p><t>\u00F6</t>
    </rules>
</collation>

2) Restart mySQL server
3) Execute query: 
select 
"O" = "Ö" as n
,"O" collate utf8_test_ci = "Ö" as c;

+---+---+
| n | c |
+---+---+
| 1 | 0 |
+---+---+

Works fine!

4) Now replace collation part with a large block: 

<collation name="utf8_test_ci" id="253">
	<rules>
		<reset>\u0415</reset>
			<p>\u0401</p><t>\u0451</t>
		<reset>A</reset>
			<p>\u00C0</p><t>\u00E0</t>
			<p>\u00C1</p><t>\u00E1</t>
			<p>\u00C2</p><t>\u00E2</t>
			<p>\u00C3</p><t>\u00E3</t>
			<p>\u00C4</p><t>\u00E4</t>
			<p>\u00C5</p><t>\u00E5</t>
			<p>\u00C6</p><t>\u00E6</t>
			<p>\u0100</p><t>\u0101</t>
			<p>\u0102</p><t>\u0103</t>
			<p>\u0104</p><t>\u0105</t>
		<reset>C</reset>
			<p>\u00C7</p><t>\u00E7</t>
			<p>\u0106</p><t>\u0107</t>
			<p>\u0108</p><t>\u0109</t>
			<p>\u010a</p><t>\u010b</t>
			<p>\u010c</p><t>\u010d</t>
		<reset>D</reset>
			<p>\u00D0</p><t>\u00F0</t>
			<p>\u010E</p><t>\u010F</t>
			<p>\u0110</p><t>\u0111</t>
		<reset>E</reset>
			<p>\u00C8</p><t>\u00E8</t>
			<p>\u00C9</p><t>\u00E9</t>
			<p>\u00CA</p><t>\u00EA</t>
			<p>\u00CB</p><t>\u00EB</t>
			<p>\u0112</p><t>\u0113</t>
			<p>\u0114</p><t>\u0115</t>
			<p>\u0116</p><t>\u0117</t>
			<p>\u0118</p><t>\u0119</t>
			<p>\u011A</p><t>\u011B</t>
		<reset>G</reset>
			<p>\u011C</p><t>\u011D</t>
			<p>\u011E</p><t>\u011F</t>
			<p>\u0120</p><t>\u0121</t>
			<p>\u0122</p><t>\u0123</t>
		<reset>H</reset>
			<p>\u0124</p><t>\u0125</t>
			<p>\u0126</p><t>\u0127</t>
		<reset>I</reset>
			<p>\u00CC</p><t>\u00EC</t>
			<p>\u00CD</p><t>\u00ED</t>
			<p>\u00CE</p><t>\u00EE</t>
			<p>\u00CF</p><t>\u00EF</t>
			<p>\u0128</p><t>\u0129</t>
			<p>\u012A</p><t>\u012B</t>
			<p>\u012C</p><t>\u012D</t>
			<p>\u012E</p><t>\u012F</t>
			<p>\u0130</p><t>\u0131</t>
		<reset>J</reset>
			<p>\u0132</p><t>\u0133</t>
			<p>\u0134</p><t>\u0135</t>
		<reset>K</reset>
			<p>\u0136</p><t>\u0137</t>
			<p>\u0138</p>
		<reset>L</reset>
			<p>\u0139</p><t>\u013A</t>
			<p>\u013B</p><t>\u013C</t>
			<p>\u013D</p><t>\u013E</t>
			<p>\u013F</p><t>\u0140</t>
			<p>\u0141</p><t>\u0142</t>
		<reset>N</reset>
			<p>\u00D1</p><t>\u00F1</t>
			<p>\u0143</p><t>\u0144</t>
			<p>\u0145</p><t>\u0146</t>
			<p>\u0147</p><t>\u0148</t>
			<p>\u0149</p>
			<p>\u014A</p><t>\u014B</t>
		<reset>O</reset>
			<p>\u00D2</p><t>\u00F2</t>
			<p>\u00D3</p><t>\u00F3</t>
			<p>\u00D4</p><t>\u00F4</t>
			<p>\u00D5</p><t>\u00F5</t>
			<p>\u00D6</p><t>\u00F6</t>
			<p>\u00D8</p><t>\u00F8</t>
			<p>\u014C</p><t>\u014D</t>
			<p>\u014E</p><t>\u014F</t>
			<p>\u0150</p><t>\u0151</t>
			<p>\u0152</p><t>\u0153</t>
		<reset>R</reset>
			<p>\u0154</p><t>\u0155</t>
			<p>\u0156</p><t>\u0157</t>
			<p>\u0158</p><t>\u0159</t>
		<reset>S</reset>
			<p>\u1e9e</p><t>\u00DF</t>
			<p>\u015A</p><t>\u015B</t>
			<p>\u015C</p><t>\u015D</t>
			<p>\u015E</p><t>\u015F</t>
			<p>\u0160</p><t>\u0161</t>
		<reset>T</reset>
			<p>\u0162</p><t>\u0163</t>
			<p>\u0164</p><t>\u0165</t>
			<p>\u0166</p><t>\u0167</t>
		<reset>U</reset>
			<p>\u00D9</p><t>\u00F9</t>
			<p>\u00DA</p><t>\u00FA</t>
			<p>\u00DB</p><t>\u00FB</t>
			<p>\u00DC</p><t>\u00FC</t>
			<p>\u0168</p><t>\u0169</t>
			<p>\u016A</p><t>\u016B</t>
			<p>\u016C</p><t>\u016D</t>
			<p>\u016E</p><t>\u016F</t>
			<p>\u0170</p><t>\u0171</t>
			<p>\u0172</p><t>\u0173</t>
		<reset>W</reset>
			<p>\u0174</p><t>\u0175</t>
		<reset>Y</reset>
			<p>\u00DD</p><t>\u00FD</t>
			<p>\u0176</p><t>\u0177</t>
			<p>\u0178</p>
		<reset>Z</reset>
			<p>\u0179</p><t>\u017A</t>
			<p>\u017B</p><t>\u017C</t>
			<p>\u017D</p><t>\u017E</t>
	</rules>
</collation>

Restart mySQL server and try query again. Collation won't work. Also no error messages appears in server log or during query.

Some tests shows, that first rules works fine, but from some point their stops. It seems that mySQL reads some instructions and skips other.

Codes for testing can be checked here: 
http://www.fileformat.info/info/unicode/char/0136/index.htm
http://www.fileformat.info/info/unicode/char/0136/index.htm

Suggested fix:
Not sure how it works internally, but if there is a buffer, it should be increased. Drastically.
[11 Mar 2014 8:52] Hartmut Holzgraefe
It worked fine for me when I created "Unicode distinct case-insensitive" which is a pretty big one:

http://www.skysql.com/blogs/hartmut/adding-case-insensitive-distinct-unicode-collation

ftp://ftp.skysql.com/downloads/hartmut/utf8_distinct_ci.xml

You may be running into one of these though:

* Bug #65593 "parse errors in loadable UCA / LDML collations are silently ignored"
http://bugs.mysql.com/bug.php?id=65593

* bug #68142 "UCA / LDML parser does not complain about invalid/unsupported backslash sequence"
http://bugs.mysql.com/bug.php?id=68142

* bug #68143 "Validity of LDML collations is checked too late"
http://bugs.mysql.com/bug.php?id=68143

* bug #68144 "Collation name missing from log messages about LDML definition problems"
http://bugs.mysql.com/bug.php?id=68144
[11 Mar 2014 11:18] Александр Евстигнеев
Looks like you are right, but need to do some additional testing. 

In case you are right, there is a mistake in my LDML data, which prevents it from parsing from some position. Is there any way to find and fix it? I've checked manually and found nothing...
[15 Mar 2014 11:49] Александр Евстигнеев
UTF text file to define characters ordering

Attachment: utfsource.txt (text/plain), 475 bytes.

[15 Mar 2014 11:50] Александр Евстигнеев
Ruleset, generated from UTF source file

Attachment: rules.txt (text/plain), 3.33 KiB.

[15 Mar 2014 12:00] Александр Евстигнеев
Collation testing query

Attachment: test.sql (application/octet-stream, text), 19.99 KiB.

[15 Mar 2014 12:01] Александр Евстигнеев
Ok, i've done some additional testings and research.

1) Your huge collation file from 
ftp://ftp.skysql.com/downloads/hartmut/utf8_distinct_ci.xml
not working in version i've specified. 
Tested on Windows 7 64-bits. 
Symptoms are the same.

2) I wrote a little script in Perl that converts file like:

http://bugs.mysql.com/file.php?id=21172&bug_id=71995

To the rules for collation files. Here is what i've got:

http://bugs.mysql.com/file.php?id=21173&bug_id=71995

(in my case collation named utf_test_ci and have id = 253)

So it's pretty correct and there are no human mistakes.

Also i've wrote a script, that generates a test query from the same file:

http://bugs.mysql.com/file.php?id=21174&bug_id=71995

What we have:

It seems that only first 96 rules are working fine. Others are ignored.

If you comment few 20 rules without breaking format, you'll see that next 20 rules will start working.

So now i'm really sure about kinda buffering problem or rules number limit.
[15 Mar 2014 12:32] Александр Евстигнеев
Tested this collation section with mySQL 5.6.16 - Works great, so it's defenitely a problem of 5.5.36
[15 Mar 2014 12:34] Александр Евстигнеев
Ocassionaly changed version and OSes of original ticket. Sorry.
[17 Mar 2014 20:14] Sveta Smirnova
Thank you for the report.

Verified as described. Version 5.6 is not affected.
[31 Mar 2014 13:36] Georgi Kodinov
Thank you for your bug report. This issue has already been fixed in the latest released version of that product, which you can download at

  http://www.mysql.com/downloads/

Fixed in 5.6 by WL#2673 and WL#5624. Pleae consider upgrading to 5.6