Bug #38356 | myisam_ftdump output of unicode characters such as 'ć' unreadable | ||
---|---|---|---|
Submitted: | 24 Jul 2008 21:04 | Modified: | 18 Jul 2014 19:44 |
Reporter: | Miguel K | Email Updates: | |
Status: | Verified | Impact on me: | |
Category: | MySQL Server: FULLTEXT search | Severity: | S3 (Non-critical) |
Version: | 5.5 | OS: | Any |
Assigned to: | CPU Architecture: | Any |
[24 Jul 2008 21:04]
Miguel K
[25 Jul 2008 9:17]
Susanne Ebrecht
Verified with actual 5.0 bzr tree. I used utf8 terminal. SET NAMES UTF; CREATE database bug38356 character set utf8; USE bug38356; CREATE table t(c char(100) character set utf8, v varchar(100) character set utf8, t text character set utf8); CREATE fulltext index ic on t(c); CREATE fulltext index iv on t(v); CREATE fulltext index it on t(t); INSERT into t values('bär','bär','bär'),('ćwiczyć','ćwiczyć','ćwiczyć'); SELECT length(c), length(v), length(t) from t; Length is 4 for the 'bär' and 9 for the 'ćwiczyć'. So data are stored in UTF8. $ ./bin/myisam_ftdump /home/myhome/mysql50bzr/var/bug38356/t 0 Longest word: 9 chars (��wiczy��) My terminal is still set to utf8 .... The same result with index num 1 and 2. Now I switched terminal to ISO-8859-15: Longest word: 9 chars (äwiczyä) Why the hell an 'ä'? also this two rectangles aren't displayed in my terminal I just could see them after copy paste here. It should be 'ćwiczyć' and not 'äwiczyä' The output is neither utf8 nor latin1 ... it looks like a double encoded utf8.
[25 Jul 2008 9:20]
Susanne Ebrecht
Same behaviour on MySQL 5.1 bzr tree.
[25 Jul 2008 15:57]
Miguel K
After a little more investigation, I see that it is outputting the Hex value of that letter instead of the utf8 value. SELECT HEX('ć'); gives C487 Looking at the hex output of the myisam_ftdump, you see that it is encoded as C487 as well.
[1 Dec 2009 21:49]
Michael Baldwin
I'm still experiencing this bug, but I'm not finding characters to be their simple hex values. Rather, two-bit sequences are having their third bit set. For example, "ä" is unicode 228 (binary 11100100). To make UTF-8, we substitue the x's and y's: 110yyyxx 10xxxxxx To get: 11000011 10100100 But what MySQL is putting in the files is: 11100011 10100100 I wrote my own utility to change the initial "111" to "110" in all two-byte sequences which start with a byte >127, and it fixed all the characters in a variety of European languages to be correct UTF-8.
[1 Dec 2009 22:58]
Sergei Golubchik
it's because of my_casedn_str() in myisam_ftdump.c, which was supposed to make the output nicer, but only works for latint1 :(
[11 Feb 2013 16:01]
Miguel K
Wow, nearly five years later, I am running into the same bug! It doesn't seem like a difficult one to fix. Still a problem in 5.5.
[18 Jul 2014 19:44]
Miguel K
After 6 years, I finally have a work-around!!! The problem is that it outputs it in the wrong encoding and additionally has changed the wrong encoding to lowercase. UTF-8 badly encoded as Latin1 would show "niño" as: niño But this program outputs the à as an ã: niã±o The work-around is to change the text to uppercase, convert it, and then back to lowercase. It only took me six years to figure it out. I do it in Notepad++ Highlight text > Edit > Convert to > Uppercase Encoding > Encode in UTF-8 Highlight text > Edit > Convert to > Lowercase If you have imported it into MySQL, use this (Spanish, Greek, Arabic below): select lower(convert(binary convert(upper("niã±a") using latin1) using utf8)); select lower(convert(binary convert(upper("îºî»î¬ïˆî±") using latin1) using utf8)); select lower(convert(binary convert(upper("øªøø¯ùŠø¯") using latin1) using utf8)); The scary thing is that this might be how full-text index data for utf8_unicode_ci is stored in mysql: lowercase versions of a misencoding to latin1. Happy sixth birthday, bug!
[11 Nov 2015 21:55]
Pavlos Ponos
I'm currently have the same issue in 5.6. I exported the myisam_ftdump from a full-text index I had and saved it in a txt file. txt file has the latin1 encoding and converted to utf-8 with the following command: iconv -f Latin1 -t utf-8 ~/Desktop/test.txt > ~/Desktop/1.txt By using the following command it perfectly worked for cases where the character was not present. lower(convert(binary convert(upper("Χαλκιδική") using latin1) using utf8)) By feeding back the mysql with all instances one-by-one (or with a script), I can read the index entries, yet what about the ones with the which seems to be totally corrupted? Any idea or workaround is more than welcome!
[11 Nov 2015 22:01]
Pavlos Ponos
Obviously the corrupted character was not displayed, you can think about a square symbol with 4 letters and/or numbers.
[29 Apr 2016 10:14]
Claudio Fior
Same problem, solved with a php script $wordH = fopen(__DIR__.'/words.txt', 'r'); while (($rawWord = fgets($wordH)) !== false) { $rawWord = trim($rawWord); $word = ''; $wrongOrd = array(); for ($c =0; $c < strlen($rawWord); $c++) { $char = $rawWord[$c]; if (ord($char)>127) { $wrongOrd[] = ord($char); continue; } else if (sizeof($wrongOrd)>0) { $char = fixChars($wrongOrd); $wrongOrd = array(); } $word .= $char; } if (sizeof($wrongOrd)>0) { $word .= fixChars($wrongOrd); } $words[] = $word; } function fixChars ($wrongOrd) { $char = ''; foreach($wrongOrd as $pos=>$ord) { $bin = decbin($ord); if ($pos == 0 && strlen($bin) == 8 && substr($bin,0,3) == '111') { $bin = '110'.substr($bin,3); } $char .= chr(bindec($bin)); } return $char; }