Bug #38356 myisam_ftdump output of unicode characters such as 'ć' unreadable
Submitted: 24 Jul 2008 21:04 Modified: 18 Jul 2014 19:44
Reporter: Miguel K Email Updates:
Status: Verified Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.5 OS:Any
Assigned to: CPU Architecture:Any

[24 Jul 2008 21:04] Miguel K
Description:
myisam_ftdump -d outputs data in MySQL's internal encoding, not utf8 or other useful encoding.

For instance, a word with the letter 'ć' is not output in an encoding that is useful.

How to repeat:
Create a table with a utf8 column with a full text index.

Put this word in that column:
ćwiczyć

Now use myisam_ftdump to retrieve the contents of the full text index.
You get:
ä‡wiczyä‡
instead of
ćwiczyć

Suggested fix:
Output data in utf8 or other useful encoding.
[25 Jul 2008 9:17] Susanne Ebrecht
Verified with actual 5.0 bzr tree.

I used utf8 terminal.

SET NAMES UTF;

CREATE database bug38356 character set utf8;
USE bug38356;
CREATE table t(c char(100) character set utf8, v varchar(100) character set utf8, t text character set utf8);
CREATE fulltext index ic on t(c);
CREATE fulltext index iv on t(v);
CREATE fulltext index it on t(t);

INSERT into t values('bär','bär','bär'),('ćwiczyć','ćwiczyć','ćwiczyć');

SELECT length(c), length(v), length(t) from t;

Length is 4 for the 'bär' and 9 for the 'ćwiczyć'. So data are stored in UTF8.

$ ./bin/myisam_ftdump /home/myhome/mysql50bzr/var/bug38356/t 0

Longest word: 9 chars (��wiczy��)

My terminal is still set to utf8 ....

The same result with index num 1 and 2.

Now I switched terminal to ISO-8859-15:

Longest word: 9 chars (ä‡wiczyä‡)

Why the hell an 'ä'? also this two rectangles aren't displayed in my terminal I just could see them after copy paste here.

It should be 'ćwiczyć' and not 'ä‡wiczyä‡'

The output is neither utf8 nor latin1 ... it looks like a double encoded utf8.
[25 Jul 2008 9:20] Susanne Ebrecht
Same behaviour on MySQL 5.1 bzr tree.
[25 Jul 2008 15:57] Miguel K
After a little more investigation, I see that it is outputting the Hex value of that letter instead of the utf8 value.

SELECT HEX('ć');
gives
C487

Looking at the hex output of the myisam_ftdump, you see that it is encoded as C487 as well.
[1 Dec 2009 21:49] Michael Baldwin
I'm still experiencing this bug, but I'm not finding characters to be their simple hex values.

Rather, two-bit sequences are having their third bit set.

For example, "ä" is unicode 228 (binary 11100100). 

To make UTF-8, we substitue the x's and y's:

110yyyxx 10xxxxxx

To get:

11000011 10100100

But what MySQL is putting in the files is:

11100011 10100100

I wrote my own utility to change the initial "111" to "110" in all two-byte sequences which start with a byte >127, and it fixed all the characters in a variety of European languages to be correct UTF-8.
[1 Dec 2009 22:58] Sergei Golubchik
it's because of my_casedn_str() in myisam_ftdump.c, which was supposed to make the output nicer, but only works for latint1 :(
[11 Feb 2013 16:01] Miguel K
Wow, nearly five years later, I am running into the same bug!

It doesn't seem like a difficult one to fix.

Still a problem in 5.5.
[18 Jul 2014 19:44] Miguel K
After 6 years, I finally have a work-around!!!

The problem is that it outputs it in the wrong encoding and additionally has changed the wrong encoding to lowercase.

UTF-8 badly encoded as Latin1 would show "niño" as:
niño

But this program outputs the à as an ã:
niã±o

The work-around is to change the text to uppercase, convert it, and then back to lowercase.  It only took me six years to figure it out.

I do it in Notepad++
Highlight text > Edit > Convert to > Uppercase
Encoding > Encode in UTF-8
Highlight text > Edit > Convert to > Lowercase

If you have imported it into MySQL, use this (Spanish, Greek, Arabic below):
select lower(convert(binary convert(upper("niã±a") using latin1) using utf8));
select lower(convert(binary convert(upper("îºî»î¬ïˆî±") using latin1) using utf8));
select lower(convert(binary convert(upper("øªø­ø¯ùŠø¯") using latin1) using utf8));

The scary thing is that this might be how full-text index data for utf8_unicode_ci is stored in mysql: lowercase versions of a misencoding to latin1.

Happy sixth birthday, bug!
[11 Nov 2015 21:55] Pavlos Ponos
I'm currently have the same issue in 5.6.

I exported the myisam_ftdump from a full-text index I had and saved it in a txt file.
txt file has the latin1 encoding and converted to utf-8 with the following command:
iconv -f Latin1 -t utf-8 ~/Desktop/test.txt > ~/Desktop/1.txt

By using the following command it perfectly worked for cases where the character „ was not present.
lower(convert(binary convert(upper("Χαλκιδική") using latin1) using utf8))

By feeding back the mysql with all instances one-by-one (or with a script), I can read the index entries, yet what about the ones with the „ which seems to be totally corrupted?

Any idea or workaround is more than welcome!
[11 Nov 2015 22:01] Pavlos Ponos
Obviously the corrupted character was not displayed, you can think about a square symbol with 4 letters and/or numbers.
[29 Apr 2016 10:14] Claudio Fior
Same problem, solved with a php script

$wordH = fopen(__DIR__.'/words.txt', 'r');
while (($rawWord = fgets($wordH)) !== false) {
   $rawWord = trim($rawWord);
   $word = '';
   $wrongOrd = array();
   for ($c =0; $c < strlen($rawWord); $c++) {
         $char = $rawWord[$c];
         if (ord($char)>127) {
            $wrongOrd[] = ord($char);
            continue;
         } else if (sizeof($wrongOrd)>0) {
           $char = fixChars($wrongOrd);
           $wrongOrd = array();         
         }
         $word .= $char;
      }
      if (sizeof($wrongOrd)>0) {
         $word .= fixChars($wrongOrd);
      }
   $words[] = $word;
}

function fixChars ($wrongOrd) {
   $char = '';
   foreach($wrongOrd as $pos=>$ord) {
      $bin = decbin($ord);
      if ($pos == 0 && strlen($bin) == 8 && substr($bin,0,3) == '111') {
         $bin = '110'.substr($bin,3);
      }
      $char .= chr(bindec($bin));
   }
   return $char;
}