MySQL Bugs: #38356: myisam_ftdump output of unicode characters such as 'ć' unreadable

Bug #38356	myisam_ftdump output of unicode characters such as 'ć' unreadable
Submitted:	24 Jul 2008 21:04	Modified:	18 Jul 2014 19:44
Reporter:	Miguel K
Status:	Verified
Category:	MySQL Server: FULLTEXT search	Severity:	S3 (Non-critical)
Version:	5.5	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
myisam_ftdump -d outputs data in MySQL's internal encoding, not utf8 or other useful encoding.

For instance, a word with the letter 'ć' is not output in an encoding that is useful.

How to repeat:
Create a table with a utf8 column with a full text index.

Put this word in that column:
ćwiczyć

Now use myisam_ftdump to retrieve the contents of the full text index.
You get:
ä‡wiczyä‡
instead of
ćwiczyć

Suggested fix:
Output data in utf8 or other useful encoding.

Verified with the current 5.0 bzr tree.

I used utf8 terminal.

SET NAMES utf8;

CREATE database bug38356 character set utf8;
USE bug38356;
CREATE table t(c char(100) character set utf8, v varchar(100) character set utf8, t text character set utf8);
CREATE fulltext index ic on t(c);
CREATE fulltext index iv on t(v);
CREATE fulltext index it on t(t);

INSERT into t values('bär','bär','bär'),('ćwiczyć','ćwiczyć','ćwiczyć');

SELECT length(c), length(v), length(t) from t;

LENGTH() returns 4 for 'bär' and 9 for 'ćwiczyć', so the data is stored as UTF-8.
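As a cross-check outside MySQL, the same byte counts can be reproduced with a few lines of Python, since LENGTH() counts bytes rather than characters. This is only an illustrative sketch; the variable names are invented for the example.

```python
# Byte length of each test word when encoded as UTF-8, which is what
# MySQL's LENGTH() reports for a utf8 column.
words = ["b\u00e4r", "\u0107wiczy\u0107"]  # 'bär', 'ćwiczyć'

byte_lengths = {w: len(w.encode("utf-8")) for w in words}
char_lengths = {w: len(w) for w in words}

for w in words:
    print(w, char_lengths[w], "chars,", byte_lengths[w], "bytes")
```

Each non-ASCII letter here takes two bytes in UTF-8, which accounts for the difference between the character count and the byte count.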

$ ./bin/myisam_ftdump /home/myhome/mysql50bzr/var/bug38356/t 0

Longest word: 9 chars (��wiczy��)

My terminal is still set to utf8 ....

Index numbers 1 and 2 give the same result.

Now I switched terminal to ISO-8859-15:

Longest word: 9 chars (äwiczyä)

Why the hell an 'ä'? Also, the two rectangles are not displayed in my terminal; I could only see them after copying and pasting the output here.

It should be 'ćwiczyć' and not 'äwiczyä'

The output is neither utf8 nor latin1 ... it looks like a double encoded utf8.

Same behaviour on MySQL 5.1 bzr tree.

After a little more investigation, I see that it is outputting the Hex value of that letter instead of the utf8 value.

SELECT HEX('ć');
gives
C487

Looking at the hex output of the myisam_ftdump, you see that it is encoded as C487 as well.

I'm still experiencing this bug, but I'm not finding characters to be their simple hex values.

Rather, two-byte sequences are having the third bit of their first byte set.

For example, "ä" is unicode 228 (binary 11100100). 

To make UTF-8, we substitute the x's and y's:

110yyyxx 10xxxxxx

To get:

11000011 10100100

But what MySQL is putting in the files is:

11100011 10100100
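The arithmetic above can be replayed in Python. This is a minimal sketch that assumes only the two-byte UTF-8 layout quoted in this comment; the names `lead`, `cont`, `expected` and `observed` are made up for the illustration.

```python
cp = 228  # code point of 'ä' (U+00E4), binary 11100100

# Two-byte UTF-8 layout: 110yyyxx 10xxxxxx
lead = 0b11000000 | (cp >> 6)          # top bits go into the lead byte
cont = 0b10000000 | (cp & 0b00111111)  # low six bits go into the continuation byte
expected = bytes([lead, cont])

# The pair found in the dump differs only in the third bit (0x20) of the lead byte.
observed = bytes([lead | 0b00100000, cont])

print(format(expected[0], "08b"), format(expected[1], "08b"))
print(format(observed[0], "08b"), format(observed[1], "08b"))
```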

I wrote my own utility to change the initial "111" to "110" in all two-byte sequences which start with a byte >127, and it fixed all the characters in a variety of European languages to be correct UTF-8.
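The utility itself was not posted. The following Python sketch shows one way such a fix-up could look, under the assumption that only two-byte sequences are damaged; the function name `clear_third_bit` is hypothetical. It rewrites a `111xxxxx` byte to `110xxxxx` only when exactly one continuation byte follows, so well-formed three-byte sequences are left alone.

```python
def clear_third_bit(data: bytes) -> bytes:
    """Turn 111xxxxx lead bytes back into 110xxxxx in two-byte sequences."""

    def is_cont(i: int) -> bool:
        # True if data[i] exists and is a UTF-8 continuation byte (10xxxxxx).
        return i < len(data) and (data[i] & 0xC0) == 0x80

    out = bytearray(data)
    for i, b in enumerate(data):
        # Lead byte with the top three bits set, followed by exactly one
        # continuation byte: treat it as a damaged two-byte sequence.
        if (b & 0xE0) == 0xE0 and is_cont(i + 1) and not is_cont(i + 2):
            out[i] = b & 0xDF  # clear the 0x20 bit
    return bytes(out)


damaged = bytes.fromhex("e4877769637a79e487")  # bytes as seen in the dump
print(clear_third_bit(damaged).decode("utf-8"))
```

Checking the number of continuation bytes is a design choice of this sketch, not something described in the comment above. It cannot tell a damaged sequence from input that was already invalid.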

It's because of my_casedn_str() in myisam_ftdump.c, which was supposed to make the output nicer, but only works for latin1 :(
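The effect of lowercasing UTF-8 bytes as if they were latin1 can be imitated in Python. This is a rough sketch: `casedn_as_latin1` is a hypothetical stand-in written for this illustration, not the server's my_casedn_str(), and it relies on Python's own case mapping for the latin1 range.

```python
def casedn_as_latin1(utf8_bytes: bytes) -> bytes:
    """Lowercase a byte string as though every byte were one latin1 character."""
    return utf8_bytes.decode("latin-1").lower().encode("latin-1")


dumped = casedn_as_latin1("\u0107wiczy\u0107".encode("utf-8"))  # 'ćwiczyć'

print(dumped.hex())
# Viewed as Windows-1252, the bytes look like the output in the original report.
print(dumped.decode("cp1252"))
```

The lead byte 0xC4 is 'Ä' in latin1, so it is lowercased to 0xE4 ('ä'), while the continuation byte 0x87 is left unchanged. That is consistent with the 'ä' seen on the ISO-8859-15 terminal.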

Wow, nearly five years later, I am running into the same bug!

It doesn't seem like a difficult one to fix.

Still a problem in 5.5.

After 6 years, I finally have a work-around!!!

The problem is that it outputs it in the wrong encoding and additionally has changed the wrong encoding to lowercase.

UTF-8 badly encoded as Latin1 would show "niño" as:
niÃ±o

But this program outputs the Ã as an ã:
niã±o

The work-around is to change the text to uppercase, convert it, and then back to lowercase.  It only took me six years to figure it out.

I do it in Notepad++
Highlight text > Edit > Convert to > Uppercase
Encoding > Encode in UTF-8
Highlight text > Edit > Convert to > Lowercase

If you have imported it into MySQL, use this (Spanish, Greek, Arabic below):
select lower(convert(binary convert(upper("niã±a") using latin1) using utf8));
select lower(convert(binary convert(upper("îºî»î¬ïˆî±") using latin1) using utf8));
select lower(convert(binary convert(upper("øªøø¯ùŠø¯") using latin1) using utf8));
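For dump files processed outside MySQL, the same uppercase, reinterpret, lowercase idea can be sketched in Python. The uppercase table below is my own approximation of how latin1 uppercasing treats each byte, not something taken from the server source, and the function names are hypothetical.

```python
def latin1_upper_table() -> bytes:
    """Byte translation table approximating latin1 uppercasing (an assumption)."""
    table = bytearray(range(256))
    for b in range(0x61, 0x7B):   # a-z -> A-Z
        table[b] = b - 0x20
    for b in range(0xE0, 0xFF):   # 0xE0-0xFE -> 0xC0-0xDE
        if b != 0xF7:             # the division sign has no uppercase form
            table[b] = b - 0x20
    return bytes(table)


def undo_dump_case(dumped: bytes) -> str:
    """Uppercase as latin1, reinterpret the bytes as UTF-8, then lowercase."""
    return dumped.translate(latin1_upper_table()).decode("utf-8").lower()


print(undo_dump_case("ni\u00e3\u00b1a".encode("latin-1")))  # input is 'niã±a'
```

Bytes are translated directly rather than with `str.upper()`, because Python uppercases a few latin1-range characters to code points outside latin1. Input that contains genuine three-byte UTF-8 sequences would make the `decode` call raise `UnicodeDecodeError`, so this only covers text made of one-byte and two-byte characters.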

The scary thing is that this might be how full-text index data for utf8_unicode_ci is stored in mysql: lowercase versions of a misencoding to latin1.

Happy sixth birthday, bug!

I currently have the same issue in 5.6.

I exported the myisam_ftdump output from a full-text index I had and saved it in a txt file.
The txt file was in latin1 encoding, and I converted it to UTF-8 with the following command:
iconv -f Latin1 -t utf-8 ~/Desktop/test.txt > ~/Desktop/1.txt

By using the following command it perfectly worked for cases where the character  was not present.
lower(convert(binary convert(upper("Î§Î±Î»ÎºÎ¹Î´Î¹ÎºÎ®") using latin1) using utf8))

By feeding back the mysql with all instances one-by-one (or with a script), I can read the index entries, yet what about the ones with the  which seems to be totally corrupted?

Any idea or workaround is more than welcome!

Obviously the corrupted character was not displayed in my comment above; picture a square symbol containing 4 letters and/or digits.

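Same problem, solved with a PHP script:

<?php
// Reads one dumped word per line from words.txt and collects the
// repaired words in $words.
$words = array();
$wordH = fopen(__DIR__.'/words.txt', 'r');
while (($rawWord = fgets($wordH)) !== false) {
   $rawWord = trim($rawWord);
   $word = '';
   $wrongOrd = array(); // byte values of the current run of non-ASCII bytes
   for ($c = 0; $c < strlen($rawWord); $c++) {
      $char = $rawWord[$c];
      if (ord($char) > 127) {
         // Buffer non-ASCII bytes until the run ends.
         $wrongOrd[] = ord($char);
         continue;
      } else if (sizeof($wrongOrd) > 0) {
         // An ASCII byte ends the run: append the repaired run first,
         // then fall through so the ASCII byte itself is kept.
         $word .= fixChars($wrongOrd);
         $wrongOrd = array();
      }
      $word .= $char;
   }
   // Flush a run that reaches the end of the word.
   if (sizeof($wrongOrd) > 0) {
      $word .= fixChars($wrongOrd);
   }
   $words[] = $word;
}
fclose($wordH);

// Rebuilds a run of non-ASCII bytes, changing a leading 111xxxxx byte
// back to 110xxxxx. Only the first byte of the run is adjusted, so a run
// of several consecutive non-ASCII characters is only partly repaired.
function fixChars ($wrongOrd) {
   $char = '';
   foreach ($wrongOrd as $pos => $ord) {
      $bin = decbin($ord);
      if ($pos == 0 && strlen($bin) == 8 && substr($bin, 0, 3) == '111') {
         $bin = '110'.substr($bin, 3);
      }
      $char .= chr(bindec($bin));
   }
   return $char;
}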