Bug #28581 | Wrong matches on utf-8 words (cyrillic) | ||
---|---|---|---|
Submitted: | 22 May 2007 8:15 | Modified: | 29 Jul 2007 9:38 |
Reporter: | Nicolae Namolovan | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server: FULLTEXT search | Severity: | S3 (Non-critical) |
Version: | 5.1.18 | OS: | FreeBSD (6.1) |
Assigned to: | Sergey Vojtovich | CPU Architecture: | Any |
[22 May 2007 8:15]
Nicolae Namolovan
[22 May 2007 9:55]
Sveta Smirnova
Thank you for the report. I can not repeat described issue with current sources. Please upgrade to current version 5.1.18, try with it and if you can repeat provide output of SHOW VARIABLES LIKE '%char%'
[30 May 2007 7:20]
Nicolae Namolovan
(tested on 5.1.18) >SHOW VARIABLES LIKE '%char%' It is standart +--------------------------+----------------------------------+ | character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | utf8 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | utf8 | | character_set_system | utf8 | | character_sets_dir | /usr/local/share/mysql/charsets/ | +--------------------------+----------------------------------+ If I do "set names utf8" It work fine.. But If I do "set names utf8" and try to select from a table where the inserts was without "set names utf8", no results. That's strange because with 5.0 all works fine without "set names utf8". Anyway, I found a workarround, If I put the search words in double quotes ("), everything works fine.. ------------ SELECT name, MATCH (test_searchindex.name) AGAINST ('+"Саботаж"' IN BOOLEAN MODE) FROM test_searchindex WHERE MATCH (test_searchindex.name) AGAINST ('+"Саботаж"' IN BOOLEAN MODE); array(1) { [0]=> array(2) { ["name"]=> string(14) "Саботаж" ["MATCH (test_searchindex.name) AGAINST ('+\"Саботаж\"' IN BOOLEAN MODE)"]=> string(1) "1" } } ------------ It is logic what "double quotes" helps ? I still think it is a bug..
[30 May 2007 7:49]
Sveta Smirnova
Thank you for the feedback. Verified inconsistency between 5.0, 4.1 and 5.1 version. Foolowing case loadable by our test suite returns correct result in 5.0 and 4.1, but nothing in 5.1: set names latin1; CREATE TABLE `test_searchindex` ( `id` int(11) unsigned NOT NULL AUTO_INCREMENT, `name` varchar(255) DEFAULT NULL, PRIMARY KEY (`id`), FULLTEXT KEY `name` (`name`) ) ENGINE=MyISAM DEFAULT CHARSET=utf8; INSERT INTO test_searchindex SET name='Саботаж'; INSERT INTO test_searchindex SET name='Лис и его охотничий пес'; SELECT name, MATCH (test_searchindex.name) AGAINST ('+Саботаж' IN BOOLEAN MODE) FROM test_searchindex WHERE MATCH (test_searchindex.name) AGAINST ('+Саботаж' IN BOOLEAN MODE); Severity has been decreased, because: 1. No crash 2. There is workaround: set names utf8;
[4 Jun 2007 12:47]
Sergey Vojtovich
Hmm...try echo "Саботаж" |iconv -f latin1 -t utf8 and you'll get nonsensical string, which gets stored in a database. set names utf8 is not a workaround, but rather correct usage in this case. If one feeds database with UTF-8 string and says that it is actually latin1 - he will likely get unexpected results. It worked well with 5.0 because we had dummy ctype support for multi-byte character sets (all UTF-8 symbols with first byte value higher than 127 were considered as letters). In 5.1 we have fair ctype support.
[5 Jun 2007 13:12]
Nicolae Namolovan
Sergey, thank you for answer. Can you tell me please how to fix my database ? I want to use SET NAMES utf8 everywhere but can't because (like you said) I insert utf8 strings to database saying what it's latin1 ;o) Will smth. like..: SET NAMES latin1; $array = SELECT ... SET NAMES utf8; INSERT INTO ..$array fix my problem ? Is there a more fast way to convert.. ?
[5 Jun 2007 14:20]
Sergey Vojtovich
Yes, that should fix the problem. Also following statements worked for me: ALTER TABLE t1 MODIFY a BLOB; UPDATE t1 SET a=CONVERT(CAST(a as CHAR CHARACTER SET utf8) using latin1); ALTER TABLE t1 MODIFY a TEXT CHARSET utf8;
[29 Jul 2007 9:38]
Nicolae Namolovan
Thanks Sergey, that worked for me too. Had a nice night ;] Remark for those like me who want to fix a char/text column with some self made script, if column have a key without a key length, when converted to BLOB you'll get a error "BLOB/TEXT column 'info_hash' used in key specification without a key length", that's why it's a good idea to always put key length..