Bug #19580 FULLTEXT search produces wrong results on UTF-8 columns
Submitted: 6 May 2006 14:39 Modified: 22 Jun 2006 16:16
Reporter: Steffen Weber Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:4.1.18 OS:Linux (Gentoo Linux)
Assigned to: Sergey Vojtovich CPU Architecture:Any

[6 May 2006 14:39] Steffen Weber
Description:
MySQL handles the characters „ (HTML „) and “ (HTML “) as if they belong to the word in between them, although they should be treated like a blank by the FULLTEXT index.

How to repeat:
Execute the following SQL queries:

CREATE TABLE `fulltext_utf8` (
	`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
	`text` VARCHAR( 255 ) NOT NULL ,
	FULLTEXT (`text`)
) ENGINE = MYISAM DEFAULT CHARSET=utf8;
INSERT INTO `fulltext_utf8` VALUES (NULL , '„MySQL“'), (NULL , 'MySQL');
SELECT `id`, `text` FROM `fulltext_utf8` WHERE MATCH (`text`) AGAINST ('MySQL' IN BOOLEAN MODE);

+----+-------+
| id | text  |
+----+-------+
|  2 | MySQL |
+----+-------+
1 row in set (0.00 sec)

As you can see the SELECT statement only matches the second row, although it should match both rows.

Suggested fix:
Treat the mentioned characters as if they were blanks in the FULLTEXT index.
[25 May 2006 18:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6883
[29 May 2006 11:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6965
[19 Jun 2006 19:05] Sergey Vojtovich
Pushed into tree currently marked as 5.1.12.
[22 Jun 2006 14:39] Sergey Vojtovich
Paul,

since there is a need to rebuild index only, REPAIR TABLE tbl_name QUICK is fine.
[22 Jun 2006 16:16] Paul DuBois
Noted in 5.1.12 changelog, and in the upgrading-to-5.1
section.

Incompatible change: For utf8 columns, the full-text parser
incorrectly considered several non-word punctuation and whitespace
characters as word characters, causing some searches to return
incorrect results.

The fix involves a change to the full-text parser, so any tables that
have FULLTEXT indexes on utf8 columns must be repaired with REPAIR
TABLE:  

REPAIR TABLE tbl_name QUICK;
[13 Jun 2007 8:33] Sergey Vojtovich
BUG#28861 was marked as duplicate.