Bug #19580 FULLTEXT search produces wrong results on UTF-8 columns
Submitted: 6 May 2006 16:39 Modified: 22 Jun 2006 18:16
Reporter: Steffen Weber
Status: Closed
Category:Server Severity:S3 (Non-critical)
Version:4.1.18 OS:Linux (Gentoo Linux)
Assigned to: Sergey Vojtovich Target Version:

[6 May 2006 16:39] Steffen Weber
Description:
MySQL handles the characters „ (HTML „) and “ (HTML “) as if they belong
to the word in between them, although they should be treated like a blank by the FULLTEXT
index.

How to repeat:
Execute the following SQL queries:

CREATE TABLE `fulltext_utf8` (
	`id` INT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY ,
	`text` VARCHAR( 255 ) NOT NULL ,
	FULLTEXT (`text`)
) ENGINE = MYISAM DEFAULT CHARSET=utf8;
INSERT INTO `fulltext_utf8` VALUES (NULL , '„MySQL“'), (NULL , 'MySQL');
SELECT `id`, `text` FROM `fulltext_utf8` WHERE MATCH (`text`) AGAINST ('MySQL' IN BOOLEAN
MODE);

+----+-------+
| id | text  |
+----+-------+
|  2 | MySQL |
+----+-------+
1 row in set (0.00 sec)

As you can see the SELECT statement only matches the second row, although it should match
both rows.

Suggested fix:
Treat the mentioned characters as if they were blanks in the FULLTEXT index.
[25 May 2006 20:16] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6883
[29 May 2006 13:52] Bugs System
A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/6965
[19 Jun 2006 21:05] Sergey Vojtovich
Pushed into tree currently marked as 5.1.12.
[22 Jun 2006 16:39] Sergey Vojtovich
Paul,

since there is a need to rebuild index only, REPAIR TABLE tbl_name QUICK is fine.
[22 Jun 2006 18:16] Paul DuBois
Noted in 5.1.12 changelog, and in the upgrading-to-5.1
section.

Incompatible change: For utf8 columns, the full-text parser
incorrectly considered several non-word punctuation and whitespace
characters as word characters, causing some searches to return
incorrect results.

The fix involves a change to the full-text parser, so any tables that
have FULLTEXT indexes on utf8 columns must be repaired with REPAIR
TABLE:  

REPAIR TABLE tbl_name QUICK;
[13 Jun 2007 10:33] Sergey Vojtovich
BUG#28861 was marked as duplicate.