Bug #34728 Fulltext parser misbehave on single apostrophe
Submitted: 21 Feb 2008 14:34 Modified: 22 Feb 2008 13:00
Reporter: Daniele Stanzani Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: FULLTEXT search Severity:S3 (Non-critical)
Version:5.0.51, 5.0.54, 5.1.23 OS:Linux
Assigned to: CPU Architecture:Any

[21 Feb 2008 14:34] Daniele Stanzani
Description:
I actually don't know whether this is a bug or not.

I'm filing this bug because I suppose this is not the right behavior.

I suppose there is a misbehavior of the fulltext parser when creating indexes.

Consider a column named title (type tinytext) with a fulltext index and a row of that column: "L'altro delitto"

When executing this query:

SELECT `id` FROM `table` WHERE MATCH (`title`) AGAINST ('+altro* +delitto*' IN BOOLEAN MODE ) >= 2

The table and column collation is latin1_swedish_ci.
This is my <ctype><map> for latin1 collation
<ctype>
<map>
 00
 20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10
 10 81 81 81 81 81 81 01 01 01 01 01 01 01 01 01
 01 01 01 01 01 01 01 01 01 01 01 10 10 10 10 10
 10 82 82 82 82 82 82 02 02 02 02 02 02 02 02 02
 02 02 02 02 02 02 02 02 02 02 02 10 10 10 10 20
 10 00 10 02 10 10 10 10 10 10 01 10 01 00 01 00
 00 10 10 10 10 10 10 10 10 10 02 10 02 00 02 01
 48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
 01 01 01 01 01 01 01 10 01 01 01 01 01 01 01 02
 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
 02 02 02 02 02 02 02 10 02 02 02 02 02 02 02 02
</map>
</ctype>

<collation name="latin1_swedish_ci">
<map>
 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
 41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49
 44 4E 4F 4F 4F 4F 5D D7 D8 55 55 55 59 59 DE DF
 41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49
 44 4E 4F 4F 4F 4F 5D F7 D8 55 55 55 59 59 DE FF
</map>
</collation>

As you can see, single apostrophe 0x27 is mapped to 0x10, Not-white, not a word.

I've seen that in the file myisam/ftdefs.h at row 26 is defined the single apostrophe ad as a word:

#define misc_word_char(X)       ((X)=='\'')

I'm not a C/C++ developer, but i suppose that the single apostrophe is hard coded in the source code of mysql, and that this specific condition overrides the charset <ctype><map>.

This is a great trouble for those languages, such as Italian, where "L'" is like "The" in English, and Italian grammar states that between the apostrophe and the noun must not be any white space.
So "L'altro" is  correct, "L' altro" is wrong in Italian.

Fulltext index considers "L'altro" as a single word, which is not wrong, because it is an article "L" plus apostrophe "'" plus noun "altro".

How to repeat:
Crate a table with a fulltext index on the colum "title"
+----------+---------------------+------+-----+---------+----------------+
| Field    | Type                | Null | Key | Default | Extra          |
+----------+---------------------+------+-----+---------+----------------+
| id       | int(10) unsigned    | NO   | PRI | NULL    | auto_increment | 
| title    | tinytext            | NO   | MUL |         |                | 
+----------+---------------------+------+-----+---------+----------------+

insert into that table (`id`,`title`)('1','L\'altro delitto')
run the query

mysql> SELECT `id` FROM `table` WHERE MATCH ( `title` ) AGAINST ( '+altro* +delitto*' IN BOOLEAN MODE ) >= 2;
Empty set (0.00 sec)

The query returns an empty result set.
It should return 1 result.

Suggested fix:
Remove the apostrophe as a built in word from the source code.
Consider using <ctype><map> for correct behaviour.
[21 Feb 2008 14:36] Daniele Stanzani
Errata corrige:

Fulltext index considers "L'altro" as a single word, which IS wrong, because it is an
article "L" plus apostrophe "'" plus noun "altro".
[21 Feb 2008 14:58] Sergey Vojtovich
See also BUG#14194.
[21 Feb 2008 17:22] Valeriy Kravchuk
Thank you for a bug report. Verified just as described with both 5.0.54 and 5.1.23-rc:

C:\Program Files\MySQL\MySQL Server 5.1\bin>mysql -uroot -proot -P3310 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.1.23-rc-community-debug MySQL Community Server - Debug (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table it(id int primary key, title tinytext, fulltext(title)) engine=MyISAM;
Query OK, 0 rows affected (0.05 sec)

mysql> insert into it values ('1','L\'altro delitto');
Query OK, 1 row affected (0.03 sec)

mysql> SELECT `id` FROM it WHERE MATCH ( `title` ) AGAINST ( '+altro* +delitto*'
 IN BOOLEAN MODE ) >= 2;
Empty set (0.03 sec)
[22 Feb 2008 13:00] Sergei Golubchik
Daniele, this (apostrophe being hard-coded as part of a word) is already fixed in 5.1.

Valeriy, your test case doesn't prove anything. This query will never return anything. This MATCH can never return a value larger than or equal to 2. Try it without an apostrophe.