MySQL Bugs: #34728: Fulltext parser misbehave on single apostrophe

Bug #34728	Fulltext parser misbehave on single apostrophe
Submitted:	21 Feb 2008 14:34	Modified:	22 Feb 2008 13:00
Reporter:	Daniele Stanzani	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: FULLTEXT search	Severity:	S3 (Non-critical)
Version:	5.0.51, 5.0.54, 5.1.23	OS:	Linux
Assigned to:		CPU Architecture:	Any

Description:
I actually don't know whether this is a bug or not.

I'm filing this bug because I suppose this is not the right behavior.

I suppose there is a misbehavior of the fulltext parser when creating indexes.

Consider a column named title (type tinytext) with a fulltext index and a row of that column: "L'altro delitto"

When executing this query:

SELECT `id` FROM `table` WHERE MATCH (`title`) AGAINST ('+altro* +delitto*' IN BOOLEAN MODE ) >= 2

The table and column collation is latin1_swedish_ci.
This is my <ctype><map> for latin1 collation
<ctype>
<map>
 00
 20 20 20 20 20 20 20 20 20 28 28 28 28 28 20 20
 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20 20
 48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 84 84 84 84 84 84 84 84 84 84 10 10 10 10 10 10
 10 81 81 81 81 81 81 01 01 01 01 01 01 01 01 01
 01 01 01 01 01 01 01 01 01 01 01 10 10 10 10 10
 10 82 82 82 82 82 82 02 02 02 02 02 02 02 02 02
 02 02 02 02 02 02 02 02 02 02 02 10 10 10 10 20
 10 00 10 02 10 10 10 10 10 10 01 10 01 00 01 00
 00 10 10 10 10 10 10 10 10 10 02 10 02 00 02 01
 48 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10 10
 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01 01
 01 01 01 01 01 01 01 10 01 01 01 01 01 01 01 02
 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02 02
 02 02 02 02 02 02 02 10 02 02 02 02 02 02 02 02
</map>
</ctype>

<collation name="latin1_swedish_ci">
<map>
 00 01 02 03 04 05 06 07 08 09 0A 0B 0C 0D 0E 0F
 10 11 12 13 14 15 16 17 18 19 1A 1B 1C 1D 1E 1F
 20 21 22 23 24 25 26 27 28 29 2A 2B 2C 2D 2E 2F
 30 31 32 33 34 35 36 37 38 39 3A 3B 3C 3D 3E 3F
 40 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
 50 51 52 53 54 55 56 57 58 59 5A 5B 5C 5D 5E 5F
 60 41 42 43 44 45 46 47 48 49 4A 4B 4C 4D 4E 4F
 50 51 52 53 54 55 56 57 58 59 5A 7B 7C 7D 7E 7F
 80 81 82 83 84 85 86 87 88 89 8A 8B 8C 8D 8E 8F
 90 91 92 93 94 95 96 97 98 99 9A 9B 9C 9D 9E 9F
 A0 A1 A2 A3 A4 A5 A6 A7 A8 A9 AA AB AC AD AE AF
 B0 B1 B2 B3 B4 B5 B6 B7 B8 B9 BA BB BC BD BE BF
 41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49
 44 4E 4F 4F 4F 4F 5D D7 D8 55 55 55 59 59 DE DF
 41 41 41 41 5C 5B 5C 43 45 45 45 45 49 49 49 49
 44 4E 4F 4F 4F 4F 5D F7 D8 55 55 55 59 59 DE FF
</map>
</collation>

As you can see, single apostrophe 0x27 is mapped to 0x10, Not-white, not a word.

I've seen that in the file myisam/ftdefs.h at row 26 is defined the single apostrophe ad as a word:

#define misc_word_char(X)       ((X)=='\'')

I'm not a C/C++ developer, but i suppose that the single apostrophe is hard coded in the source code of mysql, and that this specific condition overrides the charset <ctype><map>.

This is a great trouble for those languages, such as Italian, where "L'" is like "The" in English, and Italian grammar states that between the apostrophe and the noun must not be any white space.
So "L'altro" is  correct, "L' altro" is wrong in Italian.

Fulltext index considers "L'altro" as a single word, which is not wrong, because it is an article "L" plus apostrophe "'" plus noun "altro".

How to repeat:
Crate a table with a fulltext index on the colum "title"
+----------+---------------------+------+-----+---------+----------------+
| Field    | Type                | Null | Key | Default | Extra          |
+----------+---------------------+------+-----+---------+----------------+
| id       | int(10) unsigned    | NO   | PRI | NULL    | auto_increment | 
| title    | tinytext            | NO   | MUL |         |                | 
+----------+---------------------+------+-----+---------+----------------+

insert into that table (`id`,`title`)('1','L\'altro delitto')
run the query

mysql> SELECT `id` FROM `table` WHERE MATCH ( `title` ) AGAINST ( '+altro* +delitto*' IN BOOLEAN MODE ) >= 2;
Empty set (0.00 sec)

The query returns an empty result set.
It should return 1 result.

Suggested fix:
Remove the apostrophe as a built in word from the source code.
Consider using <ctype><map> for correct behaviour.

Errata corrige:

Fulltext index considers "L'altro" as a single word, which IS wrong, because it is an
article "L" plus apostrophe "'" plus noun "altro".

See also BUG#14194.

Thank you for a bug report. Verified just as described with both 5.0.54 and 5.1.23-rc:

C:\Program Files\MySQL\MySQL Server 5.1\bin>mysql -uroot -proot -P3310 test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3
Server version: 5.1.23-rc-community-debug MySQL Community Server - Debug (GPL)

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table it(id int primary key, title tinytext, fulltext(title)) engine=MyISAM;
Query OK, 0 rows affected (0.05 sec)

mysql> insert into it values ('1','L\'altro delitto');
Query OK, 1 row affected (0.03 sec)

mysql> SELECT `id` FROM it WHERE MATCH ( `title` ) AGAINST ( '+altro* +delitto*'
 IN BOOLEAN MODE ) >= 2;
Empty set (0.03 sec)

Daniele, this (apostrophe being hard-coded as part of a word) is already fixed in 5.1.

Valeriy, your test case doesn't prove anything. This query will never return anything. This MATCH can never return a value larger than or equal to 2. Try it without an apostrophe.