MySQL Bugs: #38352: MATCH ... AGAINST ... WITH PARSER ...

Bug #38352	MATCH ... AGAINST ... WITH PARSER ...
Submitted:	24 Jul 2008 17:56	Modified:	24 Jul 2008 18:42
Reporter:	Hartmut Holzgraefe
Status:	Verified
Category:	MySQL Server: FULLTEXT search	Severity:	S4 (Feature request)
Version:	5.1	OS:	Any
Assigned to:	Assigned Account	CPU Architecture:	Any

Description:
When a fulltext parser plugin is used, the same plugin tokenizes both the column data and the AGAINST() search terms. The data passed to AGAINST() may be in a different format than the stored data, though; for data stored in compressed form or in a document format such as PDF, it most likely is. It would therefore be nice to be able to declare the desired parser plugin not only per FULLTEXT INDEX but also per MATCH ... AGAINST expression, on a case-by-case basis.
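
To illustrate the mismatch, here is a minimal Python sketch. It is a simulation only, not MySQL plugin code: the parser function `compressed_doc_parser` and the choice of zlib as the storage format are assumptions made up for this example. It shows a parser that works on the stored (compressed) column data but fails when the plain-text search term is routed through it as well.

```python
import zlib
from typing import List


def compressed_doc_parser(data: bytes) -> List[str]:
    """Hypothetical index-side parser: expects zlib-compressed UTF-8 text."""
    text = zlib.decompress(data).decode("utf-8")
    return text.lower().split()


# Column data is stored compressed, so the parser can tokenize it.
stored = zlib.compress(b"Hello fulltext world")
index_tokens = compressed_doc_parser(stored)

# The search term is plain text, but it is passed to the same parser,
# which tries to decompress it and fails.
query_tokens = None
query_error = None
try:
    query_tokens = compressed_doc_parser(b"fulltext")
except zlib.error as exc:
    query_error = exc
```

With a separate search-side parser (here that would be a plain `split()`), the query term would produce the token `fulltext`, which matches one of the indexed tokens.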

Additional request: make it possible to specify different "index" and "search" parsers in the index definition right away.

How to repeat:
Create a fulltext parser plugin and see that both column data and AGAINST() arguments are passed to the same plugin.

Suggested fix:
Extend MATCH ... AGAINST with a "WITH PARSER ..." clause (similar to the already existing extension to CREATE INDEX) and pass the AGAINST() expression to the specified parser instead of the index's default one.

Additional request:

Extend CREATE INDEX with a 2nd "WITH PARSER ..." clause, e.g.

  FULLTEXT INDEX (doc) WITH PARSER my_parser WITH SEARCH_PARSER default;

or 

  FULLTEXT INDEX(doc) WITH PARSER my_parser FOR INDEX, WITH PARSER default FOR SEARCH;

There are also use cases with data that is not really textual, where stored and queried data may differ in format, syntax, or encoding but come down to the same tokens being used internally.

I'm thinking about a protein database right now where

- rows store DNA/RNA nucleotide sequences using the A-C-G-T or A-C-G-U "alphabet"
- queries search for amino acid sequences instead, using one-letter amino acid codes
- the indexed tokens are nucleotide triplets
- from the input alone it is not clear whether the sequence "GAC" is
  - the nucleotide triplet codon for Aspartic acid
  - the amino acid sequence Glycine-Alanine-Cysteine

Here a combination of

  FULLTEXT ... WITH PARSER dna_sequence;

and 

  MATCH .. AGAINST ('G A C') WITH PARSER dna_sequence;

  MATCH .. AGAINST ('G A C') WITH PARSER rna_sequence;

  MATCH .. AGAINST ('G A C') WITH PARSER amino_sequence;

would be needed to allow different queries against a DNA sequence database.
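
A rough Python sketch of how the three parsers could turn the same query string into index tokens. Only the parser names `dna_sequence`, `rna_sequence` and `amino_sequence` come from the example above; the function `parse_query`, the partial codon table, and the assumption that indexed tokens are stored in the DNA (A-C-G-T) alphabet are made up for illustration. This is not the MySQL plugin API. Each query position is represented as a set of acceptable triplets, since one amino acid can be encoded by several codons.

```python
from typing import Callable, Dict, FrozenSet, List

Tokens = List[FrozenSet[str]]

# Partial codon table in the DNA alphabet; it covers only the three
# amino acids used in the example (assumption: tokens are DNA triplets).
CODONS_FOR_AMINO: Dict[str, FrozenSet[str]] = {
    "G": frozenset({"GGT", "GGC", "GGA", "GGG"}),  # glycine
    "A": frozenset({"GCT", "GCC", "GCA", "GCG"}),  # alanine
    "C": frozenset({"TGT", "TGC"}),                # cysteine
}


def _letters(text: str) -> str:
    """Drop whitespace and upper-case the input."""
    return "".join(text.split()).upper()


def dna_sequence(text: str) -> Tokens:
    """Read the input as nucleotides; a trailing incomplete triplet is dropped."""
    s = _letters(text)
    return [frozenset({s[i:i + 3]}) for i in range(0, len(s) - len(s) % 3, 3)]


def rna_sequence(text: str) -> Tokens:
    """Read the input as RNA and map U to T before splitting into triplets."""
    return dna_sequence(_letters(text).replace("U", "T"))


def amino_sequence(text: str) -> Tokens:
    """Read each letter as an amino acid and expand it to its codons."""
    return [CODONS_FOR_AMINO[letter] for letter in _letters(text)]


PARSERS: Dict[str, Callable[[str], Tokens]] = {
    "dna_sequence": dna_sequence,
    "rna_sequence": rna_sequence,
    "amino_sequence": amino_sequence,
}


def parse_query(text: str, parser: str = "dna_sequence") -> Tokens:
    """Hypothetical per-query parser selection, mirroring WITH PARSER."""
    return PARSERS[parser](text)
```

Under these assumptions, `parse_query('G A C', 'dna_sequence')` yields a single triplet token, while `parse_query('G A C', 'amino_sequence')` yields three positions, each with several candidate triplets. The query string alone cannot tell the two readings apart, which is why the parser has to be chosen per query.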