Bug #61864 Implement B-Tree indexes on CSV files
Submitted: 14 Jul 2011 11:38
Reporter: David Nuttall
Status: Open
Category: MySQL Server: CSV
Severity: S4 (Feature request)
OS: Any
CPU Architecture: Any
Tags: INDEX

[14 Jul 2011 11:38] David Nuttall
Description:
CSV files are flat text files, so a search for a record must on average read through half the table to find it.  Because records stored by the CSV engine are variable length, even for columns that are not normally considered variable length, such as DOUBLE or INT, there is no easy way to calculate where the next field or record begins; each character must be read in turn, looking for commas and newlines.  Without any other mechanism, every search is a brute-force scan.
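The brute-force scan described above can be sketched in Python (a toy illustration of the access pattern, not server code; the table and function names are hypothetical):

```python
import csv
import io

# Toy CSV "table" (id, name); in the real engine this would be a .CSV file on disk.
data = io.StringIO("1,apple\n2,banana\n3,cherry\n")

def brute_force_find(f, key):
    """Read every row from the top until the key matches: O(n) per lookup."""
    f.seek(0)
    for row in csv.reader(f):
        if row[0] == key:
            return row
    return None

print(brute_force_find(data, "3"))  # every lookup re-reads the file from the start
```

Because there is no index, even a lookup on the first column has to parse every comma and newline up to the matching record.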

How to repeat:
When I try to add an index to a CSV table, even a PRIMARY KEY, I get the error message:
#1069 - Too many keys specified; max 0 keys allowed.

Suggested fix:
I would like to recommend implementing B-Tree indexes on CSV files.  Each index entry would hold either a copy of the indexed data in a standard (non-text) layout, or a pointer to that data in the CSV file.  Each entry would also hold a pointer to the beginning of its record in the CSV file.

I would recommend that records be written in arbitrary order (in the order they are INSERTed) and only be clustered on the primary key during an OPTIMIZE TABLE operation.
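A minimal Python sketch of the proposed key-to-offset index (for illustration only: a dict stands in for the B-Tree, rows are stored in arbitrary INSERT order, and all names are hypothetical):

```python
import csv
import io

# Rows in arbitrary INSERT order, as the suggestion allows.
csv_file = io.StringIO("2,banana\n1,apple\n3,cherry\n")

def build_index(f):
    """One full pass over the file: map primary-key value -> offset of its record."""
    index = {}
    f.seek(0)
    offset = f.tell()
    for line in iter(f.readline, ""):
        key = line.split(",", 1)[0]  # first column is the primary key
        index[key] = offset          # remember where this record begins
        offset = f.tell()
    return index

def indexed_find(f, index, key):
    """Seek straight to the record instead of scanning the whole file."""
    if key not in index:
        return None
    f.seek(index[key])
    return next(csv.reader([f.readline()]))

idx = build_index(csv_file)          # built once; lookups afterwards avoid the scan
print(indexed_find(csv_file, idx, "3"))
```

A real implementation would keep the keys in a B-Tree so range scans and ordered traversal stay cheap, and would persist the structure alongside the .CSV file; the dict here only demonstrates the key-to-offset idea.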

Advantages:
- PRIMARY, UNIQUE and INDEX keys could be defined and enforced in the index, without actually having to look in the data file until a hit is registered.
- Programs that generate SQL, such as PhpMyAdmin, could read the index list and generate searches with much simpler WHERE clauses.  Currently, PhpMyAdmin uses every field to identify records when a PRIMARY KEY does not exist.
- An auto-increment field could be assigned and enforced.
- Searches in WHERE clauses would become much faster, potentially approaching the speed of other engines such as InnoDB or MyISAM.
- An INSERT would append records to the end of the file, and the index would only need to add a few entries, a relatively simple operation.
- Improve or allow support for SELECT, INSERT, UPDATE, DELETE, OPTIMIZE TABLE, and TRUNCATE TABLE.
- Allow the definition of internal relations, as with other indexes.

Disadvantages:
- After almost any UPDATE or DELETE, the index would likely have to be rebuilt entirely.  This would make the already slow procedure of UPDATE or DELETE on a text file even slower.  The documentation should note that UPDATEs and DELETEs should be infrequent, if at all possible, and that SELECTs and INSERTs should be the primary operations on this table type.

UPDATE and DELETE operations on a CSV table are already slow, since the table essentially has to be rewritten from the affected record onward.  The index would let the WHERE clauses of UPDATE and DELETE find matching records faster, but the rest of the operation would be the same speed (rewriting the CSV file) or slower (rebuilding the index).