MySQL Bugs: #3632: load data local ... lines starting by = inserts bad data

Bug #3632	load data local ... lines starting by = inserts bad data
Submitted:	3 May 2004 5:35	Modified:	27 Jan 2012 19:51
Reporter:	dave mausner	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: DML	Severity:	S4 (Feature request)
Version:	5.1	OS:	Any
Assigned to:		CPU Architecture:	Any

Description:
problem: null and incorrect column values, with huge warning counts, inserted into tables when local load file contains two sets of input records marked with different prefix strings. the intention was to run both LOAD DATA LOCAL commands over the same data file.

workaround: had to separate the data for each table into different files. then the errors disappear, correct record counts, no warnings.

is this a bug? yes: the mySQL doc states that STARTING BY ignores a record that does not match the prefix string.

sql syntax used:
LOAD DATA LOCAL INFILE 'reload.ctl' INTO TABLE mm
FIELDS TERMINATED BY ','  ENCLOSED BY ''
LINES STARTING BY '01'
( ... );
LOAD DATA LOCAL INFILE 'reload.ctl' INTO TABLE vl
FIELDS TERMINATED BY ','  ENCLOSED BY ''
LINES STARTING BY '02'
( ... );

local data file syntax (small excerpt):
012004/03/13-13:12:38,River Forest Il,600 Franklin Ave.,60305,
012004/03/13-13:12:38,Elmhurst Il,149 Clara Pl.,60126,
023,2004/01/02-16:13:04,25,N,60302,
023,2004/01/02-16:21:08,0,N,60302,

my best guess: the bad data appeared to come from the data with the other record prefix string. the prefix match may have a bug in it. not anchored?

How to repeat:
i can forward a complete test script (two commands) and datafile (a few thousand records) which caused the original problem.

test data for #3632, load data ... starting by

Attachment: bug.zip (application/x-zip-compressed, text), 105.06 KiB.

Thank you for taking the time to write to us, but this is not
a bug. Please double-check the documentation available at
http://www.mysql.com/documentation/ and the instructions on
how to report a bug at http://bugs.mysql.com/how-to-report.php

Additional info:

I took a quick look a the test case and found some errors in it:

The test data was terminated with \r\n, not by \n, which is the default line terminator.

Another problem was that LINES STARTING BY xxx means that the MySQL will assume the lines start is at the next occurence of xxx. In the test you used 01 and 02 as LINES STARTING BY, but as the data also included 01 and 02 MySQL did read in some data from these lines.

Fix:

Change that lines starts with X1 and X2 instead of 01 and 02

Use the command:

LOAD DATA LOCAL INFILE 'bug.dat' INTO TABLE mm FIELDS TERMINATED BY ','  ENCLOSED BY 0x10 LINES STARTING BY 'X1' TERMINATED BY "\r\n" (CD,CS,ED,EM,F2,F3,FN,L2,L3,LN,LT,LV,MA,MC,PH,SD,ST,ZC);

To read in the data.

I will update the MySQL manual to make this the LINES STARTING BY option more clear.

Regards,
Monty

The exact text from the mySQL manual is:
"If all the lines you want to read in has a common prefix that you want
to skip, you can use LINES STARTING BY prefix_string for this. In other
words, the defaults cause LOAD DATA INFILE to act as follows when
reading input: · Look for line boundaries at newlines. · If LINES
STARTING BY prefix is used, read until prefix is found and start reading
at character after prefix. If line doesn't include prefix it will be
skipped."

The key phrase is "...all the lines you want to read in [have] a common prefix that you want to skip..."

A prefix is defined as something that comes before everything else. When saying that lines have a prefix, you are saying that the prefix occurs before everything else on the line.

I question whether the code should be changed to usefully represent the doc, or the doc changed to suit a less-useful coding.

In the real world, we find that records may have record codes in fixed positions, generally (but not always) at the start of the record -- a prefix. however, other LOAD DATA solutions include a means to specify the exact position of that code. It should NOT be left to chance to find the "prefix" just anywhere in the record, and it is unrealistic to assert that one should change the data to suit mySQL's behavior.

Here are my suggestions:
(1) let LINES STARTING BY 'X' continue to mean "line contains 'X' anywhere and data prior to 'X' in the record is skipped" and document this as such.
(2) add LINES STARTING BY 'X' POSITION N which means "line contains 'X' beginning at character position N relative to the beginning of the record.

thanks. if you suggest it, i will submit this idea in any formal manner.

Please convert to feature request.

Many thanks for writing a feature request. Our development will discuss this.

Ping.  Checking to see if this feature request can be considered.

Review: allow LINES STARTED BY to have an additional syntax to specify fixed-position prefix.