Bug #40696 The CSV import routine intermitantly drops some characters.
Submitted: 13 Nov 2008 11:06 Modified: 24 Mar 2009 22:17
Reporter: D Peters Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server Severity:S3 (Non-critical)
Version:5.0.51a, 5.0.67, and 5.1.29-rc OS:Any (Knoppix 5.3 and OSX (10.5.5))
Assigned to: CPU Architecture:Any

[13 Nov 2008 11:06] D Peters
Description:
I intended import some data, from HUD's Insured Multifamily Mortgages database (a dBASE III file), into mySQL--only using FOSS.

I used OpenOffice to extract the data from the dBASE III file, and exported the results as a CSV file.

Afterwards, I attempted to import that CSV file into mySQL.  Yet, I noticed that the last character of several fields (prop_state, proj_zip, term_mo, hold_state, serv_state, and soa) is omitted consistently--but is not omitted for the other fields.

I first noticed this behavior, while working on my Mac, using the production-ready version of mysql (downloaded on 10/25/2008).  So, I reran my tests on another machine using an older and a newer (a release candidate) version of mysql, and I got the same results in both cases.

How to repeat:
# I found the HUD info:  http://www.hud.gov/offices/hsg/comp/rpts/mfh/mf_f47.cfm

# download the DBASE III file to import into mySQL
wget http://www.hud.gov/offices/hsg/comp/rpts/mfh/f47d.exe

# extracts the DBF file
(cd /tmp/&& unzip -q ~knoppix/f47d.exe ; ls )

# open the DBF file (/tmp/F47D.DBF) in OpenOffice, and save as CSV (/tmp/F47D.csv)

# set up mysql (on a non-production machine--only for testing)
sudo sed -i '/datadir/ s#= .*$#= /opt/mysql#' /etc/mysql/my.cnf
sudo mv /var/lib/mysql /opt/
sudo /etc/init.d/mysql start
sudo mysql -u root
  use mysql;
  delete from db; # delete the test accounts
  delete from user where not (host="localhost" and user="root");
  flush privileges;

# steps to reproduce the issue
~knoppix/processCSV.py > /tmp/_my.sql
(_d="/tmp/7"; (mkdir -p ${_d} && cd ${_d}/ && sudo strace -F -f -ff -i -r -tt -T -v -x -o mysql.trace mysql -u root </tmp/_my.sql 2>&1 | tee _res); history | tail -1 > ${_d}/cmd)
mysql -u root housing -e "select count(*) from stats; select * from stats limit 0,5"

# check results against the CSV data
#
# NOTE:  The last character of prop_state, proj_zip, term_mo, hold_state, serv_state, and soa is omitted consistently--but is not omitted for the other fields
head -n 5 /tmp/.F47D.csv
[13 Nov 2008 11:10] D Peters
a python script which generates a SQL file from the CSV to populate the database

Attachment: processCSV.py (application/x-python, text), 2.43 KiB.

[13 Nov 2008 11:13] D Peters
version info

Attachment: versionInfo (text/plain), 1.40 KiB.

[21 Nov 2008 10:08] Sveta Smirnova
Thank you for the report.

Which options you use when export DBF file ac CSV (fields delimiter, fields enclose)? Which version of Python do you use?
[22 Nov 2008 10:35] D Peters
The options I used to export the DBF file as CSV:
  fields delimiter = ,
  text delimiter   = '

I used python versions 2.4.5 and 2.5.1.

I also included the version information of my OS envs and the tools I used in the file,  versionInfo.
[24 Mar 2009 8:43] Sveta Smirnova
We're sorry, but the bug system is not the appropriate forum for asking help on using MySQL products. Your problem is not the result of a bug.

Support on using our products is available both free in our forums at http://forums.mysql.com/ and for a reasonable fee direct from our skilled support engineers at http://www.mysql.com/support/

Thank you for your interest in MySQL.

Thank you for the feedback.

I still get error from your script:

Traceback (most recent call last):
  File "processCSV.py", line 70, in <module>
    if int(ss[2])>8: str+=('19'+ss[2]+'-'+ss[0]+'-'+ss[1]+', ')
IndexError: list index out of range

So it looks like problem of the script for me.

If you disagree please provide test case repeatable only by MySQL means. I believe result sql produced by the script in your environment is enough.
[24 Mar 2009 10:58] D Peters
a run log (one way to reproduce the problem)

Attachment: run-2.log (application/octet-stream, text), 10.03 KiB.

[24 Mar 2009 11:12] D Peters
I don't know how you tried to run my script, but whatever method you used was incorrect.  My script runs just fine both in Linux and OSX, but the script isn't the issue.

I attached a log of all of the commands I ran from the command line.  When executed correctly, my script will generate the following SQL statements to populate the database:
create database housing;
use housing;
DROP TABLE IF EXISTS `stats`;
SET @saved_cs_client     = @@character_set_client;
SET character_set_client = utf8;
CREATE TABLE `stats` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `proj_num` char(8) default NULL,
  `proj_name` char(30) default NULL,
  `proj_str` char(30) default NULL,
  `prop_city` char(19) default NULL,
  `prop_state` char(2) default NULL,
  `proj_zip` char(5) default NULL,
  `units` decimal(19, 5) default NULL,
  `ini_end` date default NULL,
  `fin_end` date default NULL,
  `mort_amt` decimal(19, 5) default NULL,
  `fstpay_dt` date default NULL,
  `matur_dt` date default NULL,
  `term_mo` char(3) default NULL,
  `int_rate` decimal(19, 5) default NULL,
  `curr_pi` decimal(19, 5) default NULL,
  `amort_upb` decimal(19, 5) default NULL,
  `hold_name` char(30) default NULL,
  `hold_city` char(19) default NULL,
  `hold_state` char(2) default NULL,
  `serv_name` char(30) default NULL,
  `serv_city` char(19) default NULL,
  `serv_state` char(2) default NULL,
  `soa` char(3) default NULL,
  `soa_sect` char(100) default NULL,
  `te` char(254) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM AUTO_INCREMENT=121 DEFAULT CHARSET=latin1;
SET character_set_client = @saved_cs_client;
alter table stats AUTO_INCREMENT=1;
load data local infile '/tmp/.F47D.csv' into table stats fields terminated by ',' lines terminated by '\n' (proj_num, proj_name, proj_str, prop_city, prop_state, proj_zip, units, ini_end, fin_end, mort_amt, fstpay_dt, matur_dt, term_mo, int_rate, curr_pi, amort_upb, hold_name, hold_city, hold_state, serv_name, serv_city, serv_state, soa, soa_sect, te);

So please skip the script.  The issue is that the 'load data . . .' statement intermitantly drops some characters.
[24 Mar 2009 12:24] Sveta Smirnova
Thank you for the feedback.

You have spaces after commas in the result file: "...Rockville, MD, 20850, 1..." So when mysql tries to load MD into prop_state field it takes string " MD", then truncates it to " M". Same with proj_zip and other problem columns.
[24 Mar 2009 15:30] D Peters
Sorry about that, . . . I pasted the wrong load statement.

Since I haven't looked at this issue in a while, I also forgot to include the following command (to be run only when manually running the SQL statements):
sed '1 d' /tmp/F47D.csv > F47D-new.csv

NOTE:  I ran that command to remove the first line of the CSV file (which contain the field names).

Here's the correct load statement (assuming $PWD is /tmp/myTest/):
load data local infile '/tmp/myTest/F47D-new.csv' into table stats fields terminated by ',' lines terminated by '\n' (proj_num, proj_name, proj_str, prop_city, prop_state, proj_zip, units, ini_end, fin_end, mort_amt, fstpay_dt, matur_dt, term_mo, int_rate, curr_pi, amort_upb, hold_name, hold_city, hold_state, serv_name, serv_city, serv_state, soa, soa_sect, te);

Then compare the results of the following query and statement:
mysql> select * from stats limit 0,5;
Macintosh:myTest dp2$ head -n 5 /tmp/myTest/F47D-new.csv

In row 1, you'll see the following:
proj_num is '0001001, and it should be '00010010'
proj_name is 'COLLINGSWOOD NURSING HOME', and that's correct
proj_str is '299 HURLEY AVENUE', and that's correct
...
proj_state is 'M, and it should be 'MD'
...

The point is that the load statement seems to intermittently drop chars, or it's possibly incorrectly counting the quotes (') as part of the actual value of the field.  I suspect it might be the latter.
[24 Mar 2009 18:19] Sveta Smirnova
Thank you for the feedback, but sed '1 d' does not remove spaces between comma and character you want to insert into table, so it does not fix issue. Additionally you can not say fields terminated by ', ' (2 characters), but only by ',' or any other single character in the load data statement. So this is not a bug.
[24 Mar 2009 22:17] D Peters
So, you're basically telling me that the following behavior--which I find to be rather surprising--is correct?  I compare the results of the load statement with the results of an insert statement (ran 2 ways:  with ini_end specified as mm/dd/yy, and with it specified as 'yyyy-mm-dd' [not the format oocalc uses for dates in its CSVs]).

Please keep in mind that there are no unquoted spaces here.  Stated another way, any spaces present are a part of a string.  oocalc doesn't generate anything for NULL values (hence you'll see "... '20850',160,05/21/08,, ..." where fin_end is null).  Incidentally, I also noticed that oocalc specifies dates in the CSV in a format (mm/dd/yy) that the load statement ignores.

knoppix@Knoppix:~/mySQL-issue-2008.11.07-mod$ sed '1 d' F47D.csv | head -n 1
'00010010','COLLINSWOOD NURSING HOME','299 HURLEY AVENUE','ROCKVILLE','MD','20850',160,05/21/08,,482700,02/01/09,01/01/42,'396',6.95,3109.71,482700,'WELLS FARGO BANK NA','MC LEAN','VA','WELLS FARGO BANK NA','MC LEAN','VA','ZSQ','241(a)/ 232 /Improvements & Additions / Nursing Homes',
knoppix@Knoppix:~/mySQL-issue-2008.11.07-mod$ sed '1 d' F47D.csv | head -n 1 >> F47D-line1.csv; pwd
/home/knoppix/mySQL-issue-2008.11.07-mod
knoppix@Knoppix:~/mySQL-issue-2008.11.07-mod$ less _my.sql 
knoppix@Knoppix:~/mySQL-issue-2008.11.07-mod$ fg
mysql -u root

mysql> create database housing;
Query OK, 1 row affected (0.00 sec)

mysql> use housing;
Database changed
mysql> DROP TABLE IF EXISTS `stats`;
Query OK, 0 rows affected, 1 warning (0.00 sec)

mysql> SET @saved_cs_client     = @@character_set_client;
Query OK, 0 rows affected (0.00 sec)

mysql> SET character_set_client = utf8;
Query OK, 0 rows affected (0.00 sec)

mysql> CREATE TABLE `stats` (
    ->   `id` int(11) unsigned NOT NULL auto_increment,
    ->   `proj_num` char(8) default NULL,
    ->   `proj_name` char(30) default NULL,
    ->   `proj_str` char(30) default NULL,
    ->   `prop_city` char(19) default NULL,
    ->   `prop_state` char(2) default NULL,
    ->   `proj_zip` char(5) default NULL,
    ->   `units` decimal(19, 5) default NULL,
    ->   `ini_end` date default NULL,
    ->   `fin_end` date default NULL,
    ->   `mort_amt` decimal(19, 5) default NULL,
    ->   `fstpay_dt` date default NULL,
    ->   `matur_dt` date default NULL,
    ->   `term_mo` char(3) default NULL,
    ->   `int_rate` decimal(19, 5) default NULL,
    ->   `curr_pi` decimal(19, 5) default NULL,
    ->   `amort_upb` decimal(19, 5) default NULL,
    ->   `hold_name` char(30) default NULL,
    ->   `hold_city` char(19) default NULL,
    ->   `hold_state` char(2) default NULL,
    ->   `serv_name` char(30) default NULL,
    ->   `serv_city` char(19) default NULL,
    ->   `serv_state` char(2) default NULL,
    ->   `soa` char(3) default NULL,
    ->   `soa_sect` char(100) default NULL,
    ->   `te` char(254) default NULL,
    ->   PRIMARY KEY  (`id`)
    -> ) ENGINE=MyISAM AUTO_INCREMENT=121 DEFAULT CHARSET=latin1;
Query OK, 0 rows affected (0.00 sec)

mysql> SET character_set_client = @saved_cs_client;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table stats AUTO_INCREMENT=1;
Query OK, 0 rows affected (0.01 sec)
Records: 0  Duplicates: 0  Warnings: 0

mysql> load data local infile '/home/knoppix/mySQL-issue-2008.11.07-mod/F47D-line1.csv' into table stats fields terminated by ',' lines terminated by '\n' (proj_num, proj_name, proj_str, prop_city, prop_state, proj_zip, units, ini_end, fin_end, mort_amt, fstpay_dt, matur_dt, term_mo, int_rate, curr_pi, amort_upb, hold_name, hold_city, hold_state, serv_name, serv_city, serv_state, soa, soa_sect, te);
Query OK, 1 row affected, 10 warnings (0.00 sec)
Records: 1  Deleted: 0  Skipped: 0  Warnings: 10

mysql> insert into stats(proj_num, proj_name, proj_str, prop_city, prop_state, proj_zip, units, ini_end, fin_end, mort_amt, fstpay_dt, matur_dt, term_mo, int_rate, curr_pi, amort_upb, hold_name, hold_city, hold_state, serv_name, serv_city, serv_state, soa, soa_sect, te) values('12345678','COLLINSWOOD NURSING HOME','299 HURLEY AVENUE','ROCKVILLE','MD','20850',160,05/21/08,NULL,482700,02/01/09,01/01/42,'396',6.95,3109.71,482700,'WELLS FARGO BANK NA','MC LEAN','VA','WELLS FARGO BANK NA','MC LEAN','VA','ZSQ','241(a)/ 232 /Improvements & Additions / Nursing Homes',NULL);
Query OK, 1 row affected (0.00 sec)

mysql> insert into stats(proj_num, proj_name, proj_str, prop_city, prop_state, proj_zip, units, ini_end, fin_end, mort_amt, fstpay_dt, matur_dt, term_mo, int_rate, curr_pi, amort_upb, hold_name, hold_city, hold_state, serv_name, serv_city, serv_state, soa, soa_sect, te) values('98765432','COLLINSWOOD NURSING HOME','299 HURLEY AVENUE','ROCKVILLE','MD','20850',160,'2008-05-21',NULL,482700,02/01/09,01/01/42,'396',6.95,3109.71,482700,'WELLS FARGO BANK NA','MC LEAN','VA','WELLS FARGO BANK NA','MC LEAN','VA','ZSQ','241(a)/ 232 /Improvements & Additions / Nursing Homes',NULL);
Query OK, 1 row affected (0.00 sec)

mysql> select * from stats where proj_name like '%COLLINSWOOD NURSING HOME%';
+----+----------+----------------------------+---------------------+-------------+------------+----------+-----------+------------+------------+--------------+------------+------------+---------+----------+------------+--------------+-----------------------+-----------+------------+-----------------------+-----------+------------+------+---------------------------------------------------------+------+
| id | proj_num | proj_name                  | proj_str            | prop_city   | prop_state | proj_zip | units     | ini_end    | fin_end    | mort_amt     | fstpay_dt  | matur_dt   | term_mo | int_rate | curr_pi    | amort_upb    | hold_name             | hold_city | hold_state | serv_name             | serv_city | serv_state | soa  | soa_sect                                                | te   |
+----+----------+----------------------------+---------------------+-------------+------------+----------+-----------+------------+------------+--------------+------------+------------+---------+----------+------------+--------------+-----------------------+-----------+------------+-----------------------+-----------+------------+------+---------------------------------------------------------+------+
|  1 | '0001001 | 'COLLINSWOOD NURSING HOME' | '299 HURLEY AVENUE' | 'ROCKVILLE' | 'M         | '2085    | 160.00000 | 0000-00-00 | 0000-00-00 | 482700.00000 | 2002-01-09 | 0000-00-00 | '39     |  6.95000 | 3109.71000 | 482700.00000 | 'WELLS FARGO BANK NA' | 'MC LEAN' | 'V         | 'WELLS FARGO BANK NA' | 'MC LEAN' | 'V         | 'ZS  | '241(a)/ 232 /Improvements & Additions / Nursing Homes' |      | 
|  2 | 12345678 | COLLINSWOOD NURSING HOME   | 299 HURLEY AVENUE   | ROCKVILLE   | MD         | 20850    | 160.00000 | 0000-00-00 | NULL       | 482700.00000 | 0000-00-00 | 0000-00-00 | 396     |  6.95000 | 3109.71000 | 482700.00000 | WELLS FARGO BANK NA   | MC LEAN   | VA         | WELLS FARGO BANK NA   | MC LEAN   | VA         | ZSQ  | 241(a)/ 232 /Improvements & Additions / Nursing Homes   | NULL | 
|  3 | 98765432 | COLLINSWOOD NURSING HOME   | 299 HURLEY AVENUE   | ROCKVILLE   | MD         | 20850    | 160.00000 | 2008-05-21 | NULL       | 482700.00000 | 0000-00-00 | 0000-00-00 | 396     |  6.95000 | 3109.71000 | 482700.00000 | WELLS FARGO BANK NA   | MC LEAN   | VA         | WELLS FARGO BANK NA   | MC LEAN   | VA         | ZSQ  | 241(a)/ 232 /Improvements & Additions / Nursing Homes   | NULL | 
+----+----------+----------------------------+---------------------+-------------+------------+----------+-----------+------------+------------+--------------+------------+------------+---------+----------+------------+--------------+-----------------------+-----------+------------+-----------------------+-----------+------------+------+---------------------------------------------------------+------+
3 rows in set (0.00 sec)

mysql>