Bug #20212 load data infile does not properly load data containing accented characters
Submitted: 1 Jun 2006 19:46 Modified: 12 Jun 2009 6:42
Reporter: Philip Blignaut
Status: Not a Bug
Category:MySQL Server Severity:S2 (Serious)
Version:5.0.18-nt-max OS:Windows (Windows 2003)
Assigned to: CPU Architecture:Any

[1 Jun 2006 19:46] Philip Blignaut
Description:
When I try to run
load data local infile "text.txt" into table tabc
all fields load perfectly except fields containing accented characters such as é, ê, ë. Only the characters before the accented character are loaded into the field.
SHOW VARIABLES like 'character%';
gives the following output:
+--------------------------+---------------------------------------------------------+
| Variable_name            | Value                                                   |
+--------------------------+---------------------------------------------------------+
| character_set_client     | utf8                                                    |
| character_set_connection | utf8                                                    |
| character_set_database   | utf8                                                    |
| character_set_results    | utf8                                                    |
| character_set_server     | utf8                                                    |
| character_set_system     | utf8                                                    |
| character_sets_dir       | C:\Program Files\MySQL\MySQL Server 5.0\share\charsets\ |
+--------------------------+---------------------------------------------------------+
The file that I am trying to load is a CSV-like file, with fields terminated by \t and lines terminated by \n (not \r\n, I checked).

How to repeat:
Create a utf8 database. Create a utf8 table containing a few varchar fields. Create a text file using an editor that understands the difference between \n and \r\n. Type a line or two of text, separate the fields with tab characters, and include one or more accented characters in the text. Do a load data [local] infile. Now select all rows in the table. You will notice that fields that had accented characters in the text file have been truncated.
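One plausible cause of the truncation, consistent with later comments in this thread but not confirmed for the reporter's own file, is that the editor saved the text in a single-byte encoding rather than UTF-8. The following minimal Python sketch checks a byte string for valid UTF-8 and reports where the first invalid byte sits. The function name `utf8_valid_prefix` and the sample text are hypothetical and used only for illustration.

```python
from typing import Optional, Tuple


def utf8_valid_prefix(data: bytes) -> Tuple[str, Optional[int]]:
    """Return the longest valid UTF-8 prefix and the offset of the first bad byte.

    The offset is None when the whole input is valid UTF-8.
    """
    try:
        return data.decode("utf-8"), None
    except UnicodeDecodeError as err:
        # err.start is the index of the first byte that could not be decoded.
        return data[:err.start].decode("utf-8"), err.start


# Hypothetical field saved by an editor that writes Latin-1 instead of UTF-8.
latin1_field = "café au lait".encode("latin-1")
print(utf8_valid_prefix(latin1_field))

# The same text saved as real UTF-8 decodes completely.
utf8_field = "café au lait".encode("utf-8")
print(utf8_valid_prefix(utf8_field))
```

If the reported offset coincides with the position of the first accented character, the file is probably not UTF-8, and the field would stop being valid utf8 at exactly the point where the truncation was observed.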

Suggested fix:
Allow accented characters when doing load data infile
[2 Jun 2006 8:28] Valeriy Kravchuk
Thank you for the problem report. Please upload the file you are loading, the exact CREATE TABLE statement, and the LOAD DATA statement you used.

Please take a look at bug #14477. It looks similar to me.
[2 Jun 2006 16:52] Philip Blignaut
Create table syntax:
CREATE TABLE `txxx` (
  `xxid` int(11) NOT NULL default '0',
  `xxtext` varchar(64) NOT NULL default '',
  PRIMARY KEY  (`xxid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf1;
Load data syntax:
load data infile 'xxx.txt' into table txxx
[2 Jun 2006 16:53] Philip Blignaut
Data file used in Load data

Attachment: xxx.txt (text/plain), 7 bytes.

[2 Jun 2006 16:55] Philip Blignaut
Oops! Change utf1 to utf8 in create table syntax
[2 Jun 2006 17:54] MySQL Verification Team
I tested your test case with the released 5.0.22 server without any problem.
Could you please try with that server? Thanks in advance.
[2 Jun 2006 21:28] Philip Blignaut
Since I reported this issue I have tried just about all of the different charsets. Some load all the data into the fields but replace accented characters with a question mark (?).
As for trying it on 5.0.22, I'm afraid I will have to convince my service provider to upgrade.  Unfortunately I have no control over the version of MySQL that they provide me with.
I have found a workaround. When I create the files for upload, I replace all the accented characters (including the '!' character) with a '!' and two or more hex characters that represent the accented character. After the load data infile, I then do update tablename set fieldname=replace(replace(replace...(fieldname,'!xx','<acc char>'),'!xx','<acc char>')...), fieldname=..., fieldname=... for each field on each table. I was afraid that this would make my script time out on the SQL, but MySQL is fast enough to do this on 45 tables with an average of 1000 records per table and 10 fields per table!
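The workaround above can be sketched in Python as follows. This is only an illustration under stated assumptions: the reporter did not say which hex digits he used, so the sketch assumes the hex of each character's UTF-8 bytes, and the names `escape_field` and `build_update` are hypothetical. The escaped file is pure ASCII, so it no longer depends on the character set used while loading.

```python
from typing import List


def escape_field(text: str) -> str:
    """Replace '!' and every non-ASCII character with '!' plus hex digits.

    Assumption: the hex digits are the character's UTF-8 bytes.
    """
    out = []
    for ch in text:
        if ch == "!" or ord(ch) > 127:
            out.append("!" + ch.encode("utf-8").hex())
        else:
            out.append(ch)
    return "".join(out)


def build_update(table: str, column: str, chars: List[str]) -> str:
    """Build the nested REPLACE() statement that undoes escape_field()."""
    expr = f"`{column}`"
    # '!' must be restored last (outermost REPLACE); restoring it earlier
    # could create new '!xx' sequences that a later REPLACE would misread.
    for ch in sorted(set(chars) - {"!"}) + ["!"]:
        code = "!" + ch.encode("utf-8").hex()
        expr = f"REPLACE({expr}, '{code}', '{ch}')"
    return f"UPDATE `{table}` SET `{column}` = {expr};"


print(escape_field("café!"))
print(build_update("txxx", "xxtext", ["é", "ê"]))
```

Note that the generated UPDATE statement itself contains accented characters, so it still has to be sent over a connection whose character set is declared correctly.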
Even though I think this is still a bug, you are welcome to close it.  My workaround will last me until my service provider does an upgrade.

Kind regards,
Philip
[3 Jun 2006 11:12] Valeriy Kravchuk
I'll put this report in the "Need Feedback" state then. Please reopen it if you get a similar problem with 5.0.22.
[30 Jun 2006 22:20] Dominique BOYER
Same problem, with a different result:

The error depends on the table charset:
in utf-8:
Data too long for column
in cp1250:
special chars like é, à ... are replaced by a ?
[30 Jun 2006 22:21] Dominique BOYER
Oops, I forgot to say that this is on version 5.0.22.
[1 Jul 2006 7:59] Valeriy Kravchuk
Dominique,

What do you mean by "same problem"? Same test case? Please send the exact CREATE TABLE statement and the data to load, and specify the exact character sets used.
[2 Jul 2006 8:00] Dominique BOYER
Same test case.

Results of my tests:
Problem on 5.0.18 (data truncated at the position of the special char éèù ...)
No problem with 5.0.20.
Problem with 5.0.22 (Data too long for column with text containing special chars éèù ...)

With these parameters:
auto_increment_increment	1
auto_increment_offset	1
automatic_sp_privileges	ON
back_log	50
basedir	I:\Program Files\MySQL\MySQL Server 5.0\
binlog_cache_size	32768
bulk_insert_buffer_size	8388608
character_set_client	utf8
character_set_connection	utf8
character_set_database	utf8
character_set_filesystem	binary
character_set_results	utf8
character_set_server	utf8
character_set_system	utf8
character_sets_dir	I:\Program Files\MySQL\MySQL Server 5.0\share\charsets\
collation_connection	utf8_general_ci
collation_database	utf8_general_ci
collation_server	utf8_general_ci
completion_type	0
concurrent_insert	1
connect_timeout	5
datadir	I:\Program Files\MySQL\MySQL Server 5.0\Data\
date_format	%Y-%m-%d
datetime_format	%Y-%m-%d %H:%i:%s
default_week_format	0
delay_key_write	ON
delayed_insert_limit	100
delayed_insert_timeout	300
delayed_queue_size	1000
div_precision_increment	4
engine_condition_pushdown	OFF
expire_logs_days	0
flush	OFF
flush_time	1800
ft_boolean_syntax	+ -><()~*:""&|
ft_max_word_len	84
ft_min_word_len	4
ft_query_expansion_limit	20
ft_stopword_file	(built-in)
group_concat_max_len	1024
have_archive	YES
have_bdb	NO
have_blackhole_engine	NO
have_compress	YES
have_crypt	NO
have_csv	NO
have_example_engine	NO
have_federated_engine	NO
have_geometry	YES
have_innodb	YES
have_isam	NO
have_ndbcluster	NO
have_openssl	DISABLED
have_query_cache	YES
have_raid	NO
have_rtree_keys	YES
have_symlink	YES
init_connect	
init_file	
init_slave	
innodb_additional_mem_pool_size	6291456
innodb_autoextend_increment	8
innodb_buffer_pool_awe_mem_mb	0
innodb_buffer_pool_size	290455552
innodb_checksums	ON
innodb_commit_concurrency	0
innodb_concurrency_tickets	500
innodb_data_file_path	ibdata1:10M:autoextend
innodb_data_home_dir	
innodb_doublewrite	ON
innodb_fast_shutdown	1
innodb_file_io_threads	4
innodb_file_per_table	OFF
innodb_flush_log_at_trx_commit	1
innodb_flush_method	
innodb_force_recovery	0
innodb_lock_wait_timeout	50
innodb_locks_unsafe_for_binlog	OFF
innodb_log_arch_dir	
innodb_log_archive	OFF
innodb_log_buffer_size	3145728
innodb_log_file_size	58720256
innodb_log_files_in_group	2
innodb_log_group_home_dir	.\
innodb_max_dirty_pages_pct	90
innodb_max_purge_lag	0
innodb_mirrored_log_groups	1
innodb_open_files	300
innodb_support_xa	ON
innodb_sync_spin_loops	20
innodb_table_locks	ON
innodb_thread_concurrency	8
innodb_thread_sleep_delay	10000
interactive_timeout	28800
join_buffer_size	131072
key_buffer_size	8388608
key_cache_age_threshold	300
key_cache_block_size	1024
key_cache_division_limit	100
language	I:\Program Files\MySQL\MySQL Server 5.0\share\english\
large_files_support	ON
large_page_size	0
large_pages	OFF
license	GPL
local_infile	ON
log	OFF
log_bin	OFF
log_bin_trust_function_creators	OFF
log_error	.\domz.err
log_slave_updates	OFF
log_slow_queries	OFF
log_warnings	1
long_query_time	10
low_priority_updates	OFF
lower_case_file_system	OFF
lower_case_table_names	1
max_allowed_packet	1048576
max_binlog_cache_size	4294967295
max_binlog_size	1073741824
max_connect_errors	10
max_connections	100
max_delayed_threads	20
max_error_count	64
max_heap_table_size	16777216
max_insert_delayed_threads	20
max_join_size	4294967295
max_length_for_sort_data	1024
max_prepared_stmt_count	16382
max_relay_log_size	0
max_seeks_for_key	4294967295
max_sort_length	1024
max_sp_recursion_depth	0
max_tmp_tables	32
max_user_connections	0
max_write_lock_count	4294967295
multi_range_count	256
myisam_data_pointer_size	6
myisam_max_sort_file_size	107374182400
myisam_recover_options	OFF
myisam_repair_threads	1
myisam_sort_buffer_size	105906176
myisam_stats_method	nulls_unequal
named_pipe	OFF
net_buffer_length	16384
net_read_timeout	30
net_retry_count	10
net_write_timeout	60
new	OFF
old_passwords	OFF
open_files_limit	622
optimizer_prune_level	1
optimizer_search_depth	62
pid_file	I:\Program Files\MySQL\MySQL Server 5.0\Data\domz.pid
prepared_stmt_count	0
port	3306
preload_buffer_size	32768
protocol_version	10
query_alloc_block_size	8192
query_cache_limit	1048576
query_cache_min_res_unit	4096
query_cache_size	50331648
query_cache_type	ON
query_cache_wlock_invalidate	OFF
query_prealloc_size	8192
range_alloc_block_size	2048
read_buffer_size	61440
read_only	OFF
read_rnd_buffer_size	258048
relay_log_purge	ON
relay_log_space_limit	0
rpl_recovery_rank	0
secure_auth	OFF
shared_memory	OFF
shared_memory_base_name	MYSQL
server_id	0
skip_external_locking	ON
skip_networking	OFF
skip_show_database	OFF
slave_compressed_protocol	OFF
slave_load_tmpdir	C:\WINDOWS\TEMP\
slave_net_timeout	3600
slave_skip_errors	OFF
slave_transaction_retries	10
slow_launch_time	2
sort_buffer_size	262136
sql_mode	STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
sql_notes	ON
sql_warnings	ON
storage_engine	InnoDB
sync_binlog	0
sync_frm	ON
system_time_zone	Paris, Madrid (heure d'
table_cache	256
table_lock_wait_timeout	50
table_type	InnoDB
thread_cache_size	8
thread_stack	196608
time_format	%H:%i:%s
time_zone	SYSTEM
timed_mutexes	OFF
tmp_table_size	53477376
tmpdir	
transaction_alloc_block_size	8192
transaction_prealloc_size	4096
tx_isolation	REPEATABLE-READ
updatable_views_with_limit	YES
version	5.0.22-community-nt
version_comment	MySQL Community Edition (GPL)
version_compile_machine	ia32
version_compile_os	Win32
wait_timeout	28800
[3 Jul 2006 23:00] Bugs System
No feedback was provided for this bug for over a month, so it is
being suspended automatically. If you are able to provide the
information that was originally requested, please do so and change
the status of the bug back to "Open".
[11 Jun 2009 13:36] Eugen Ostrowski
I ran into the same problem with German umlauts, i.e. öäüß.
I found out that it's a problem with UTF8 conversion of CP1250 character sets. MySQL's LOAD DATA didn't support character set conversion, but INSERT does. I think the issue occurs on Windows machines, which use CP1250 by default. There are two workarounds, and I tested both: use a character set converter (http://www.heise.de/software/download/character_set_converter/41848) before uploading the text file (a UTF8 text file has the desired behaviour), or use INSERT INTO [table] (...);
I don't think that the issue is a bug. It's a documentation error.
Please add some hints to the description of LOAD DATA!
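The first workaround, converting the text file to UTF-8 before uploading it, does not need a dedicated tool. A minimal Python sketch follows. The function name `convert_to_utf8` and the file names are hypothetical, and the source encoding is a parameter that has to match whatever the file was really saved in; `cp1250` is used here only because it is the code page named in this comment.

```python
import os
import tempfile


def convert_to_utf8(src_path: str, dst_path: str, source_encoding: str = "cp1250") -> None:
    """Re-encode a text file to UTF-8, leaving tabs and line endings untouched."""
    # newline="" disables newline translation, so "\n" stays "\n" on Windows too.
    with open(src_path, "r", encoding=source_encoding, newline="") as src, \
            open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(line)


# Demo with a throwaway file standing in for the real export.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "export_cp1250.txt")
    dst = os.path.join(tmp, "export_utf8.txt")
    with open(src, "wb") as f:
        f.write("1\tGrüße\n".encode("cp1250"))
    convert_to_utf8(src, dst)
    with open(dst, "rb") as f:
        converted = f.read()
    print(converted)
```

The converted file can then be loaded into a utf8 table in the usual way.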
[12 Jun 2009 6:42] Susanne Ebrecht
Many thanks for your feedback.

This is not a bug. It is just a wrong client character set setting.

Let me explain it:

The client-server communication is not so different from communication between two real persons when one person has native language x and the other y. Let us just say one person speaks German and the second person speaks English.

First of all you have to figure out a common language that you will use.

That is the same between server and client. The server needs to know which language the client is speaking.

When you use the MySQL CLI, it uses the language that the terminal is using. On a German Windows that is code page 850; on Linux today it is usually utf8.

First of all you have to tell the server which language your client is using. Therefore, in MySQL you need to say:

SET NAMES <encoding_of_terminal>;

That means on a German Windows:

SET NAMES CP850;

and on a utf8 Linux:

SET NAMES UTF8;

When you upload a file then you need to set this to the encoding of your file.

You have to figure out, which encoding is used for storing the file.

If your file is stored as ISO-8859-15 then you need to use:

SET NAMES latin1;

before you use LOAD DATA.

Otherwise the system is not able to convert your characters correctly, because it cannot guess the encoding of the file.
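The point that an encoding cannot be guessed from the bytes alone can be illustrated outside MySQL with a short Python sketch. It makes no claim about how LOAD DATA itself treats the file. It only shows that the same bytes give different text depending on the encoding that is declared for them. The helper name `interpret` and the sample word are hypothetical.

```python
def interpret(raw: bytes, declared: str) -> str:
    """Decode bytes under a declared encoding, marking undecodable bytes with U+FFFD."""
    return raw.decode(declared, errors="replace")


# Bytes as an editor would store "Grüße" in Latin-1.
raw = "Grüße".encode("latin-1")

for declared in ("latin-1", "cp850", "utf-8"):
    # Only the declaration that matches the real encoding gives the original text.
    print(declared, repr(interpret(raw, declared)))
```

Both `latin-1` and `cp850` accept these bytes without an error, so nothing in the data itself signals which declaration is the right one.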

In real life, when you have a text and don't know whether it is English or German, you will already fail with this: "die server". In German it just means "the servers", and in English it means the server should die. A very different sense.
[20 Jun 2009 2:27] Eugen Ostrowski
@ Susanne Ebrecht
Well, I agree: this thread is not about a bug, but not for the reasons you have stated.

First of all: there is no conversion between code pages when you use LOAD DATA, whether you use SET NAMES or not. This behaviour is highly desirable when you import data. For example, if you migrate a legacy DB and put data into new fields, transformation would not be welcome, because you want the same bits in the fields of the new DB.

Further, there is a flaw concerning Mickysoft code pages. MS-DOS uses CP850 (http://www.gymel.com/charsets/CP850.html), but MS Windows uses CP1250 (http://www.gymel.com/charsets/CP1250.html), with minor differences, especially regarding special characters. So one has to be careful when migrating to utf8.

The INSERT statement acts differently. Here transformation occurs depending on the SET NAMES statement.

So the issue of this thread is about documentation and not behaviour.