Bug #20212 | load data infile does not properly load data containing accented characters | ||
---|---|---|---|
Submitted: | 1 Jun 2006 19:46 | Modified: | 12 Jun 2009 6:42 |
Reporter: | Philip Blignaut | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | MySQL Server | Severity: | S2 (Serious) |
Version: | 5.0.18-nt-max | OS: | Windows (Windows 2003) |
Assigned to: | | CPU Architecture: | Any |
[1 Jun 2006 19:46]
Philip Blignaut
[2 Jun 2006 8:28]
Valeriy Kravchuk
Thank you for the problem report. Please upload the file you are loading, together with the exact CREATE TABLE and LOAD DATA statements you used. Please also take a look at bug #14477; it looks similar to me.
[2 Jun 2006 16:52]
Philip Blignaut
Create table syntax:
CREATE TABLE `txxx` (
  `xxid` int(11) NOT NULL default '0',
  `xxtext` varchar(64) NOT NULL default '',
  PRIMARY KEY (`xxid`)
) ENGINE=MyISAM DEFAULT CHARSET=utf1;
Load data syntax:
load data infile 'xxx.txt' into table txxx
[2 Jun 2006 16:53]
Philip Blignaut
Data file used in Load data
Attachment: xxx.txt (text/plain), 7 bytes.
[2 Jun 2006 16:55]
Philip Blignaut
Oops! Change utf1 to utf8 in create table syntax
[2 Jun 2006 17:54]
MySQL Verification Team
I ran your test case against the released 5.0.22 server without any problem. Could you please try with that version? Thanks in advance.
[2 Jun 2006 21:28]
Philip Blignaut
Since I reported this issue I have tried just about all of the different charsets. Some load all the data into the fields but replace accented characters with a question mark (?). As for trying it on 5.0.22, I'm afraid I will have to convince my service provider to upgrade. Unfortunately I have no control over the version of MySQL that they provide me with.
I have found a workaround. When I create the files for upload, I replace every accented character (and the '!' character itself) with a '!' followed by two or more hex characters that represent the accented character. After the load data infile, I then run
update tablename set fieldname=replace(replace(replace...(fieldname,'!xx','<acc char>'),'!xx','<acc char>')....), fieldname=..., fieldname=...
for each field on each table. I was afraid that this would make my script time out on the SQL, but MySQL is fast enough to do this on 45 tables with an average of 1000 records per table and 10 fields per table!
Even though I think this is still a bug, you are welcome to close it. My workaround will last me until my service provider does an upgrade.
Kind regards, Philip
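
The escaping scheme described above can be sketched roughly as follows. This is only an illustration in Python, not the reporter's actual script: the function names are hypothetical, and it assumes every escaped character fits in two hex digits (code points up to 0xFF), which keeps the escape sequences unambiguous. The '!' escape is undone last so that a restored '!' cannot be mistaken for the start of another escape sequence.

```python
ESCAPE = "!"  # escape marker mentioned in the workaround


def escape_text(text: str) -> str:
    """Replace '!' and every non-ASCII character with '!' plus two hex digits."""
    out = []
    for ch in text:
        if ch == ESCAPE or ord(ch) > 127:
            if ord(ch) > 0xFF:
                # Assumption of this sketch: only two-digit codes are supported.
                raise ValueError(f"character {ch!r} does not fit in two hex digits")
            out.append(f"{ESCAPE}{ord(ch):02x}")
        else:
            out.append(ch)
    return "".join(out)


def unescape_text(text: str, chars: str) -> str:
    """Mirror the nested REPLACE() calls: one replacement per known character."""
    for ch in chars:
        if ch != ESCAPE:
            text = text.replace(f"{ESCAPE}{ord(ch):02x}", ch)
    # Undo the escaped '!' last, so it never forms a new escape sequence.
    return text.replace(f"{ESCAPE}{ord(ESCAPE):02x}", ESCAPE)


if __name__ == "__main__":
    escaped = escape_text("café!")
    print(escaped)
    print(unescape_text(escaped, "é"))
```

The escaped file contains only ASCII bytes, so it loads identically whatever character set the server assumes for the file; the cost is one extra UPDATE pass per table.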
[3 Jun 2006 11:12]
Valeriy Kravchuk
I'll put this report in the "Need Feedback" state then. Please reopen it if you get a similar problem with 5.0.22.
[30 Jun 2006 22:20]
Dominique BOYER
Same problem, with a different result. The error depends on the table charset:
in utf-8: "Data too long for column"
in cp1250: special chars like é, à ... are replaced by a ?
[30 Jun 2006 22:21]
Dominique BOYER
Oops, I forgot to say that this is on version 5.0.22.
[1 Jul 2006 7:59]
Valeriy Kravchuk
Dominique, what do you mean by "same problem"? The same test case? Please send the exact CREATE TABLE statement and the data to load, and specify the exact character sets used.
[2 Jul 2006 8:00]
Dominique BOYER
Same test case. Results of my tests:
5.0.18: problem (data truncated at the position of the special char éèù ...)
5.0.20: no problem
5.0.22: problem ("Data too long for column" for text containing special chars éèù ...)
Server variables on 5.0.22:
auto_increment_increment 1
auto_increment_offset 1
automatic_sp_privileges ON
back_log 50
basedir I:\Program Files\MySQL\MySQL Server 5.0\
binlog_cache_size 32768
bulk_insert_buffer_size 8388608
character_set_client utf8
character_set_connection utf8
character_set_database utf8
character_set_filesystem binary
character_set_results utf8
character_set_server utf8
character_set_system utf8
character_sets_dir I:\Program Files\MySQL\MySQL Server 5.0\share\charsets\
collation_connection utf8_general_ci
collation_database utf8_general_ci
collation_server utf8_general_ci
completion_type 0
concurrent_insert 1
connect_timeout 5
datadir I:\Program Files\MySQL\MySQL Server 5.0\Data\
date_format %Y-%m-%d
datetime_format %Y-%m-%d %H:%i:%s
default_week_format 0
delay_key_write ON
delayed_insert_limit 100
delayed_insert_timeout 300
delayed_queue_size 1000
div_precision_increment 4
engine_condition_pushdown OFF
expire_logs_days 0
flush OFF
flush_time 1800
ft_boolean_syntax + -><()~*:""&|
ft_max_word_len 84
ft_min_word_len 4
ft_query_expansion_limit 20
ft_stopword_file (built-in)
group_concat_max_len 1024
have_archive YES
have_bdb NO
have_blackhole_engine NO
have_compress YES
have_crypt NO
have_csv NO
have_example_engine NO
have_federated_engine NO
have_geometry YES
have_innodb YES
have_isam NO
have_ndbcluster NO
have_openssl DISABLED
have_query_cache YES
have_raid NO
have_rtree_keys YES
have_symlink YES
init_connect
init_file
init_slave
innodb_additional_mem_pool_size 6291456
innodb_autoextend_increment 8
innodb_buffer_pool_awe_mem_mb 0
innodb_buffer_pool_size 290455552
innodb_checksums ON
innodb_commit_concurrency 0
innodb_concurrency_tickets 500
innodb_data_file_path ibdata1:10M:autoextend
innodb_data_home_dir
innodb_doublewrite ON
innodb_fast_shutdown 1
innodb_file_io_threads 4
innodb_file_per_table OFF
innodb_flush_log_at_trx_commit 1
innodb_flush_method
innodb_force_recovery 0
innodb_lock_wait_timeout 50
innodb_locks_unsafe_for_binlog OFF
innodb_log_arch_dir
innodb_log_archive OFF
innodb_log_buffer_size 3145728
innodb_log_file_size 58720256
innodb_log_files_in_group 2
innodb_log_group_home_dir .\
innodb_max_dirty_pages_pct 90
innodb_max_purge_lag 0
innodb_mirrored_log_groups 1
innodb_open_files 300
innodb_support_xa ON
innodb_sync_spin_loops 20
innodb_table_locks ON
innodb_thread_concurrency 8
innodb_thread_sleep_delay 10000
interactive_timeout 28800
join_buffer_size 131072
key_buffer_size 8388608
key_cache_age_threshold 300
key_cache_block_size 1024
key_cache_division_limit 100
language I:\Program Files\MySQL\MySQL Server 5.0\share\english\
large_files_support ON
large_page_size 0
large_pages OFF
license GPL
local_infile ON
log OFF
log_bin OFF
log_bin_trust_function_creators OFF
log_error .\domz.err
log_slave_updates OFF
log_slow_queries OFF
log_warnings 1
long_query_time 10
low_priority_updates OFF
lower_case_file_system OFF
lower_case_table_names 1
max_allowed_packet 1048576
max_binlog_cache_size 4294967295
max_binlog_size 1073741824
max_connect_errors 10
max_connections 100
max_delayed_threads 20
max_error_count 64
max_heap_table_size 16777216
max_insert_delayed_threads 20
max_join_size 4294967295
max_length_for_sort_data 1024
max_prepared_stmt_count 16382
max_relay_log_size 0
max_seeks_for_key 4294967295
max_sort_length 1024
max_sp_recursion_depth 0
max_tmp_tables 32
max_user_connections 0
max_write_lock_count 4294967295
multi_range_count 256
myisam_data_pointer_size 6
myisam_max_sort_file_size 107374182400
myisam_recover_options OFF
myisam_repair_threads 1
myisam_sort_buffer_size 105906176
myisam_stats_method nulls_unequal
named_pipe OFF
net_buffer_length 16384
net_read_timeout 30
net_retry_count 10
net_write_timeout 60
new OFF
old_passwords OFF
open_files_limit 622
optimizer_prune_level 1
optimizer_search_depth 62
pid_file I:\Program Files\MySQL\MySQL Server 5.0\Data\domz.pid
prepared_stmt_count 0
port 3306
preload_buffer_size 32768
protocol_version 10
query_alloc_block_size 8192
query_cache_limit 1048576
query_cache_min_res_unit 4096
query_cache_size 50331648
query_cache_type ON
query_cache_wlock_invalidate OFF
query_prealloc_size 8192
range_alloc_block_size 2048
read_buffer_size 61440
read_only OFF
read_rnd_buffer_size 258048
relay_log_purge ON
relay_log_space_limit 0
rpl_recovery_rank 0
secure_auth OFF
shared_memory OFF
shared_memory_base_name MYSQL
server_id 0
skip_external_locking ON
skip_networking OFF
skip_show_database OFF
slave_compressed_protocol OFF
slave_load_tmpdir C:\WINDOWS\TEMP\
slave_net_timeout 3600
slave_skip_errors OFF
slave_transaction_retries 10
slow_launch_time 2
sort_buffer_size 262136
sql_mode STRICT_TRANS_TABLES,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION
sql_notes ON
sql_warnings ON
storage_engine InnoDB
sync_binlog 0
sync_frm ON
system_time_zone Paris, Madrid (heure d'
table_cache 256
table_lock_wait_timeout 50
table_type InnoDB
thread_cache_size 8
thread_stack 196608
time_format %H:%i:%s
time_zone SYSTEM
timed_mutexes OFF
tmp_table_size 53477376
tmpdir
transaction_alloc_block_size 8192
transaction_prealloc_size 4096
tx_isolation REPEATABLE-READ
updatable_views_with_limit YES
version 5.0.22-community-nt
version_comment MySQL Community Edition (GPL)
version_compile_machine ia32
version_compile_os Win32
wait_timeout 28800
[3 Jul 2006 23:00]
Bugs System
No feedback was provided for this bug for over a month, so it is being suspended automatically. If you are able to provide the information that was originally requested, please do so and change the status of the bug back to "Open".
[11 Jun 2009 13:36]
Eugen Ostrowski
I ran into the same problem with German umlauts, i.e. öäüß. I found out that it is a problem with the UTF8 conversion of CP1250 character sets. MySQL's LOAD DATA didn't support character set conversion, but INSERT does. I think the issue occurs on Windows machines, which use CP1250 by default. There are two workarounds, and I tested both: use a character set converter (http://www.heise.de/software/download/character_set_converter/41848) before uploading the text file (a UTF8 text file gives the desired behaviour), or use INSERT INTO [table] (...); I don't think this issue is a bug. It's a documentation error. Please add some hints to the description of LOAD DATA!
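
The first workaround (converting the text file to UTF8 before uploading it) does not need a dedicated tool. A minimal sketch in Python, assuming the source file really is CP1250; the function name and the file names in the usage example are hypothetical:

```python
from pathlib import Path


def convert_to_utf8(src: str, dst: str, source_encoding: str = "cp1250") -> int:
    """Re-encode a text file to UTF-8 and return the number of characters converted.

    The file is handled as raw bytes so that line endings and field
    separators pass through untouched.
    """
    text = Path(src).read_bytes().decode(source_encoding)
    Path(dst).write_bytes(text.encode("utf-8"))
    return len(text)


if __name__ == "__main__":
    # Hypothetical file names, for illustration only.
    Path("xxx_cp1250.txt").write_bytes("1\tMüller\n".encode("cp1250"))
    print(convert_to_utf8("xxx_cp1250.txt", "xxx_utf8.txt"))
```

If the declared source encoding is wrong, the conversion either raises a decode error or silently produces the wrong characters, so it is worth spot-checking a few accented values in the output before loading it.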
[12 Jun 2009 6:42]
Susanne Ebrecht
Many thanks for your feedback. This is not a bug. It is just a matter of wrong client character set settings. Let me explain.
Client-server communication is not so different from communication between two real persons, one of whom has native language x and the other y. Let us say one person speaks German and the other speaks English. First of all you have to agree on a common language to use. It is the same between server and client: the server needs to know which language the client is speaking.
The MySQL CLI uses the language that the terminal is using. On a German Windows that is code page 850; on Linux today it is usually utf8. So first you have to tell the server which language your client is using. In MySQL you therefore need to say: SET NAMES <encoding_of_terminal>; That means on a German Windows: SET NAMES CP850; and on a utf8 Linux: SET NAMES UTF8;
When you upload a file, you need to set this to the encoding of your file. You have to figure out which encoding was used to store the file. If your file is stored as ISO-8859-15 then you need to use: SET NAMES latin1; before you use LOAD DATA. Otherwise the system is not able to convert your characters correctly, because it cannot guess the encoding of the file.
In real life, when you have a text and don't know whether it is English or German, you will already fail with this: "die server". In German it just means "the servers", and in English it means the server should die. A very different meaning.
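
Why the declared encoding matters can be seen outside MySQL as well. The following Python sketch is only an analogy, not a description of the server's internals, and the helper names are hypothetical. It takes the bytes of a latin1 file containing "é" and interprets them under different declared encodings:

```python
def decode_as(raw: bytes, declared: str) -> str:
    """Interpret raw file bytes under a declared encoding.

    Bytes that are invalid in that encoding become U+FFFD, the
    replacement character.
    """
    return raw.decode(declared, errors="replace")


def first_invalid_utf8_offset(raw: bytes) -> int:
    """Return the byte offset where UTF-8 decoding fails, or -1 if it succeeds."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start
    return -1


if __name__ == "__main__":
    raw = "café au lait".encode("latin1")  # 'é' is stored as the single byte 0xE9
    print(decode_as(raw, "latin1"))        # declared encoding matches the file
    print(decode_as(raw, "utf-8"))         # 0xE9 is not valid UTF-8 at this position
    print(first_invalid_utf8_offset(raw))  # offset of the offending byte
```

The offending byte sits exactly where the accented character is, which is consistent with the reports above of data being cut off or replaced at the position of the special character.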
[20 Jun 2009 2:27]
Eugen Ostrowski
@ Susanne Ebrecht Well, I agree: this thread is not about a bug. But not for the reasons you have stated. First of all: there is no conversion between code pages if you use LOAD DATA, whether you use SET NAMES or not. This behaviour is highly desirable when you import data. For example, if you migrate a legacy DB and put data into new fields, transformation would not be welcome, because you want the same bits in the fields of the new DB. Further, there is a pitfall concerning Microsoft code pages. MS DOS uses CP850 (http://www.gymel.com/charsets/CP850.html), but MS Windows has CP1250 (http://www.gymel.com/charsets/CP1250.html), with minor differences, especially regarding special characters. So one has to be careful when migrating to utf8. The INSERT statement acts differently: here transformation occurs, depending on the SET NAMES statement. So the issue of this thread is about documentation and not behaviour.
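
The difference between the two code pages is easy to check for the umlauts mentioned earlier in this thread. A small Python sketch (the function name is hypothetical) that lists the byte each character gets in CP850 and in CP1250:

```python
UMLAUTS = "öäüß"


def byte_table(chars: str, encodings=("cp850", "cp1250")) -> dict:
    """Map each character to its hex byte value in each of the given code pages."""
    return {
        ch: {enc: ch.encode(enc).hex() for enc in encodings}
        for ch in chars
    }


if __name__ == "__main__":
    for ch, codes in byte_table(UMLAUTS).items():
        print(ch, codes)

    # Bytes written under one code page and read under the other
    # no longer give the original text.
    misread = "Müller".encode("cp1250").decode("cp850")
    print(misread)
```

All four characters map to different byte values in the two code pages, so a file written under one and declared as the other will show wrong characters rather than raise an error.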