Bug #11788 with character_set_* set to utf8 the command line client garbles german umlaute
Submitted: 6 Jul 2005 22:40 Modified: 6 Jul 2005 23:49
Reporter: Jonas Maurus Email Updates:
Status: Not a Bug Impact on me:
None 
Category:MySQL Server: Command-line Clients Severity:S3 (Non-critical)
Version:4.1.12a OS:Windows (Windows XP)
Assigned to: CPU Architecture:Any

[6 Jul 2005 22:40] Jonas Maurus
Description:
It seems that the mysql command-line client garbles characters that are not in the current OS-codepage (Windows-1252 in my case), if the database uses a different character set.

#### console dump
C:\projects\cip>mysql
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 6 to server version: 4.1.12a-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> use cip
Database changed

mysql> show variables like "char%";
+--------------------------+------------------------------------+
| Variable_name            | Value                              |
+--------------------------+------------------------------------+
| character_set_client     | latin1                             |
| character_set_connection | latin1                             |
| character_set_database   | utf8                               |
| character_set_results    | latin1                             |
| character_set_server     | utf8                               |
| character_set_system     | utf8                               |
| character_sets_dir       | C:\devstudio\mysql\share\charsets/ |
+--------------------------+------------------------------------+
7 rows in set (0.00 sec)

mysql> describe messages;
+----------------+--------------+------+-----+---------------------+----------------+
| Field          | Type         | Null | Key | Default             | Extra    |
+----------------+--------------+------+-----+---------------------+----------------+
| id             | int(11)      |      | PRI | NULL                | auto_increment |
| messageSubject | varchar(255) | YES  |     | NULL                |    |
| messageText    | text         | YES  |     | NULL                |    |
| advertised     | datetime     | YES  |     | NULL                |    |
| created        | datetime     |      |     | 0000-00-00 00:00:00 |    |
| seen           | datetime     | YES  |     | NULL                |    |
| invoice_id     | int(11)      | YES  | MUL | NULL                |    |
| customer_id    | int(11)      | YES  | MUL | NULL                |     |
+----------------+--------------+------+-----+---------------------+----------------+
8 rows in set (0.00 sec)

mysql> insert into messages values(0, "Umlauttest2", "äöüÜÖÄß éèâ àá ô", NULL, NOW(), NULL, NULL, 1);
Query OK, 1 row affected, 1 warning (0.04 sec)

mysql> set character set 'utf8';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into messages values(0, "Umlauttest3", "äöüÜÖÄß éèâ àá ô", NULL, NOW(), NULL, NULL, 1);
Query OK, 1 row affected, 1 warning (0.04 sec)

mysql> select * from messages;
+----+----------------+------------------------------------+------------+---------------------+------+------------+-------------+
| id | messageSubject | messageText                        | advertised | created             | seen | invoice_id | customer_id |
+----+----------------+------------------------------------+------------+---------------------+------+------------+-------------+
|  1 | Umlauttest     | ├╝├Â├ñ├£├û├ä├ƒ !"┬º$%&/()=?        | NULL       | 2005-07-07 00:26:01 | NULL |       NULL |           1 |
|  2 | Umlauttest2    | ÔÇ×ÔÇØ?┼íÔäó┼¢├í ÔÇÜ┼áãÆ ÔǪ┬á ÔÇ£ | NULL      | 2005-07-07 00:26:21 | NULL |       NULL |           1 |
|  3 | Umlauttest3    |                                    | NULL       | 2005-07-07 00:26:36 | NULL |       NULL |           1 |
+----+----------------+------------------------------------+------------+---------------------+------+------------+-------------+
3 rows in set (0.01 sec)

The first line was inserted using MySQL Query Browser and works fine, the second and third row was inserted using the command-line client. As you can see: after "set character set 'utf8'" it inserts an empty field.

The same behavior seems to occur when you try to use a SQL-file from the command-line like:

c:\projects\cip>mysql cip < src\conf\data.sql

at least unless data.sql has explicitly been encoded as UTF8 (using ActiveStates Komodo IDE which has excellent character set support). Unfortunately encoding the .sql file with Eclipse, Microsoft Notepad and Metapad all *didn't work* and roughly produced the same result as above. I'd guess it might be that all named products have different definitions of "Unicode"-Encoding (including the BOM or not, using UCS2 encoding instead of UTF8 and so on).

I wasn't able to produce a deterministic test suite of .sql files by now due to time constraints, so the command-line example from above i has to do for now.

How to repeat:
please see above

Suggested fix:
do not use the command-line client for any work with character sets that wasn't already available in mysql 4.0.
[6 Jul 2005 23:49] MySQL Verification Team
The issue here is the terminal:

c:\mysql\bin>mysql -uroot test
Welcome to the MySQL monitor.  Commands end with ; or \g.
Your MySQL connection id is 3 to server version: 4.1.12a-nt

Type 'help;' or '\h' for help. Type '\c' to clear the buffer.

mysql> create table messages (
    -> id int not null auto_increment primary key,
    -> messageSubject varchar(255),
    -> messageText text);
Query OK, 0 rows affected (0.05 sec)

mysql>
mysql> insert into messages values(0, "Umlauttest2", "äöüÜÖÄß éèâ àáô");
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> show warnings;
+---------+------+--------------------------------------------------+
| Level   | Code | Message                                          |
+---------+------+--------------------------------------------------+
| Warning | 1265 | Data truncated for column 'messageText' at row 1 |
+---------+------+--------------------------------------------------+
1 row in set (0.00 sec)

mysql> set character set 'utf8';
Query OK, 0 rows affected (0.00 sec)

mysql> insert into messages values(0, "Umlauttest2", "äöüÜÖÄß éèâ àáô");
Query OK, 1 row affected, 1 warning (0.00 sec)

mysql> select * from messages;
+----+----------------+-----------------------------------+
| id | messageSubject | messageText                       |
+----+----------------+-----------------------------------+
|  1 | Umlauttest2    | ÔÇ×ÔÇØ?┼íÔäó┼¢├í ÔÇÜ┼áãÆ ÔǪ┬áÔÇ£ |
|  2 | Umlauttest2    |                                   |
+----+----------------+-----------------------------------+
2 rows in set (0.00 sec)

mysql> set names cp1251;
Query OK, 0 rows affected (0.00 sec)

mysql> insert into messages values(0, "Umlauttest2", "äöüÜÖÄß éèâ àáô");
Query OK, 1 row affected (0.00 sec)

mysql> select * from messages;
+----+----------------+-----------------+
| id | messageSubject | messageText     |
+----+----------------+-----------------+
|  1 | Umlauttest2    | äö??Ö?? é?? àáô |
|  2 | Umlauttest2    |                 |
|  3 | Umlauttest2    | äöüÜÖÄß éèâ àáô |
+----+----------------+-----------------+
3 rows in set (0.00 sec)

mysql>
[7 Jul 2005 0:58] Jonas Maurus
Sorry, I don't get that. My issue is exactly that I expected the mysql command-line client to silently convert the terminal codepage (cp1252) to utf8 if the database is created with a "default character set 'utf8'" clause (as shown in the report).

As your table output shows, all characters that were inserted with the server expecting unicode have been garbled, or lost.

This means that for multi-language databases any SQL has to include a "set names" - hint to make sure the server expects the character set of the current terminal? or does the server then internally convert the input strings to utf8 and the client effectively looses the data? 

I didn't try the "set names" directive explicitly, but I tried setting charset_set_client and character_set_connection manually before...

I might have misunderstood the relevant documentation, if possible please clarify. Thanks a lot.