Bug #64073 | mysql Connector J driver and unicode insertion into DB | ||
---|---|---|---|
Submitted: | 19 Jan 2012 18:10 | Modified: | 6 Apr 2012 17:01 |
Reporter: | David Mosel | Email Updates: | |
Status: | Not a Bug | Impact on me: | |
Category: | Connector / J | Severity: | S3 (Non-critical) |
Version: | mysql-connector-java-5.1.18-bin.jar | OS: | MacOS (linux db mac os x client) |
Assigned to: | CPU Architecture: | Any |
[19 Jan 2012 18:10]
David Mosel
[19 Jan 2012 18:33]
Sveta Smirnova
Thank you for the report. > I can send some source code and test data to repeat the error. Please do.
[19 Jan 2012 18:34]
David Mosel
I uploaded the source isv.tar.gz to the anonymous ftp site ftp ftp://ftp.oracle.com/support/incoming/ Connected to bigip-ftp.oracle.com. ftp> binary 200 Switching to Binary mode. ftp> put isv.tar.gz local: isv.tar.gz remote: isv.tar.gz 229 Entering Extended Passive Mode (|||62195|) 150 Ok to send data. 100% |***********************************| 3540 KiB 121.58 KiB/s 00:00 ETA 226 File receive OK. 3625286 bytes sent in 00:30 (115.58 KiB/s) ftp> bye 221 Goodbye.
[26 Jan 2012 10:01]
Tonci Grgin
David. Can you please reduce your test case to just a code that creates the table and stuff and attach it to this report. Please make the test complete and standalone so I can check. From what I see, your charset_results is set to Latin1 (or similar single-byte charset) thus you're receiving result as sequence of bytes rather than utf8 stream. You can confirm this by looking into general query log on server. If you're using prepared statements for this, use server side ones.
[27 Jan 2012 1:43]
David Mosel
Tonci, Here is the code to create the user and table. It is also stored in text files in the "sql" directory. create table short_abstract( id int not null auto_increment primary key, line_number int, topic varchar(4000), text longtext ) ; create user 'stg_sasca'@'localhost' identified by 'stg_sasca'; create user 'stg_sasca'@'%' identified by 'stg_sasca'; grant select, create, insert, update, delete, drop on stg_sasca.* to 'stg_sasca'@'localhost' ; grant select, create, insert, update, delete, drop on stg_sasca.* to 'stg_sasca'@'%' ; The code should be stand alone. Please edit src/parseload/ShortAbstractPL.java and change the ip number to your ip number. You'll see "_CHANGE_ME_" in the comment and save the file. Then type: ant build then type: parsload.sh test_data.en.nq The test file (test_data.en.nq) contains two records, the first will go into the DB. The second is there to show the error. Thank you.
[7 Feb 2012 9:05]
Tonci Grgin
David, Can you attach test case and the supporting files to the bug report directly and not on Oracle ftp please.
[11 Feb 2012 22:11]
David Mosel
Tonci, Because of the size restrictions on this site, I cannot attach the files directly. The ftp file is large because it includes all the jars needed to compile and run the code. There is actually only a single source file. Please let me know if you have any trouble compiling. It should just be "ant build". The code runs fine with the oracle ojdbc6.jar driver, but with the mysql driver it gives the error I submitted. The code I submitted is much smaller than the working code. I will try to make time to build a smaller version that does not require the libraries. But it would save time if you could use the version I submitted.
[19 Feb 2012 15:49]
David Mosel
Tonci, I've written a new single source file test that does not use any code library except for the mysql Connector/J mysql-connector-java-5.1.18-bin.jar. Create a table called unicode_test: create table unicode_test( text longtext ) ; and make sure everything is utf8 in your database. To compile: extract the source : tar xfz b64073.tar.gz then get a copy of mysql-connector-java-5.1.18-bin.jar and copy into the "b64073/external" directory. then type: ant build To run it type : ./unicodetest.sh and it will give you a usage message. You pass in the IP Address, username, password, and database name on the command line. Here's an example: ./unicodetest.sh 192.168.56.10 stg_sasca stg_sasca stg_sasca It will give the same error originally submitted: java.sql.SQLException: Incorrect string value: '\xF0\x92\x82\x97' for column 'text' at row 1 The issue is with inserting the the Unicode character "Cuneiform Sign EN". Here's a link http://www.fileformat.info/info/unicode/char/12097/index.htm
[19 Feb 2012 15:51]
David Mosel
source to recreate bug #64073
Attachment: b64073.tar.gz (application/x-gzip, text), 3.00 KiB.
[31 Mar 2012 8:35]
Sveta Smirnova
Thank you for the test case, but this is not a bug. You have following errors: 1. This character is 4-bit UTF8 symbol supported in MySQL character encoding utf8mb4, but not in utf8 2. You had not created the table with option charset, therefore it uses default character set which is latin1 and can not store UTF8 symbols. 3. You set connection string in wrong way. To find out how to set it up properly please read bug #57694 (last paragraph is most important): [2 Dec 2010 11:15] Tony Bedford Main docs have been updated with details of using 3-byte and 4-byte utf8 with Connector/J. An entry has also been added to the 5.1.14 changelog: Connector/J mapped both 3-byte and 4-byte UTF8 encodings to the same Java UTF8 encoding. To use 3-byte UTF8 with Connector/J set characterEncoding=utf8 and set useUnicode=true in the connection string. To use 4-byte UTF8 with Connector/J configure the MySQL server with character_set_server=utf8mb4. Connector/J will then use that setting as long as characterEncoding has not been set in the connection string. This is equivalent to autodetection of the character set.
[6 Apr 2012 17:01]
David Mosel
Thanks Sveta. I'm using mysql 5.1 and I read that utf8mb4 was added to 5.5. I'll upgrade to 5.5. Thanks to everyone who looked at this.