MySQL Bugs: #49203: Latin1 German collation using CCSID 1148 in DB2

Bug #49203	Latin1 German collation using CCSID 1148 in DB2
Submitted:	30 Nov 2009 11:19	Modified:	1 Jul 2010 9:34
Reporter:	Martin Hieden	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: DB2SE for IBM i	Severity:	S3 (Non-critical)
Version:	5.1.39	OS:	IBM i
Assigned to:	Tim Clark	CPU Architecture:	Any
Tags:	latin1_german1_ci

Description:
Creating a table with latin1_german1_ci collation it results in the wrong CCSID in DB2i.

To use it properly the field should have a CCSID of 1141 (german), instead it has 1148 (international).

So you'd have to convert special characters yourself.

How to repeat:
CREATE DATABASE `coll` DEFAULT CHARACTER SET latin1 COLLATE latin1_german1_ci;
CREATE TABLE `coll`.`test_tbl` (
`test_fld` VARCHAR( 10 ) CHARACTER SET latin1 COLLATE latin1_german1_ci NOT NULL
)
ENGINE = IBMDB2I CHARACTER SET latin1 COLLATE latin1_german1_ci;

Can you provide an example of problematic data and expand more on the specific problem that is occurring? Is the problem when inserting data from DB2? Or are you unable to insert specific characters from MySQL? I believe that CCSIDs 1141 and 1148 have the same characters. They are just mapped to different code points.

Sorry for not being clear enough. You're right about the same characters in both CCSIDs 1141 and 1148. But the problem is, that they switched places, so I won't get the same characters from both sides (MySQL and DB2).

1)
Inserting data via MySQL:
INSERT INTO `coll`.`test_tbl` (
`test_fld` 
)
VALUES (
'äöüÄÖÜß'
);

Results in the following data in DB2:
{¦}[\]~   
4CD6EF5
3CC3CC9

===========

2)
Inserting data via DB2:
INSERT INTO "coll"."test_tbl" VALUES('äöüÄÖÜß')

Results in the following data in MySQL:
{¦}[\]~
C6D4E5A
0A0A0A1

Currently, the IBMDB2I storage engine is coded to use only the character set when determining the appropriate CCSID. The collation is used to determine the sort sequence but does not affect the CCSID. For this reason, all of the latin1_* collations end up as CCSID 1148. Changing this behavior comprehensively would require significant changes to the storage engine and new sort sequences in DB2 and would have uncertain backward compatibility.

If your usage of the DB2 tables requires the hex values of the data to be the equivalent of CCSID 1141, one solution may be to put a view over the fields so that the conversion to 1141 is done implicitly through the view.

If I understand you correctly, MySQL and/or IBM won't do anything to fix this problem.

May I suggest a solution nevertheless? For me it would be sufficient, if the CCSID was not hardcoded but exported into a config variable (i.e. ibmdb2i.latin1_ccsid) with a default value of 1148. No problems with backwards compatibility should arise therefore and no major code changes to set the CCSID accordingly to the collation are needed. As far as I understand the sorting tables, the ones used for 1148 could be used for 1141 without modification, since they contain the same characters, as you already pointed out.

Unfortunately the suggested solution is not so simple. Sort sequences are built with associations between the hex value of characters and their weights.  So even though the same characters are in the CCSIDs, they're not at the same hex position and you'd end up with unexpected sorting. To make this work, a new sort sequence would need to be provided. Furthermore, in order to obtain the correct iconv descriptors to perform the EBCDIC<->ASCII conversion, each possible value for a theoretical ibmdb2i.latin1_ccsid would have to be hard-coded into the storage engine. (You can see how this is done for 1148 in the db2i_charsetSupport.cc module today.) 

It seems that having the expected hex values is important to your application. Is there a reason that creating a DB2 view over the table to provide these hex values is not a reasonable workaround?

This is really heavy stuff and I want to make some comments here.

MySQL itself implemented charset and collations by its own.

MySQL is more decades old and once started with a single native Swedish speaking developer.

Swedish is supported from LATIN1.

This means, LATIN1 was the first implemented character set and latin1_swedish_ci the first collation.

At the middle of second half of last century, when monitors were green or amber and most computers only support capital letters getting informations about encoding/charset/collation was difficult.

Today everybody is able to use wikipedia and quickly will find the correct definition of LATIN1:

http://de.wikipedia.org/wiki/ISO_8859-1 (German version)
http://en.wikipedia.org/wiki/ISO/IEC_8859-1 (English version)

You see LATIN1 should be ISO-8859-1

Hex values of äöüÄÖÜß using LATIN1 should be E4 F6 FC C4 D6 DC DF

Unfortunately, MySQL didn't copy ISO-8859-1 here for its LATIN1 implementation. Microsoft Codepage 1252 was copied here. Which is almost similar to ISO-8859-1.

It is still on our to do list to implement LATIN9 which will be similar to ISO-8859-15.

IBM always have had its own standards here and not just copied from ISO.

Anyway, on topic LATIN1 MySQL is much closer to ISO.

Neither we nor IBM will be able to change LATIN1. For MySQL I can say that we will get huge trouble when we just would try to change a single sign. Incompatibility to older MySQL versions just would be one of the follows.

My guess is that same problems and trouble will occur when IBM trys to change something here.

German collation ...

that is another huge topic.

Ad hoc I remember 6 European Countries where German is one of the native languages; Austria, Belgium, Germany, Liechtenstein,Luxembourg and Switzerland.

Already Germany has two different sorting rules. One made by Konrad Duden and the other originally made for phone books.

You will find these two rules here (only in German):
http://de.wikipedia.org/wiki/DIN_5007

Of course Austria has different sorting rules and Switzerland has different sorting rules from Austria and Germany. Additionally, one of the smaller countries (afair Liechtenstein) has its own rules.

For LATIN1 + German MySQL implemented correctly the Germany phone book rules (DIN 5007-2) and called it latin1_german2_ci.

The MySQL collation latin1_german1_ci always just use a single sign as replacement for the umlauts. Means it is conform with the Duden rules (DIN 5007-1) besides for the letter ß. According to the DIN ß should be replaced with double s but in latin1_german1_ci it is just replaced with s.

Anyway, of course, collations are implemented by using the hex values and so it is totally clear why IBM latin1 won't work with MySQL collations. IBM is using different hex values and so sorting is sucking.

The only solution that I am able to see here would be that we could try to implement collations for the IBM charsets. But that would mean that we need to implement collations for every language. A really huge effort for a dying charset. Too huge in my eyes, but before I will set this to "Won't fix" I will ask our development.

And yes, because of UNICODE all the ISO character sets are on the way to die.

Additionally, I am able to see two workarounds here:

Unfortunately, I am not sure here how IBM implemented utf8. But with a really high chance it is UNICODE conform implemented. Which means utf8 should be same in MySQL and IBM.

So, switching to UTF8 maybe would be a solution.

You could use utf8_general_ci here. It also replaces only with single signs so that ä=a, ö=o, ü=u and ß=s. If you want this but double s for ß then you could use utf8_unicode_ci.

Additionally, we are on the way to implement some more utf8 collations for German language.

The second workaround is that MySQL supports user defined collations. You just could create your own collation.

How this works is described here:
http://forge.mysql.com/wiki/How_to_Add_a_Collation

Hi Martin,

can you please clarify some details:

1. Do you dislike only HEX codes? Or do you actually
   see wrong characters on the screen?

2. Which OS and which client tool are you using on the MySQL side?
   What does "show variables like 'character_set%';" returns for you?

3. Which OS and which client tool are you using on the IBM side?

4. What does the string "{¦}[\]~" mean?
   Are they the characters you see on the screen,
   instead of the expected German characters?

5. You told how the INSERT queries look like on the both side.
   Can you also tell how exactly the SELECT queries look like, 
   both on MySQL and IBM side.

Thanks!

Martin,

Please also run this query:

SELECT hex(test_fld), test_fld FROM coll.test_tbl;

Both after inserting on MySQL side, and after inserting on IBM side.

What results do you get?

Thanks!

@Tim:
Your suggested workaround isn't useable for us, because the conversion has to be done on a field by field basis as far as I know. Therefore everytime a table changes the according view had to be changed also, what results in really tedious maintainance work, not only for me but for everyone who wants to use a CCSID other than 1148.

@Susanne:
- UTF-8 is what we are also trying to use and there are two problems:
1.) We've got serious performance problems, which seem to come from badly optimized queries, which in turn stem from not working indizes. (Bug: )
2.) We have an established iSeries product that is written in RPG and uses single byte charset throughout the application. Combining this with UTF-8 and therefore GRAPHIC-fields in DB2 is no easy task either

- I don't know at the moment how a user defined collation could help in giving my RPG-Application acces to the right HEX-values, but i have to further look into this topic to really understand what a new collation can do and what not.

@Alexander:
Hi Alexander,
I edited the answer to your questions directly into your post, hope this is ok for you.

1. Do you dislike only HEX codes?
   --> What do you mean by that?
   
   Or do you actually see wrong characters on the screen?
   --> Yes

2. Which OS and which client tool are you using on the MySQL side?
   --> MySQL runs in i5/OS V6R1 PASE. I used phpmyadmin and bin/mysql, both with the same result.

   What does "show variables like 'character_set%';" returns for you?
   -->
   *************************** 1. row ***************************
   Variable_name: character_set_client                           
           Value: latin1                                         
   *************************** 2. row ***************************
   Variable_name: character_set_connection                       
           Value: latin1                                         
   *************************** 3. row ***************************
   Variable_name: character_set_database                         
           Value: latin1                                         
   *************************** 4. row ***************************
   Variable_name: character_set_filesystem
           Value: binary
   *************************** 5. row ***************************
   Variable_name: character_set_results
           Value: latin1
   *************************** 6. row ***************************
   Variable_name: character_set_server
           Value: latin1
   *************************** 7. row ***************************
   Variable_name: character_set_system
           Value: utf8
   *************************** 8. row ***************************
   Variable_name: character_sets_dir
           Value: /QOpenSys/usr/local/mysql/mysql-5.1.39-i5os-power-64bit/share/charsets/

3. Which OS and which client tool are you using on the IBM side?
   --> i5/OS V6R1, STRSQL

4. What does the string "{¦}[\]~" mean?
   Are they the characters you see on the screen,
   instead of the expected German characters?
   --> Yes

5. You told how the INSERT queries look like on the both side.
   Can you also tell how exactly the SELECT queries look like, 
   both on MySQL and IBM side.
   --> MySQL: 'SELECT * FROM coll.test_tbl'
       IBM:   'SELECT * FROM "coll"."test_tbl"'
       Hex-values were obtained with 'DSPPFM FILE("coll"/"test_tbl")' and F10
       and F11

6. Please also run this query:
   SELECT hex(test_fld), test_fld FROM coll.test_tbl;
   Both after inserting on MySQL side, and after inserting on IBM side.
   What results do you get?
   -->
   After inserting from MySQL
   +----------------+----------+
   | hex(test_fld)  | test_fld |
   +----------------+----------+
   | E4F6FCC4D6DCDF | äöüÄÖÜß  |
   +----------------+----------+
   
   After inserting from IBM
   +----------------+----------+
   | hex(test_fld)  | test_fld |
   +----------------+----------+
   | E4F6FCC4D6DCDF | äöüÄÖÜß  |
   | 7BA67D5B5C5D7E | {¦}[\]~  |
   +----------------+----------+

Thanks for your efforts.

@Susanne:
Sorry, forgot to mention the bug id in the previous comment. Of course it's 49205.

Martin,

I've done some further investigation, and I'm seeing behavior similar to what you have seen. I believe that the problem lies with the use of STRSQL. I'm not sure yet whether this is "correct" behavior for STRSQL, but I can see how it is not what you would like to see. 

Ignoring STRSQL for the moment, have you confirmed that the having the column tagged as 1148 is truly a problem for your applications? If your RPG or similar apps are opening the file enabled for character translation to 1141, you should be receiving the data as you expect. That is the very purpose for tagging columns with a particular CCSID. For example, if I use the Run SQL Scripts tool in iSeries Navigator, I get the correct results for the sample data that you provided.

Please try accessing the data through an interface other than STRSQL and let us know whether you are still are having problems.

Thank you Tim for pointing me in the right direction. The problem is solved as afar as I'm concerned.

Seeing that the iSeries Navigator showed me the right characters and getting the wrong ones in my RPG application perplexed me finally. So checking all system values all over again, I found out that our QCCSID was set to 65535 and changing this to 1141 solved the problem.

Sorry for bothering you with this bug report, although may I suggest adding this behaviour (and it's solution) to the Redbook 'Using IBM DB2 for i as a Storage Engine of MySQL (SG24-7705)' in chapter 2.10 'National language support in IBMDB2I'? I believe that I'm not the only one on the i5, who didn't have to do anything with CCSIDs before starting to use MySQL.

So long story short,
a big THANK YOU to all of you
and Merry Christmas and a Happy New Year!

Should have been closed long ago.