MySQL Bugs: #45803: Inaccurate estimates for partial key values with IBMDB2I

Bug #45803	Inaccurate estimates for partial key values with IBMDB2I
Submitted:	28 Jun 2009 2:57	Modified:	14 Jul 2009 15:03
Reporter:	Tim Clark	Email Updates:
Status:	Closed	Impact on me:	None
Category:	MySQL Server: DB2SE for IBM i	Severity:	S3 (Non-critical)
Version:	5.1.35	OS:	IBM i
Assigned to:	Tim Clark	CPU Architecture:	Any
Tags:	ibmdb2i, Optimizer, partial key

Description:
Some collations cause IBMDB2I to report inaccurate key range estimations to the optimizer for LIKE clauses that select substrings. This can be seen by running EXPLAIN. This problem primarily affects multi-byte and unicode character sets. Examples of several affected collations are shown below.

How to repeat:
In each of the following cases, EXPLAIN should use the key explicitly designated by "force index" and return 6 for the estimated number of rows. 

CREATE TABLE t1 (ascii_bin integer, c char(10), v varchar(20), index(c), index(v)) collate ascii_bin engine=ibmdb2i;
insert into t1 (c,v) values ("abc","def"),("abcd", "def"),("abcde","defg"),("aaaa","bbbb");
insert into t1 select * from t1;
explain select c,v from t1 force index(v) where v like "de%";
id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	t1	ALL	v	NULL	NULL	NULL	8	Using where

CREATE TABLE t1 (cp1250_bin integer, c char(10), v varchar(20), index(c), index(v)) collate cp1250_bin engine=ibmdb2i;
insert into t1 (c,v) values ("abc","def"),("abcd", "def"),("abcde","defg"),("aaaa","bbbb");
insert into t1 select * from t1;
explain select c,v from t1 force index(c) where c like "ab%";
id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	t1	ALL	c	NULL	NULL	NULL	8	Using where

CREATE TABLE t1 (utf8_estonian_ci integer, c char(10), v varchar(20), index(c), index(v)) collate utf8_estonian_ci engine=ibmdb2i;
insert into t1 (c,v) values ("abc","def"),("abcd", "def"),("abcde","defg"),("aaaa","bbbb");
insert into t1 select * from t1;
explain select c,v from t1 force index(c) where c like "ab%";
id	select_type	table	type	possible_keys	key	key_len	ref	rows	Extra
1	SIMPLE	t1	range	c	c	31	NULL	1	Using where

Suggested fix:
Fix the way these partial keys are estimated so that good results are returned to the optimizer.

A patch for this bug has been committed. After review, it may
be pushed to the relevant source trees for release in the next
version. You can access the patch from:

  http://lists.mysql.com/commits/77987

3000 V Narayanan	2009-07-06
      Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
      
      Some collations were causing IBMDB2I to report
      inaccurate key range estimations to the optimizer
      for LIKE clauses that select substrings. This can
      be seen by running EXPLAIN. This problem primarily
      affects multi-byte and unicode character sets.
      
      This patch involves substantial changes to several
      modules. There are a number of problems with the
      character set and collation handling. These problems
      have been or are being fixed,  and a comprehensive
      test has been included which should provide much
      better coverage than there was before. This test
      is enabled only for IBM i 6.1, because that version
      has support for the greatest number of collations.
     @ mysql-test/suite/ibmdb2i/r/ibmdb2i_collations.result
        Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
        
        result file for test case.
     @ mysql-test/suite/ibmdb2i/t/ibmdb2i_collations.test
        Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
        
        Tests for character sets and collations. This test
        is enabled only for IBM i 6.1, because that version
        has support for the greatest number of collations.
     @ storage/ibmdb2i/db2i_conversion.cc
        Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
        
        - Added support in convertFieldChars to enable records_in_range
          to determine how many substitute characters were inserted and
          to suppress conversion warnings.
        
        - Fixed bug which was causing all multi-byte and Unicode fields
          to be created as UTF16 (CCSID 1200) fields in DB2. The corrected
          code will now create UCS2 fields as UCS2 (CCSID 13488), UTF8
          fields (except for utf8_general_ci) as UTF8 (CCSID 1208), and
          all other multi-byte or Unicode fields as UTF16.  This will only
          affect tables that are newly created through the IBMDB2I storage
          engine. Existing IBMDB2I tables will retain the original CCSID
          until recreated. The existing behavior is believed to be
          functionally correct, but it may negatively impact performance
          by causing unnecessary character conversion. Additionally, users
          accessing IBMDB2I tables through DB2 should be aware that mixing 
          tables created before and after this change may require extra type
          casts or other workarounds.  For this reason, users who have
          existing IBMDB2I tables using a Unicode collation other than
          utf8_general_ci are encouraged to recreate their tables (e.g.
          ALTER TABLE t1 ENGINE=IBMDB2I) in order to get the updated CCSIDs
          associated with their DB2 tables.
        
        - Improved error reporting for unsupported character sets by forcing
          a check for the iconv conversion table at table creation time,
          rather than at data access time.
     @ storage/ibmdb2i/db2i_myconv.h
        Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
        
        Fix to set errno when iconv fails.
     @ storage/ibmdb2i/db2i_rir.cc
        Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
        
        Significant improvements were made to the records_in_range code
        that handles partial length string data in keys for optimizer plan
        estimation. Previously, to obtain an estimate for a partial key
        value, the implementation would perform any necessary character
        conversion and then attempt to determine the unpadded length of
        the partial key by searching for the minimum or maximum sort
        character. While this algorithm was sufficient for most single-byte
        character sets, it did not treat Unicode and multi-byte strings
        correctly. Furthermore, due to an operating system limitation,
        partial keys having UTF8 collations (ICU sort sequences in DB2)
        could not be estimated with this method.
        
        With this patch, the code no longer attempts to explicitly determine
        the unpadded length of the key. Instead, the entire key is converted
        (if necessary), including padding, and then passed to the operating
        system for estimation. Depending on the source and target character
        sets and collations, additional logic is required to correctly
        handle cases in which MySQL uses unconvertible or differently
        -weighted values to pad the key. The bulk of the patch exists
        to implement this additional logic.
     @ storage/ibmdb2i/ha_ibmdb2i.h
        Bug#45803 Inaccurate estimates for partial key values with IBMDB2I
        
        The convertFieldChars declaration was updated to support additional 
        optional behaviors.

Pushed into 5.1.37 (revid:joro@sun.com-20090708131116-kyz8iotbum8w9yic) (version source revid:joro@sun.com-20090707141421-xnqa7o1gvxjrsn3n) (merge vers: 5.1.37) (pib:11)

Pushed into 5.1.37 (revid:joro@sun.com-20090708131116-kyz8iotbum8w9yic) (version source revid:joro@sun.com-20090707141421-xnqa7o1gvxjrsn3n) (merge vers: 5.1.37) (pib:11)

Pushed into 5.4.4-alpha (revid:anozdrin@bk-internal.mysql.com-20090710111017-bnh2cau84ug1hvei) (version source revid:patrick.crews@sun.com-20090707142559-db036i153033lorm) (merge vers: 5.4.4-alpha) (pib:11)

An entry has been added to the 5.1.37 and 5.4.4 changelog: 

Some collations were causing IBMDB2I to report inaccurate key range estimations to the optimizer for LIKE clauses that select substrings. This can be seen by running EXPLAIN. This problem primarily affects multi-byte and unicode character sets

Noted in 5.4.2 changelog because next 5.4 version will be 5.4.2 and not 5.4.4.

Ignore previous comment about 5.4.2.

Pushed into 5.1.37-ndb-7.0.8 (revid:jonas@mysql.com-20090826132541-yablppc59e3yb54l) (version source revid:jonas@mysql.com-20090826132541-yablppc59e3yb54l) (merge vers: 5.1.37-ndb-7.0.8) (pib:11)

Pushed into 5.1.37-ndb-6.3.27 (revid:jonas@mysql.com-20090826105955-bkj027t47gfbamnc) (version source revid:jonas@mysql.com-20090826105955-bkj027t47gfbamnc) (merge vers: 5.1.37-ndb-6.3.27) (pib:11)

Pushed into 5.1.37-ndb-6.2.19 (revid:jonas@mysql.com-20090825194404-37rtosk049t9koc4) (version source revid:jonas@mysql.com-20090825194404-37rtosk049t9koc4) (merge vers: 5.1.37-ndb-6.2.19) (pib:11)

Pushed into 5.1.35-ndb-7.1.0 (revid:magnus.blaudd@sun.com-20090827163030-6o3kk6r2oua159hr) (version source revid:jonas@mysql.com-20090826132541-yablppc59e3yb54l) (merge vers: 5.1.37-ndb-7.0.8) (pib:11)

The 5.4 fix has been pushed to 5.4.2.