Bug #86710 Performance issue of explicit COLLATE clause
Submitted: 15 Jun 2017 8:57 Modified: 28 Jul 2017 2:01
Reporter: Yitao Xu Email Updates:
Status: Closed Impact on me:
None 
Category:MySQL Server: Optimizer Severity:S2 (Serious)
Version:8.0.2-dmr-enterprise-commercial-advanced OS:Linux
Assigned to: Xing Zhang CPU Architecture:Any
Tags: explicit COLLATE, Optimizer, performance

[15 Jun 2017 8:57] Yitao Xu
Description:
I created a table 
CREATE TABLE chartest.jpcharcollja_test (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `charcol1` VARCHAR(100) DEFAULT NULL,
    `charcol2` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_0900_AI_CI DEFAULT NULL,
    `charcol3` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_BIN DEFAULT NULL,
    `charcol4` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_JA_0900_AS_CS DEFAULT NULL,
    `charcol5` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_JA_0900_AS_CS_KS DEFAULT NULL,
    PRIMARY KEY (`id`)
)  ENGINE=INNODB DEFAULT CHARSET=UTF8MB4

The data of charcol1, charcol2 ... charcol5 are same, but collation are different.(In fact, charcol1 as same as charcol2)。
I execute a "select" clause.

mysql> select * from chartest.jpcharcollja_test order by charcol1;
...
300000 rows in set (0.60 sec)

Then, I explicit COLLATE the defaults collation, executed the "select" clause again
mysql> select * from chartest.jpcharcollja_test order by charcol1 collate UTF8MB4_0900_AI_CI;
...
300000 rows in set (1.68 sec)

Then, I changed a new collation, executed the "select" clause again
mysql> select * from chartest.jpcharcollja_test order by charcol1 collate UTF8MB4_JA_0900_AS_CS_KS;
...
300000 rows in set (53.79 sec)

Another pattern:
mysql> select * from chartest.jpcharcollja_test order by charcol5;
...
300000 rows in set (0.77 sec)

Then, I explicit COLLATE the new collation executed the "select" clause again
mysql> select * from chartest.jpcharcollja_test order by charcol5 collate UTF8MB4_JA_0900_AS_CS_KS;
...
300000 rows in set (54.05 sec)

------------------------------------------------------------------------------------------------------------------------------------------------------
For above results, I think maybe optimizer doesn't work for the explicit COLLATE. Even the collation as same as the column's collation when it's created by defaults. Execution time is twice the default.
For pattern 2, execution time is very poor.
This should be a very common application scenario, when the application wants to display the results in a different collation, usually do not change the table definition, but in the generated SQL to specify the collation.

How to repeat:
Step1 Create a table
CREATE TABLE chartest.jpcharcollja_test (
    `id` INT(11) NOT NULL AUTO_INCREMENT,
    `charcol1` VARCHAR(100) DEFAULT NULL,
    `charcol2` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_0900_AI_CI DEFAULT NULL,
    `charcol3` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_BIN DEFAULT NULL,
    `charcol4` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_JA_0900_AS_CS DEFAULT NULL,
    `charcol5` VARCHAR(100)CHARACTER SET UTF8MB4 COLLATE UTF8MB4_JA_0900_AS_CS_KS DEFAULT NULL,
    PRIMARY KEY (`id`)
)  ENGINE=INNODB DEFAULT CHARSET=UTF8MB4
Step2
insert data
Data are same for every column.
sample data:
# id, charcol1, charcol2, charcol3, charcol4, charcol5
10737, きたわら かずのり, きたわら かずのり, きたわら かずのり, きたわら かずのり, きたわら かずのり
163545, つのおか のぶや, つのおか のぶや, つのおか のぶや, つのおか のぶや, つのおか のぶや
264840, 井手, 井手, 井手, 井手, 井手
167580, もりかわ ひでや, もりかわ ひでや, もりかわ ひでや, もりかわ ひでや, もりかわ ひでや
193661, マスタ トシヒロ, マスタ トシヒロ, マスタ トシヒロ, マスタ トシヒロ, マスタ トシヒロ
48624, 増江 雄仁, 増江 雄仁, 増江 雄仁, 増江 雄仁, 増江 雄仁
182350, すえもり ひろただ, すえもり ひろただ, すえもり ひろただ, すえもり ひろただ, すえもり ひろただ
177443, 北里 隆嗣, 北里 隆嗣, 北里 隆嗣, 北里 隆嗣, 北里 隆嗣
149921, キタウラ ケンスケ, キタウラ ケンスケ, キタウラ ケンスケ, キタウラ ケンスケ, キタウラ ケンスケ
160313, ハコイ カツマサ, ハコイ カツマサ, ハコイ カツマサ, ハコイ カツマサ, ハコイ カツマサ

Step3 execute below SQL then confirm the execution time
select * from chartest.jpcharcollja_test order by charcol1;
select * from chartest.jpcharcollja_test order by charcol1 collate UTF8MB4_0900_AI_CI;
select * from chartest.jpcharcollja_test order by charcol1 collate UTF8MB4_JA_0900_AS_CS_KS;

select * from chartest.jpcharcollja_test order by charcol5;
select * from chartest.jpcharcollja_test order by charcol5 collate UTF8MB4_JA_0900_AS_CS_KS;

Suggested fix:
I think optimizer doesn't work for the explicit COLLATE.
[15 Jun 2017 8:58] Yitao Xu
sample data

Attachment: jpchartest_data.zip (application/zip, text), 1.10 MiB.

[16 Jun 2017 10:26] Umesh Shastry
Hello Yitao!

Thank you for the report and test case.

Thanks,
Umesh
[28 Jul 2017 2:01] Paul Dubois
Posted by developer:
 
Fixed in 8.0.3.

For a VARCHAR column, sorting using an explicit collation (ORDER BY
col_name COLLATE collation_name) was much slower than with an
implicit collation (no COLLATE clause), even if the explicit
collation was the same as the implicit collation.