MySQL Bugs: #86517: Tend to ER_OUT_OF_SORTMEMORY on UTF8MB4_UNICODE

Bug #86517	Tend to ER_OUT_OF_SORTMEMORY on UTF8MB4_UNICODE_CI much frequently (than UTF8MB4
Submitted:	31 May 2017 8:02	Modified:	13 Jun 2017 16:45
Reporter:	Seunguck Lee	Email Updates:
Status:	Verified	Impact on me:	None
Category:	MySQL Server: Charsets	Severity:	S3 (Non-critical)
Version:	5.7.17	OS:	Any
Assigned to:		CPU Architecture:	Any
Tags:	ER_OUT_OF_SORTMEMORY, strnxfrmlen, utf8mb4_unicode_ci

Description:
I don't know this is bug or not.
But strnxfrmlen() of UTF8MB4_UNICODE_CI column return much bigger sort length than UTF8MB4_GENERAL_CI.

For example, in my case
  strnxfrmlen() of VARCHAR(30) UTF8MB4_GENERAL_CI column return 60, 
  But strnxfrmlen() of VARCHAR(30) UTF8MB4_UNICODE_CI column return 960.

So ER_OUT_OF_SORTMEMORY(1038 Out of sort memory, consider increasing server sort buffer size) is take place much frequently
Mysqld needs 16x times bigger memory space to sort when column's collation is UTF8MB4_UNICODE_CI.

If this is expected case, I want you add some explanation about increasing sort_buffer_size when UTF8MB4_UNICODE_CI column is used.
And still I don't understand why mysqld need 16x times more memory buffer for UTF8MB4_UNICODE_CI column to sort.

We can avoid this case with bigger sort_buffer_size, but we need to set much bigger size of sort_buffer_size to sort big varchar columns.

How to repeat:

CREATE TABLE `tab_unicode` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `fd1` int(11) DEFAULT NULL,
  `fd2` varchar(30) COLLATE UTF8MB4_UNICODE_CI DEFAULT NULL,
  `fd3` varchar(30) COLLATE UTF8MB4_UNICODE_CI DEFAULT NULL,
  `fd4` varchar(40) COLLATE UTF8MB4_UNICODE_CI DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

CREATE TABLE `tab_general` (
  `id` bigint(20) NOT NULL AUTO_INCREMENT,
  `fd1` int(11) DEFAULT NULL,
  `fd2` varchar(30) COLLATE UTF8MB4_GENERAL_CI DEFAULT NULL,
  `fd3` varchar(30) COLLATE UTF8MB4_GENERAL_CI DEFAULT NULL,
  `fd4` varchar(40) COLLATE UTF8MB4_GENERAL_CI DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB;

insert into tab_unicode select null, ordinal_position, table_schema, table_name, column_name from information_schema.columns;
insert into tab_general select null, ordinal_position, table_schema, table_name, column_name from information_schema.columns;

set sort_buffer_size = 32*1024;
set optimizer_trace="enabled=on";

select 1 from tab_unicode order by fd2, fd3, fd4 limit 1000; -- // this query will throw ER_OUT_OF_SORTMEMORY
select 1 from tab_general order by fd2, fd3, fd4 limit 1000;

select trace from information_schema.optimizer_trace;
-- // check row_size field of json result. and compare

Or you can debug strnxfrmlen().

Suggested fix:
If this is expected case then I want you add some explanation about increasing sort_buffer_size when UTF8MB4_UNICODE_CI column is used.
If not fix it.

The strnxfrmlen function I've mentioned is here(https://github.com/mysql/mysql-server/blob/5.7/sql/filesort.cc#L2342)

Hi!

Your report is correct. Only, I re-arranged your test case so that query with _GENERAL_CI is run before the one with _UNICODE_CI. And I have debugged in the debugger. First three values are lengths for general collation and second three for the unicode collation:

(lldb) print sortorder->length
(uint) $3 = 180
2017-06-13T16:39:15.074631Z 0 [Note] InnoDB: page_cleaner: 1000ms intended loop took 13077ms. The settings might not be optimal. (flushed=0 and evicted=0, during the time.)
(lldb) print sortorder->length
(uint) $4 = 180
(lldb) print sortorder->length
(uint) $5 = 400
(lldb) print sortorder->length
(uint) $6 = 2880
(lldb) print sortorder->length
(uint) $7 = 2880
(lldb) print sortorder->length
(uint) $8 = 3200

Unrelated to the issue is that your sort buffer is too small. Please, use our default.

Verified.