MySQL Bugs: #65243: Substring function very slow for longer strings

Bug #65243	Substring function very slow for longer strings
Submitted:	8 May 2012 17:50	Modified:	8 Jan 2020 22:51
Reporter:	Franjo Markovic	Email Updates:
Status:	Can't repeat	Impact on me:	None
Category:	MySQL Server: DML	Severity:	S5 (Performance)
Version:	5.6.4	OS:	Windows
Assigned to:		CPU Architecture:	Any

Description:
Substring function takes a long time to return result. This makes parsing any bigger text inside stored procedures quite unworkable.

How to repeat:

Create a string of 75 KB (or more):

	set @str1 = repeat("abcgd",15000);

Then try to analyse 25K of text from it:

	select benchmark(25000, substring(@str1,50000,1));

this would take about 6 seconds on my quad-core 3.3GHZ pc, or 8 seconds on 2.6 GHz 8-core power mac. Search twice as deep (in twice longer text) and it will be about twice slower. All such operations can be done in few miliseconds with either java or visual basic.
This works the same in 5.6.5. I guess that 5.1 was faster, but have no way to verify it any more.

Suggested fix:
Please try to improve performance. It should not take that much time to count letters in a string.

Please try 5.6.5-m8 and print here your result. Thank you.

It's the same with 5.6.5 - about 6.6 seconds on windows pc.

Here's the testcase I used:

	select version();
	set names latin1; # note!
	set @str1 = repeat("a",1000000);
	select benchmark(25000, substring(@str1,50000,1));
	select benchmark(25000, substring(@str1,50000,1));
	select benchmark(25000, substring(@str1,50000,1));

Results for latin1:
-------------------------------------
5.5.24:  2.00 sec, 2.00 sec, 2.02 sec
5.1.62:  1.98 sec, 1.98 sec, 2.00 sec
5.6.5:   2.00 sec, 2.00 sec, 1.98 sec
5.0.96:  1.97 sec, 1.97 sec, 1.97 sec

Results for utf8 (set names utf8 in testcase):
----------------------------------------------
5.5.24:  11.03 sec, 11.02 sec, 11.00 sec
5.1.62:   9.75 sec,  9.70 sec,  9.69 sec
5.6.5:   10.00 sec,  9.83 sec,  9.83 sec
5.0.96:   9.86 sec,  9.86 sec, 10.08 sec

There is no regression here.
The 5.5 and 5.6 client set charset automatically depending on your OS/environment.  This is probably why you have worse results than 5.1 who uses latin1.  Please retest by explicitly setting charset in all versions..

Are those times in miliseconds? The don't seem to make much sense to me (but I don't have access to source code).
Can you please compare the same times when running very similar query that takes 0.01 seconds (instead of 6 seconds):

select benchmark(25000, substring(@str1,2,1));

Here's the new testcase I used:

	select version();
	set names utf8; # note!
	set @str1 = repeat("a",1000000);
	select benchmark(25000, substring(@str1,2,1));
	select benchmark(25000, substring(@str1,2,1));
	select benchmark(25000, substring(@str1,2,1));

Results for latin1:
-------------------------------------
5.5.24:  2.02 sec, 2.05 sec, 2.00 sec
5.1.62:  2.00 sec, 1.99 sec, 1.97 sec
5.6.5:   2.00 sec, 2.00 sec, 1.98 sec
5.0.96:  2.09 sec, 2.09 sec, 2.09 sec

Results for utf8 (set names utf8 in testcase):
----------------------------------------------
5.5.24:  2.00 sec, 2.00 sec, 2.00 sec
5.1.62:  1.99 sec, 1.99 sec, 1.99 sec
5.6.5:   2.02 sec, 2.02 sec, 1.98 sec
5.0.96:  1.98 sec, 1.99 sec, 1.98 sec

It looks like this report is simply a feature request to improve the performance of utf8 string processing.  For long strings, latin1 does a straight memcpy, where as utf8 must change each char one at a time.

We are getting closer. Those 2 seconds you reported is probably just the time needed to create @str1 value.
Yes, the issue is about utf8 only. Why does it have to go one char at a time? It's like a doing full table scan each time to retrieve one record (letter). It seems to me that the storage of utf8 characters in memory is inadequate. Most systems store unicode characters as fixed 3 bytes in memory, to avoid such counting. Does mysql not do it?

Posted by developer:
 
The performance problem is not reproducible in MySQL 8.0.18.

The below table shows performance for latin1, utf8mb3 (utf) and utfmb4 in 5.7 and 8.0.
As seen, performance for utf8mb4 is almost as good as for latin1.
utf8mb4 is the recommended replacement for utf8mb3 in MySQL 8.0.

Results for latin1 (100000 iterations)
-------------------------------------
5.7.29 6.41, 6.39, 6.37
8.0.18 6.31, 6.31, 6.30

Results for utf8mb3 (100000 iterations)
---------------------------------------
5.7.29 18.10, 18.04, 18.04
8.0.18 18.00, 18.00, 17.98

Results for utf8mb4 (100000 iterations)
---------------------------------------
5.7.29 23.93, 23.93, 23.92
8.0.18 6.56, 6.56, 6.55