Bug #65243 Substring function very slow for longer strings
Submitted: 8 May 2012 17:50 Modified: 8 Jan 2020 22:51
Reporter: Franjo Markovic Email Updates:
Status: Can't repeat Impact on me:
None 
Category:MySQL Server: DML Severity:S5 (Performance)
Version:5.6.4 OS:Windows
Assigned to: CPU Architecture:Any

[8 May 2012 17:50] Franjo Markovic
Description:
Substring function takes a long time to return result. This makes parsing any bigger text inside stored procedures quite unworkable.

How to repeat:

Create a string of 75 KB (or more):

	set @str1 = repeat("abcgd",15000);

Then try to analyse 25K of text from it:

	select benchmark(25000, substring(@str1,50000,1));

this would take about 6 seconds on my quad-core 3.3GHZ pc, or 8 seconds on 2.6 GHz 8-core power mac. Search twice as deep (in twice longer text) and it will be about twice slower. All such operations can be done in few miliseconds with either java or visual basic.
This works the same in 5.6.5. I guess that 5.1 was faster, but have no way to verify it any more.

Suggested fix:
Please try to improve performance. It should not take that much time to count letters in a string.
[8 May 2012 19:04] MySQL Verification Team
Please try 5.6.5-m8 and print here your result. Thank you.
[8 May 2012 19:14] Franjo Markovic
It's the same with 5.6.5 - about 6.6 seconds on windows pc.
[9 May 2012 3:49] MySQL Verification Team
Here's the testcase I used:

	select version();
	set names latin1; # note!
	set @str1 = repeat("a",1000000);
	select benchmark(25000, substring(@str1,50000,1));
	select benchmark(25000, substring(@str1,50000,1));
	select benchmark(25000, substring(@str1,50000,1));

Results for latin1:
-------------------------------------
5.5.24:  2.00 sec, 2.00 sec, 2.02 sec
5.1.62:  1.98 sec, 1.98 sec, 2.00 sec
5.6.5:   2.00 sec, 2.00 sec, 1.98 sec
5.0.96:  1.97 sec, 1.97 sec, 1.97 sec

Results for utf8 (set names utf8 in testcase):
----------------------------------------------
5.5.24:  11.03 sec, 11.02 sec, 11.00 sec
5.1.62:   9.75 sec,  9.70 sec,  9.69 sec
5.6.5:   10.00 sec,  9.83 sec,  9.83 sec
5.0.96:   9.86 sec,  9.86 sec, 10.08 sec

There is no regression here.
The 5.5 and 5.6 client set charset automatically depending on your OS/environment.  This is probably why you have worse results than 5.1 who uses latin1.  Please retest by explicitly setting charset in all versions..
[9 May 2012 14:17] Franjo Markovic
Are those times in miliseconds? The don't seem to make much sense to me (but I don't have access to source code).
Can you please compare the same times when running very similar query that takes 0.01 seconds (instead of 6 seconds):

select benchmark(25000, substring(@str1,2,1));
[10 May 2012 9:00] MySQL Verification Team
Here's the new testcase I used:

	select version();
	set names utf8; # note!
	set @str1 = repeat("a",1000000);
	select benchmark(25000, substring(@str1,2,1));
	select benchmark(25000, substring(@str1,2,1));
	select benchmark(25000, substring(@str1,2,1));

Results for latin1:
-------------------------------------
5.5.24:  2.02 sec, 2.05 sec, 2.00 sec
5.1.62:  2.00 sec, 1.99 sec, 1.97 sec
5.6.5:   2.00 sec, 2.00 sec, 1.98 sec
5.0.96:  2.09 sec, 2.09 sec, 2.09 sec

Results for utf8 (set names utf8 in testcase):
----------------------------------------------
5.5.24:  2.00 sec, 2.00 sec, 2.00 sec
5.1.62:  1.99 sec, 1.99 sec, 1.99 sec
5.6.5:   2.02 sec, 2.02 sec, 1.98 sec
5.0.96:  1.98 sec, 1.99 sec, 1.98 sec

It looks like this report is simply a feature request to improve the performance of utf8 string processing.  For long strings, latin1 does a straight memcpy, where as utf8 must change each char one at a time.
[10 May 2012 14:51] Franjo Markovic
We are getting closer. Those 2 seconds you reported is probably just the time needed to create @str1 value.
Yes, the issue is about utf8 only. Why does it have to go one char at a time? It's like a doing full table scan each time to retrieve one record (letter). It seems to me that the storage of utf8 characters in memory is inadequate. Most systems store unicode characters as fixed 3 bytes in memory, to avoid such counting. Does mysql not do it?
[8 Jan 2020 22:50] Roy Lyseng
Posted by developer:
 
The performance problem is not reproducible in MySQL 8.0.18.

The below table shows performance for latin1, utf8mb3 (utf) and utfmb4 in 5.7 and 8.0.
As seen, performance for utf8mb4 is almost as good as for latin1.
utf8mb4 is the recommended replacement for utf8mb3 in MySQL 8.0.

Results for latin1 (100000 iterations)
-------------------------------------
5.7.29 6.41, 6.39, 6.37
8.0.18 6.31, 6.31, 6.30

Results for utf8mb3 (100000 iterations)
---------------------------------------
5.7.29 18.10, 18.04, 18.04
8.0.18 18.00, 18.00, 17.98

Results for utf8mb4 (100000 iterations)
---------------------------------------
5.7.29 23.93, 23.93, 23.92
8.0.18 6.56, 6.56, 6.55