MySQL Bugs: #2565: subselect poor performance (shockingly)

Bug #2565	subselect poor performance (shockingly)
Submitted:	29 Jan 2004 17:14	Modified:	18 Feb 2004 3:41
Reporter:	Carson Battlelite	Email Updates:
Status:	Not a Bug	Impact on me:	None
Category:	MySQL Server	Severity:	S1 (Critical)
Version:	4.1.1 and above	OS:	ANY PLATFORM
Assigned to:	Oleksandr Byelkin	CPU Architecture:	Any

Description:
Please see the repeat section.

How to repeat:
I have pasted a perl script (buggy.pl) that demonstrates the problem. 

output from the script is of the following form:
  sysdate 
  query with subquery 
  sysdate 
  query with (subquery expanded by hand) 
  sysdate 

87 secs versus 7 secs seems to imply the subquery is run for every record in the outer query??? 

Of course please read the perl to see that I only create four files, three data files and an sql file. 
Please also read the sql file to see the point of the demonstration. 
Change the values for max_session and/or infs_per_session if needed. 

On a unix box 
./buggy.pl 
/usr/local/mysql/bin/mysql < inf_sess_act.sql 

On my box with the current values in the perl script I get 
SYSDATE() 
2004-01-30 11:47:09 
count(*) 
318182 
SYSDATE() 
2004-01-30 11:48:36 
count(*) 
318182 
SYSDATE() 
2004-01-30 11:48:43 

######################### PERL SCRIPT START ############################## 

#!/usr/bin/perl 

$max_session = 5000; 

$infs_per_session = 140; 

$max_inf = $infs_per_session * $max_session; 

@sessions = (1..$max_session); 
foreach $i (reverse 0 .. @sessions ) { 
   #remove every 13th entry 
   if (!($i % 13)) { 
splice(@sessions, $i, 1); 
   } 
} 

@infs = (1..$max_inf); 
foreach $i (reverse 0 .. @infs ) { 
   #remove every 11th entry 
   if (!($i % 11)) { 
splice(@infs, $i, 1); 
   } 
} 

foreach $i (0..@infs) { 
   $inf_sess{$i} = $sessions[$i % @sessions]; 
} 

@action = (0..9); 
foreach $i (@action) { 
   $act{$i} = (($i + 1)% 2) ? "'GOOD'" : "'BAD'"; 
   push(@good_action, $i) if (($i + 1)% 2) ; 
} 

open(SESSION, "> session.dat"); 
print SESSION join("\n", @sessions),"\n"; 
close SESSION; 

open(INF, "> inf.dat"); 
foreach $a (keys %inf_sess) { 
   print INF "$a,$inf_sess{$a},$action[$a % @action]\n"; 
} 
close INF; 

open(ACTION, "> action.dat"); 
foreach $a (sort {$a <=> $b} keys %act) { 
   print ACTION "$a,$act{$a}\n"; 
} 
close ACTION ; 

open(SCRIPT, "> inf_sess_act.sql"); 
print SCRIPT "USE TEST;\n"; 
print SCRIPT "DROP TABLE IF EXISTS sess;\n"; 
print SCRIPT "DROP TABLE IF EXISTS inf;\n"; 
print SCRIPT "DROP TABLE IF EXISTS act;\n"; 
print SCRIPT "create table sess (id integer, PRIMARY KEY (id));\n"; 
print SCRIPT "create table inf (id integer, session_id integer, action_id integer,PRIMARY KEY (id));\n"; 
print SCRIPT "create table act (id integer, code varchar(20), PRIMARY KEY (id));\n"; 
print SCRIPT "LOAD DATA LOCAL INFILE 'session.dat' INTO TABLE sess FIELDS TERMINATED BY ',';\n"; 
print SCRIPT "LOAD DATA LOCAL INFILE 'inf.dat' INTO TABLE inf FIELDS TERMINATED BY ',';\n"; 
print SCRIPT "LOAD DATA LOCAL INFILE 'action.dat' INTO TABLE act FIELDS TERMINATED BY ',' ENCLOSED BY \"'\";\n"; 
print SCRIPT "SELECT SYSDATE() FROM DUAL;\n"; 
print SCRIPT "SELECT count(*) FROM sess s, inf i WHERE s.id = i.session_id AND i.action_id IN (SELECT id FROM act WHERE code = 'GOOD');\n"; 
print SCRIPT "SELECT SYSDATE() FROM DUAL;\n"; 
print SCRIPT "SELECT count(*) FROM sess s, inf i WHERE s.id = i.session_id AND i.action_id IN (" . join(",", @good_action) . ");\n"; 
print SCRIPT "SELECT SYSDATE() FROM DUAL;\n"; 
close SCRIPT ; 
######################### PERL SCRIPT END ############################## 

Suggested fix:
If there is no functional dependency from the inner query to the outer query then the results should be cached as per the second query in the example.

Have tested this on Windows 2000 - same performance problems

generates three data files and an sql file to load the data

Attachment: buggy.pl (text/plain), 2.21 KiB.

I grabbed 4.1.1a-alpha-nt and had the same performance problems - generated from buggy.pl

I will attempt to add buggy.pl to the files so that people don't have to cut and paste

I have found that rewriting the query not to use subqueries performs just as badly 

i.e.
SELECT count(*) FROM sess s, inf i, act a WHERE s.id = i.session_id AND
i.action_id = a.id and a.code = 'GOOD'

Thank you for bugreport, it was really helpful to testing subqueries 
performance! 
 
Of course list of constants will be faster then using tables (especially 
because MySQL sort constants before serching). I think subqueries should be 
compared with JOINs in this case to be identical. If this query will be  
rewriten with JOINs it will be slowly then subquery. 
 
I have tested following 2 queries: 
 
SELECT count(*) FROM sess s, inf i WHERE s.id = i.session_id AND i.action_id 
IN (SELECT id FROM act WHERE code = 'GOOD'); 
results are (sec.) 7.43, 9.04, 8.33, 8.07 
 
SELECT count(*) FROM sess s, inf i, act WHERE s.id = i.session_id AND 
i.action_id = act.id and act.code = 'GOOD' 
results are (sec.) 9.01, 9.58, 9.66, 10.24 
 
i.e. subquery is faster then join in this case (because of special 
optimisation for subqueries like "value IN (SELECT key_field FROM single_table 
WHERE some_conditions)" because such subqueries are widely used. 
 
In some of 5.x version we plan to add optimisation for IN subqueries with hash 
tables and in this case it will be about same performance as constant list. 
But you can't compare constant list with subquery, because: 
1. subqery you can use everywere, constant list only if you completely sure 
that query with constants fit in maximum length of command 
2. for subqueries all tables locked at one moment (or it will be done alwais 
as one transaction), for variant with constant and autocommit mode you can get 
inconsistent results, because tables can be changed between 2 queries (it is 
not important for you application as I understand but it is still argument :)