Bug #14545 Median SQL Aggregate function
Submitted: 1 Nov 2005 17:01    Modified: 1 Nov 2005 18:08
Reporter: Joe Genshlea
Status: Verified
Category: MySQL Server: DML    Severity: S4 (Feature request)
Version:    OS: Any (any)
Assigned to:    CPU Architecture: Any
Triage: Triaged: D5 (Feature request)

[1 Nov 2005 17:01] Joe Genshlea
Description:
Hi... I think it would be a great thing for MySQL to be the first DB server (well, the first that I'm aware of) to offer a MEDIAN function among the GROUP BY aggregates.

How to repeat:
The median is the value where 1/2 of the values fall above it and 1/2 of the values fall below it.
[4 May 2007 16:01] Hartmut Holzgraefe
The problem with this is that it requires far more resources than most existing aggregate functions (with the exception of GROUP_CONCAT, but that has built-in limits). Most aggregate function results can be calculated with constant storage:
e.g. MIN() and MAX() only need to carry the smallest/largest value seen so far, SUM() and COUNT() only need to update a single variable for each row in the group, AVG() needs to keep track of a sum and a count ...

MEDIAN() needs to keep an ordered list of all distinct values seen, together with their counts, plus the total row count; only after processing all rows in a GROUP can it traverse the value list and stop at the entry that contains the (total/2)th row.

So MEDIAN()'s memory footprint, in big-O notation, is O(n) in the worst case, while (as far as I can tell) all other existing MySQL aggregate functions, with the exception of GROUP_CONCAT, are O(1).
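
To make the cost concrete, here is a minimal exact-median sketch (assuming MySQL 8.0+ window functions, which postdate this comment, and a placeholder table my_table(grp, my_num)); even written this way, every row of each group has to be sorted and numbered before the one or two middle values can be picked, which is the O(n) behaviour described above:

-- full sort and row numbering per group, then average the middle row(s)
SELECT grp, AVG(my_num) AS median
FROM (
    SELECT grp, my_num,
           ROW_NUMBER() OVER (PARTITION BY grp ORDER BY my_num) AS rn,
           COUNT(*)     OVER (PARTITION BY grp)                 AS cnt
    FROM my_table
) AS ranked
WHERE rn IN (FLOOR((cnt + 1) / 2), CEILING((cnt + 1) / 2))
GROUP BY grp;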
[4 May 2007 16:06] Hartmut Holzgraefe
Add to this that the required memory may exceed what is available to the server, so MEDIAN() would have to be able to move its temporary data to disk (even in cases where the actual GROUP could use an existing index just fine), assuming that returning incomplete results on hitting a given memory limit, as GROUP_CONCAT does, is not an option for a statistics aggregate ...

So an implementation of MEDIAN() that takes care of all this would be both rather complex and potentially inefficient for large data sets, which is probably the reason it's not implemented anywhere yet?
[9 Apr 2009 17:16] Patrick Hollins
Sure, MEDIAN might be inefficient, but if you need it, you need it. The kludges most people come up with are going to be even more inefficient. The best SQL workaround I've seen uses GROUP_CONCAT twice and requires a good set of glasses:

select ( substring_index( substring_index( group_concat( MY_NUM order by MY_NUM ), ',',  ceiling( count(*) / 2 ) ), ',', -1 )
       + substring_index( substring_index( group_concat( MY_NUM order by MY_NUM ), ',', -ceiling( count(*) / 2 ) ), ',',  1 )
       ) / 2 as median
from MY_TABLE;

Build in a Median function and issue a warning when it's used AND the query exceeds long_query_time?
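
One caveat with the GROUP_CONCAT trick above (a hedged note; the table and column names below are placeholders): GROUP_CONCAT truncates its result at group_concat_max_len (1024 bytes by default), which yields a wrong median for larger groups, so the limit usually has to be raised. The same query also works per group:

-- raise the GROUP_CONCAT limit first (session scope; the value is arbitrary)
set session group_concat_max_len = 1024 * 1024;

select MY_GROUP,
       ( substring_index( substring_index( group_concat( MY_NUM order by MY_NUM ), ',',  ceiling( count(*) / 2 ) ), ',', -1 )
       + substring_index( substring_index( group_concat( MY_NUM order by MY_NUM ), ',', -ceiling( count(*) / 2 ) ), ',',  1 )
       ) / 2 as median
from MY_TABLE
group by MY_GROUP;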
[19 Jun 2011 3:25] David Smith
I don't know how the aggregate functions work internally in MySQL, but a median or percentile function should not need to use much memory; after all, it is only picking an item from a list.

If the aggregate functions get passed an array of values, then I can see how they would use a lot of memory. If this were changed so that the median or percentile function had access to the total row count and the current row number, then it would only need to store at most 2 rows to do weighting for percentiles.

If the items were sorted and there were 5 values (count), then you would pick the 3rd item (current row == 3).

I can already do this using 2 queries

set @count=(select count(*) from table);
select * from
(select @rank:=@rank+1 as rank, duration from table, (select @rank:=0) as init
order by duration) as ranked 
where rank = ceiling(@count/2); -- crude median function, replace with appropriate version
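
A hedged refinement of the same idea (placeholder table t(duration)): average the one or two middle rows so that both odd and even row counts work. Note that MySQL does not guarantee the order in which @rank is incremented relative to the ORDER BY in the same SELECT, so on 8.0+ ROW_NUMBER() OVER (ORDER BY duration) is the safer replacement for the user variable:

set @count = (select count(*) from t);
select avg(duration) as median from
  (select @rank := @rank + 1 as rk, duration
   from t, (select @rank := 0) as init
   order by duration) as ranked
where rk in (floor((@count + 1) / 2), ceiling((@count + 1) / 2));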
[16 Nov 21:30] Meiji Kimura
I filed another FR for PERCENTILE_CONT (SQL:2008) as Bug#93234.

If that function is implemented, it can be used to compute a median.
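
For reference, this is how a median looks with the standard ordered-set aggregate (SQL:2008 syntax as found in e.g. Oracle and PostgreSQL; not currently valid in MySQL, and the table/column names are placeholders):

select grp,
       percentile_cont(0.5) within group (order by my_num) as median
from my_table
group by grp;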