Finding the median - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Transact-SQL (2000)

Finding the median

Author

Topic

AskSQLTeam
Ask SQLTeam Question

0 Posts

Posted - 2002-03-12 : 09:40:38

Ian writes "How do I find the median of a set of values in SQL Server"

davidpardoe
Constraint Violating Yak Guru

324 Posts

Posted - 2002-03-12 : 10:29:21

There are several ways - here is my suggestion.

Insert the sorted values into a temporary table with an identity column. Then calculate which row will hold the median value.

I'm feeling generous - here is an sp that takes the table and field as parameters...


if exists (select name from sysobjects where name='find_median' and type ='P')
	drop procedure find_median
Go

CREATE procedure find_median
(
@table varchar(100),
@field varchar(100)
)
as

set nocount on
declare @sql varchar(1000)

create table #ordered_values (id int identity(1,1),field_value float)
set @sql='insert into #ordered_values select '+@field+' from '+@table+' order by '+@field
exec (@sql)

declare @median_row int
declare @even_ind int
select 
	@median_row=cast(count(*)/2 as int)+1,
	@even_ind=count(*)%2
	from #ordered_values

print @even_ind

if @even_ind=1
	set @sql='select field_value from #ordered_values where id='+cast(@median_row as varchar(10))
else 
	begin
	set @sql='declare @value1 float,@value2 float '
	set @sql=@sql+' select @value1=field_value from #ordered_values where id='+cast(@median_row-1 as varchar(10))
	set @sql=@sql+' select @value2=field_value from #ordered_values where id='+cast(@median_row as varchar(10))
	set @sql=@sql+' select (@value1+@value2)/2 '
	end

exec (@sql)

go

Note that if there are an even number of values you add the middle 2 and divide by 2

============
The Dabbler!

jcelko
Esteemed SQL Purist

547 Posts

Posted - 2002-03-25 : 11:54:19

>> How do I find the median of a set of values in SQL Server" <<

The best way to do this is with a statistical package. It will handle rounding errors and al that stuff for you.

In SQL Server, because it is based on physically contigous storage, a cursor gives the best performance. Just do what you would do in a procedural solution: sort the cursor (file), get the row count (record count) and jump to the middle.

Frankly, I like getting a weighted median since it is more representative of central tendencity. That is if the mid point falls between two values, say {2,3} look at the sub-multisets of those values, say {2,2,3,3,3}, and avderge them. That is ((2+3)/2) = 2.5 for the unweighted median, and ((2+2+3+3+3)/5) = 2.6 for the weighted median.

Anatoly Abramovich, Yelena Alexandrova, and Eugene Birger presented a series of articles in SQL Forum magazine on computing the median (SQL Forum 1993, 1994). In SQL-92 we have a CASE expression, which can be used to rewrite their solutions. The authors distinguished between the statistical median, whose value must be a member of the set, and the financial median, whose value is the average of the middle two members of the set. A statistical median exists when there is an odd number of items in the set. If there is an even number of items, you must decide if you want to use the highest value in the lower half (they call this the left median) or the lowest value in the upper half (they call this the right median).

The left statistical median of a unique column can be found with this query, which finds the median weight of parts from the famous Chrs Date sample database.

SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.weight
HAVING SUM(CASE WHEN (P2.weight <= P1.weight) THEN 1 ELSE 0 END)
= (COUNT(*) + 1) / 2;

Changing the direction of the theta test in the HAVING clause will allow you to pick the right statistical median if a central element does not exist in the set. You will also notice something else about the median of a set of unique values: It is usually meaningless. What does the median bin number mean, anyway? A good rule of thumb is that if it does not make sense as an average, it does not make sense as a median.

The statistical median of a column with duplicate values can be found with a query based on the same ideas, but you have to adjust the HAVING clause to allow for overlap; thus, the left statistical median is found by

SELECT P1.weight
FROM Parts AS P1, Parts AS P2
GROUP BY P1.weight
HAVING SUM(CASE WHEN P2.weight <= P1.weight
THEN 1 ELSE 0 END)
>= ((COUNT(*) + 1) / 2)
AND SUM(CASE WHEN P2.weight >= P1.weight
THEN 1 ELSE 0 END)
>= (COUNT(*)/2 + 1);

Notice that here the left and right medians can be the same, so there is no need to pick one over the other in many of the situations where you have an even number of items. Switching the comparison operators in the two CASE expressions will give you the right statistical median.

The author's query for the financial median depends on some Sybase features that cannot be found in other products, so I would recommend using a combination of the right and left statistical medians to return a set of values about the center of the data, and then averaging them, thus:

SELECT AVG(P1.weight)
FROM Parts AS P1, Parts AS P2
HAVING (SUM(CASE WHEN P2.weight <= P1.weight -- left median
THEN 1 ELSE 0 END)
>= ((COUNT(*) + 1) / 2)
AND SUM(CASE WHEN P2.weight >= P1.weight
THEN 1 ELSE 0 END)
>= (COUNT(*)/2 + 1))
OR (SUM(CASE WHEN P2.weight >= P1.weight -- right median
THEN 1 ELSE 0 END)
>= ((COUNT(*) + 1) / 2)
AND SUM(CASE WHEN P2.weight <= P1.weight
THEN 1 ELSE 0 END)
>= (COUNT(*)/2 + 1));

An optimizer may be able to reduce this expression internally, since the expressions involved with COUNT(*) are constants. This entire query could be put into a FROM clause and the average taken of the one or two rows in the result to find the financial median.

I have more on this topic in SQL FOR SMARTIES. To the point it is painful ...

Databases designed for warehouse applications either have built-in median functions or can run these queries fairly fast. Informix has a special "self-join" optimization in its indexes that also helps this kind of query -- the nodes in the index know how many items are above and below them in the table.

--CELKO--
Joe Celko, SQL Guru

davidpardoe
Constraint Violating Yak Guru

324 Posts

Posted - 2002-03-26 : 04:35:48

Ho hum - puts my response somewhat in the shade!

Should it not be Joe Celko - Median Guru!

============================
Chairman of
The NULL Appreciation Society
"Keep NULLs as NULL"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-06-01 : 10:36:27

This is a newer way to calculate both the median and the weighted median with the new windowed functions available in SQL Server 2005 and later.


Declare @Foo table (x INT )
INSERT @Foo VALUES (1), (2), (2), (3), (3), (3)


-- Peso Median 
SELECT AVG(1.0E * x) AS Peso1 
FROM ( 
SELECT x, 
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y 
FROM @Foo 
) AS d 
WHERE y BETWEEN 0 AND 2 

-- Peso Weighted Median 
SELECT SUM(1.0E * y) / SUM(1.0E * t) AS Peso2 
FROM ( 
SELECT SUM(x) OVER (PARTITION BY x) AS y, 
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS z, 
COUNT(*) OVER (PARTITION BY x) AS t 
FROM @Foo 
) AS d 
WHERE z BETWEEN 0 AND 2

N 56°04'39.26"
E 12°55'05.63"

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2010-06-08 : 05:52:12

Do these work in Euro's, Dollars or Yen as well

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources