Computing Percentiles in SQL Server

By Guest Authors on 20 May 2004 | Tags: SELECT


Say you score 570 on the verbal section of the SAT. How well did you do? According to the College Board, 570 falls into the 71st percentile. This means you did better than 71% of the other students who took the test. Percentiles are a useful way to present rankings among a large population. You may want to add percentiles to reports you produce with SQL. This article by Bob Newstadt shows how to calculate a percentile for a given group of scores. It also covers finding a score for a given percentile.

By definition, a percentile is the score at or below which a particular percentage of scores fall. The code given here was tested on Microsoft SQL Server 2000. Using SQL may be the wrong tool for the job if the number of data points is very large. An alternative is to use Analysis Services which provides access to EXCEL worksheet functions for use in MDX queries.

We’ll present solutions for two different problems. One is to find the pth percentile that a score falls in. The other problem is to find the score (the percentile) that characterizes a given percentage of the distribution.

Find the pth percentile of a given score

Let’s create a table variable containing a set of scores.

Declare @TestScores table (StudentID int, Score int)
insert @TestScores (StudentID, Score) Values (1,  20)
insert @TestScores (StudentID, Score) Values (2,  03)
insert @TestScores (StudentID, Score) Values (3,  40)
insert @TestScores (StudentID, Score) Values (4,  45)
insert @TestScores (StudentID, Score) Values (5,  50)
insert @TestScores (StudentID, Score) Values (6,  20)
insert @TestScores (StudentID, Score) Values (7,  90)
insert @TestScores (StudentID, Score) Values (8,  20)
insert @TestScores (StudentID, Score) Values (9,  11)
insert @TestScores (StudentID, Score) Values (10, 30)

Now let’s find the percentage of scores at or below the score 20.

select CAST(SUM(CASE WHEN Score <= 20 
		THEN 1 ELSE 0 END) as float)/COUNT(*) as PthPercentileFor20
from @TestScores

PthPercentileFor20
------------------
0.5

A score of 20 is the 50th percentile in the TestScores distribution. In other words, half of the students scored at or below 20. The SQL statement counts the number of scores below 20 then divides by the total number of scores in the table. The CAST forces the division to be a floating point division instead of an integer division.

To simplify, let’s define a user defined function for floating point division.

CREATE FUNCTION dbo.FDIV 
	(@numerator float, 
	 @denominator float)
RETURNS float
AS
BEGIN
	RETURN CASE WHEN @denominator = 0.0 
		THEN 0.0
		ELSE @numerator / @denominator
		END
END
GO

Using this function the code can be rewritten as:

select dbo.FDIV(SUM(CASE WHEN Score <= 20 
		THEN 1 ELSE 0 END), COUNT(*)) as PthPercentileFor20
from @TestScores

This code yields reliable results when the distribution is defined by many samples.

Interpolating

Consider an extreme case where we only have 2 samples with values 3 and 90 defining the distribution. How does a score of 20 rank? The code above would yield 0.5. A “better” estimate of the rank is obtained by linear interpolation. rank = (20-3)/(90-3) = 17/87 = 0.1954

Let’s create a user defined function to do the linear interpolation.

CREATE FUNCTION dbo.LERP 
	(@value float, -- between low and high
	 @low float,
	 @high float,
	 @newlow float,
	 @newhigh float)
RETURNS float -- between newlow and newhigh
AS
BEGIN
  RETURN CASE 
      WHEN @value between @low and @high and @newlow <= @newhigh
        THEN @newlow+dbo.FDIV((@value-@low),
	  @high-@low))*(@newhigh-@newlow)
      WHEN @value = @low and @newlow is not NULL THEN @newlow
      WHEN @value = @high and @newhigh is not NULL THEN @newhigh
      ELSE NULL
	END
END
GO

Using this LERP function we can code the above formula as:

select dbo.LERP(20,3,90,0.0,1.0) as rank

Read this as: given a value 20 between 3 and 90 return a value having the same proportion between 0.0 and 1.0. The result returned is 0.1954.

Find the percentile rank of a given score

The percentile rank is useful when the distribution is sparse (defined by relatively few samples). Interpolating between the pth percentile of the nearest samples provides a reasonable estimate of the pth percentile of a given score.

This calculation is modeled after the percentilerank function in EXCEL. Note that the results may not agree with the first solution shown above. However, the percentilerank does have some nice properties. The percentilerank(min score) is 0.0, the percentilerank(max score) is 1.0, and the percentilerank(median score) is 0.5.

declare @val float
set @val = 25

select 
	@val as val,
	dbo.LERP(@val, scoreLT, scoreGE, 
		dbo.FDIV(countLT-1,countMinus1), 
		dbo.FDIV(countLT,countMinus1)) as percentrank
from (
	select 
		SUM(CASE WHEN Score < @val 
			THEN 1 ELSE 0 END) as countLT,
		count(*)-1 as countMinus1,
		MAX(CASE WHEN Score < @val
			THEN Score END) as scoreLT,
		MIN(CASE WHEN Score >= @val
			THEN Score END) as scoreGE
		from @TestScores
) as x1

The derived table makes one scan over the data values to compute some aggregates. The outer select interpolates between the pth percentile of the nearest samples below and above the given value.

The result is 25, 0.5. That means a score of 25 is at the 50th percentile, the median, of the distribution.

Find the percentile (the score) that characterizes a given percentage

This calculation is modeled after the percentile function in EXCEL. It computes the inverse of the percentile rank function except when there are duplicate values in the samples which define the distribution.

declare @pp float
set @pp = .5

select 
	@pp as factor, 
	dbo.LERP(max(d), 0.0, 1.0, max(a.Score), max(b.Score)) as percentile
from
	(
	select floor(kf) as k, kf-floor(kf) as d
	from	(
		select 1+@pp*(count(*)-1) as kf from @TestScores
		) as x1
	) as x2
join @TestScores a
on 
(select count(*) from @TestScores aa
	where aa.Score < a.Score) < k
join @TestScores b
on 
(select count(*) from @TestScores bb
	where bb.Score < b.Score) < k+1

The derived table x2 converts the factor @pp to a position k and a fractional remainder d. Table a is used to find the kth score. Table b is used to find the k+1th score. The outer select interpolates between the kth score and k+1th score based on the remainder d. The max(d) is just a trick. We know derived table x2 will only produce one value for d. We communicate this to SQL by using max(d) to create one value for d in the general case.

The result of this query is .5, 25 indicating that the 50th percentile (the median) of the distribution is the score 25.

References

[1] College Board. SAT Percentiles.

[2] Analysis Services Excel functions.

[3] NIST Engineering Statistics Handbook. Percentiles.


Bob Newstadt is a seasoned project manager, application developer, and algorithm designer. For the past 8 years he has consulted on data quality initiatives and has lead development teams in the financial services industry. He is available to assist on your next SQL Server project. This article (c) Copyright 2004 Bob Newstadt and reprinted with permission.


Related Articles

Joining to the Next Sequential Row (2 April 2008)

Writing Outer Joins in T-SQL (11 February 2008)

How to Use GROUP BY with Distinct Aggregates and Derived tables (31 July 2007)

How to Use GROUP BY in SQL Server (30 July 2007)

SQL Server 2005: Using OVER() with Aggregate Functions (21 May 2007)

Server Side Paging using SQL Server 2005 (4 January 2007)

Using XQuery, New Large DataTypes, and More (9 May 2006)

Counting Parents and Children with Count Distinct (10 January 2006)

Other Recent Forum Posts

Separate values which are not delimited (6h)

Finding Number of Virtual Cores Assigned to SQL Guests (1d)

Basic SQL Query question (1d)

SQL Condition by user id (1d)

Count occurrences by time (2d)

AlwaysOn AG + Replication maintenance - two scenarios to get the job done (7d)

What happens in a dual LEFT OUTER join when the second join is NULL in both tables? (7d)

How to set a variable from a table with comma? (8d)

- Advertisement -