By definition, a percentile is the score at or below which a particular percentage of scores fall. The code given here was tested on Microsoft SQL Server 2000. Using SQL may be the wrong tool for the job if the number of data points is very large. An alternative is to use Analysis Services which provides access to EXCEL worksheet functions for use in MDX queries.
We’ll present solutions for two different problems. One is to find the pth percentile that a score falls in. The other problem is to find the score (the percentile) that characterizes a given percentage of the distribution.
Find the pth percentile of a given score
Let’s create a table variable containing a set of scores.
Declare @TestScores table (StudentID int, Score int)
insert @TestScores (StudentID, Score) Values (1, 20)
insert @TestScores (StudentID, Score) Values (2, 03)
insert @TestScores (StudentID, Score) Values (3, 40)
insert @TestScores (StudentID, Score) Values (4, 45)
insert @TestScores (StudentID, Score) Values (5, 50)
insert @TestScores (StudentID, Score) Values (6, 20)
insert @TestScores (StudentID, Score) Values (7, 90)
insert @TestScores (StudentID, Score) Values (8, 20)
insert @TestScores (StudentID, Score) Values (9, 11)
insert @TestScores (StudentID, Score) Values (10, 30)
Now let’s find the percentage of scores at or below the score 20.
select CAST(SUM(CASE WHEN Score <= 20
THEN 1 ELSE 0 END) as float)/COUNT(*) as PthPercentileFor20
from @TestScores
PthPercentileFor20
------------------
0.5
A score of 20 is the 50th percentile in the TestScores distribution. In other words, half of the students scored at or below 20. The SQL statement counts the number of scores below 20 then divides by the total number of scores in the table. The CAST forces the division to be a floating point division instead of an integer division.
To simplify, let’s define a user defined function for floating point division.
CREATE FUNCTION dbo.FDIV
(@numerator float,
@denominator float)
RETURNS float
AS
BEGIN
RETURN CASE WHEN @denominator = 0.0
THEN 0.0
ELSE @numerator / @denominator
END
END
GO
Using this function the code can be rewritten as:
select dbo.FDIV(SUM(CASE WHEN Score <= 20
THEN 1 ELSE 0 END), COUNT(*)) as PthPercentileFor20
from @TestScores
This code yields reliable results when the distribution is defined by many samples.
Interpolating
Consider an extreme case where we only have 2 samples with values 3 and 90 defining the distribution. How does a score of 20 rank? The code above would yield 0.5. A “better” estimate of the rank is obtained by linear interpolation.
rank = (20-3)/(90-3) = 17/87 = 0.1954
Let’s create a user defined function to do the linear interpolation.
CREATE FUNCTION dbo.LERP
(@value float, -- between low and high
@low float,
@high float,
@newlow float,
@newhigh float)
RETURNS float -- between newlow and newhigh
AS
BEGIN
RETURN CASE
WHEN @value between @low and @high and @newlow <= @newhigh
THEN @newlow+dbo.FDIV((@value-@low),
@high-@low))*(@newhigh-@newlow)
WHEN @value = @low and @newlow is not NULL THEN @newlow
WHEN @value = @high and @newhigh is not NULL THEN @newhigh
ELSE NULL
END
END
GO
Using this LERP function we can code the above formula as:
select dbo.LERP(20,3,90,0.0,1.0) as rank
Read this as: given a value 20 between 3 and 90 return a value having the same proportion between 0.0 and 1.0. The result returned is 0.1954.
Find the percentile rank of a given score
The percentile rank is useful when the distribution is sparse (defined by relatively few samples). Interpolating between the pth percentile of the nearest samples provides a reasonable estimate of the pth percentile of a given score.
This calculation is modeled after the percentilerank function in EXCEL. Note that the results may not agree with the first solution shown above. However, the percentilerank does have some nice properties. The percentilerank(min score) is 0.0, the percentilerank(max score) is 1.0, and the percentilerank(median score) is 0.5.
declare @val float
set @val = 25
select
@val as val,
dbo.LERP(@val, scoreLT, scoreGE,
dbo.FDIV(countLT-1,countMinus1),
dbo.FDIV(countLT,countMinus1)) as percentrank
from (
select
SUM(CASE WHEN Score < @val
THEN 1 ELSE 0 END) as countLT,
count(*)-1 as countMinus1,
MAX(CASE WHEN Score < @val
THEN Score END) as scoreLT,
MIN(CASE WHEN Score >= @val
THEN Score END) as scoreGE
from @TestScores
) as x1
The derived table makes one scan over the data values to compute some aggregates. The outer select interpolates between the pth percentile of the nearest samples below and above the given value.
The result is 25, 0.5. That means a score of 25 is at the 50th percentile, the median, of the distribution.
Find the percentile (the score) that characterizes a given percentage
This calculation is modeled after the percentile function in EXCEL. It computes the inverse of the percentile rank function except when there are duplicate values in the samples which define the distribution.
declare @pp float
set @pp = .5
select
@pp as factor,
dbo.LERP(max(d), 0.0, 1.0, max(a.Score), max(b.Score)) as percentile
from
(
select floor(kf) as k, kf-floor(kf) as d
from (
select 1+@pp*(count(*)-1) as kf from @TestScores
) as x1
) as x2
join @TestScores a
on
(select count(*) from @TestScores aa
where aa.Score < a.Score) < k
join @TestScores b
on
(select count(*) from @TestScores bb
where bb.Score < b.Score) < k+1
The derived table x2 converts the factor @pp to a position k and a fractional remainder d. Table a is used to find the kth score. Table b is used to find the k+1th score. The outer select interpolates between the kth score and k+1th score based on the remainder d. The max(d) is just a trick. We know derived table x2 will only produce one value for d. We communicate this to SQL by using max(d) to create one value for d in the general case.
The result of this query is .5, 25 indicating that the 50th percentile (the median) of the distribution is the score 25.
References
[1] College Board. SAT Percentiles.
[2] Analysis Services Excel functions.
[3] NIST Engineering Statistics Handbook. Percentiles.
Bob Newstadt is a seasoned project manager, application developer, and algorithm designer. For the past 8 years he has consulted on data quality initiatives and has lead development teams in the financial services industry. He is available to assist on your next SQL Server project. This article (c) Copyright 2004 Bob Newstadt and reprinted with permission.