Return to Computing the Trimmed Mean in SQL ## Computing the Trimmed Mean in SQLWritten by This article by Bob Newstadt presents code to compute a trimmed mean in SQL. The trimmed mean is a more robust version of the simple mean (SQL AVG() aggregate function). It is a useful tool for summarizing ill-behaved real world data. We all use statistics to help us understand the world. Think of batting averages, grade point averages, or the oft-quoted median price of a single family home. However, averages of real word data can be misleading. Two common problems are having too few samples or having wild values known as outliers. A widely applicable technique to deal with these issues is the This article presents several ways to compute a trimmed mean in SQL. Among the solutions is code which yields the mean, the median, or something in between depending on the amount of trimming you specify. These queries have been tested using SQL Server 2000 and may use some non-standard extensions. ## Sample dataLet’s create a table variable containing a set of scores. Declare @TestScores table (StudentID int, Score int) insert @TestScores (StudentID, Score) Values (1, 20) insert @TestScores (StudentID, Score) Values (2, 03) insert @TestScores (StudentID, Score) Values (3, 40) insert @TestScores (StudentID, Score) Values (4, 45) insert @TestScores (StudentID, Score) Values (5, 50) insert @TestScores (StudentID, Score) Values (6, 20) insert @TestScores (StudentID, Score) Values (7, 90) insert @TestScores (StudentID, Score) Values (8, 20) insert @TestScores (StudentID, Score) Values (9, 11) insert @TestScores (StudentID, Score) Values (10, 30) The average is thrown off by the “curve-wrecking” student 7 who earned a score of 90. select avg(cast(score as float)) as mean from @testscores 32.899999999999999 If that high score were 900 the mean would be really out of whack, all due to one far out value. ## Trimming the smallest and largest valuesThis trick will compute the mean excluding the smallest and largest values. select (sum(score)-min(score)-max(score)) / cast(count(*)-2 as float) as meantrimmedby1 from @testscores 29.5 Of course there must be more than 2 scores for this to work. ## Trimming the N smallest and largest valuesThis code removes the smallest N and largest N scores before computing the average. N is a variable set at run time. declare @N int set @N = 3 select @N as N, avg(cast(score as float)) as TrimmedMeanN from @TestScores a where (select count(*) from @TestScores aa where aa.Score <= a.Score) > @N and (select count(*) from @TestScores bb where bb.Score >= a.Score) > @N 3, 26.0 The where clause keeps only the scores which fall between the N largest and the N smallest values. The correlated subqueries in the where clause rank each a.score compared to all scores in @TestScores. Duplicate values are either all removed or all retained. For example, if all scores are the same then none of them will be trimmed no matter what N is. Apply this algorithm only when there are at least 2N scores. ## Trimming the smallest and largest percentileA more general approach is to trim by a fixed percentage instead of a fixed number. Here we trim by a factor between 0.0 and 0.5. Trimming by 0.0 trims nothing yielding the mean. Trimming by .25 discards the scores in the top and bottom quartiles and averages what’s left. Trimming by .5 yields the weighted median. The median is weighted when there are duplicate values. In this example the central values are 20,20,20,30 which average out to 22.5. Compare this to the non-weighted median 25.0 ((20+30)/2). declare @pp float set @pp = .5 select @pp as factor, avg(cast(score as float)) as TrimmedMeanP from @TestScores a where (select count(*) from @TestScores aa where aa.Score <= a.Score) >= (select @pp*count(*) from @TestScores) and (select count(*) from @TestScores bb where bb.Score >= a.Score) >= (select @pp*count(*) from @TestScores) .5, 22.5 This code is similar to the previous query except @N is replaced by @pp*count(*). The relation > was changed to >= so that a factor of .5 generates the weighted median instead of trimming all samples. We can rewrite this solution to improve performance. The following code cuts the number of table scans in half. declare @pp float set @pp = .5 select @pp as factor, sum(cast(score as float)*weight)/sum(weight) as TrimmedMeanP2 from ( select a.score, count(*) as weight from @TestScores a cross join @TestScores b group by a.score having sum(case when b.Score <= a.Score then 1 else 0 end) >= @pp*count(*) and sum(case when b.Score >= a.Score then 1 else 0 end) >= @pp*count(*) ) as x1 .5, 22.5 The @TestScores table is cross joined with itself to permit comparisons of every score in the table ‘a’ with every score in table ‘b’. The results are grouped by a.score. Thus there will be at most one row in the derived table for every distinct value of a.score. In this example there are 3 scores with the value 20 causing the join to evaluate 30 rows (3*10) for that group. The having clause retains those a.score groups near the center of the distribution. The derived table generates a weight with each score proportional to the number of duplicate values there are for that score in the original table. Finally, the outer select calculates the weighted average of the retained grouped scores. ## Trimming using TOPYou may be thinking: Why go to all this trouble when TOP and ORDER BY can easily filter rows from a table? The TOP operator has some limitations which are inconvenient to work around. TOP’s argument N can not be a variable. Until you upgrade to Yukon, the next version of SQL SERVER 2000, you will need to resort to dynamic SQL if N is variable. Many DBAs try to avoid dynamic SQL for security and performance reasons. TOP applies to the whole result set. This makes it hard to compose some complex queries which depend on TOP. Consider computing the trimmed mean of each student’s scores. Using TOP you would need to have a cursor to process each student’s scores separately. The query from the previous section can be extended to handle this problem without using cursors. If just a single result is required and if the amount of trimming is not variable then using TOP may work. Here’s an example of computing the left median using TOP. select top 1 Score as medianByTOP from (select top 50 percent Score from @TestScores order by Score) as x order by Score desc 20 This code takes the max value in the bottom half of the distribution. A 25% trimmed mean using TOP can be coded as: select avg(cast(score as float)) as TrimmedMean25pByTOP from (select top 66.666 percent Score from (select top 75 percent Score from @TestScores order by Score desc) as x order by Score) as y 29.166666666666668 The inner derived table trims the lowest 25%. The outer derived table trims the highest 25% of the original. The select clause averages the middle 50% of the distribution (66.666% of 75%=50%). ## Further Reading[1] Joe Celko’s SQL for Smarties: Advanced SQL programming 2nd Edition. Morgan Kaufman. 2000. [2] Ales Spetic & Jonathan Gennick, Transact-SQL Cookbook. O’Reilly. 2002. Bob Newstadt is a seasoned project manager, application developer, and algorithm designer. For the past 8 years he has consulted on data quality initiatives and has lead development teams in the financial services industry. He is available to assist on your next SQL Server project. This article (c) Copyright 2004 Bob Newstadt and reprinted with permission. |