Return to Reader Challenge #1 Solutions (Part II)
Reader Challenge #1 Solutions (Part II)
Written by Bill Graziano on 04 June 2001
This article is the second summary of responses to the Reader Challenge. I'm covering solutions by Michael Price (again), Ilya Zaltsman, John Nowak and Stephen Morris. These are the fastest and cleanest solutions. I'll also cover the first entry that returned the actual values. Thanks again for your entries and I'm busy looking for another query for the next Reader Challenge.
It might be helpful to review the original challenge and the first correct response I received. I had numerous people submit solutions that were pretty similar to what I already published. They are (in order of submission) sriram raghunathan, JP, Marc Yabroff, Jon Scott-Sheldon, Allison Anderson, Balachandar Ganesan, Justin Bigelow (who got me started on this), Eugene Okhrits and John Henry. I also had late entries from sqlnr, Pedro Cardoso and Tim Graham that weren't radically different from the other entries. Merkin also submitted a partial solution but he's written an article for us so he wasn't "officially" eligible.
A Clean Solution
This is a pretty subjective category. I tried to identify a solution that was easy for a novice SQL developer to read and understand. I think in the future I'll present this one first and work my way up to the more complex ones. I thought the "cleanest" solution came from Ilya Zaltsman. He writes:
First of all, thanks for the site -- to me it is the best source of SQL information anywhere!!!
I thought I'd try my hand at answering the reader challenge:
Statistics... The subject I've been trying to forget (especially since I blew $350 in a casino last night)...
I can't comment on Ilya's gambling skills but his SQL skills seem to be pretty good. Here's the query he submitted:
DECLARE @Mean float
DECLARE @StDev float
DECLARE @LowerBound float
DECLARE @UpperBound float
-- Get Mean and StDev
SELECT @Mean = AVG(Ind_Ret),
@StDev = STDEV(Ind_Ret) FROM Samples
-- Get Upper and Lower bounds of the range
SELECT @LowerBound = @Mean - (@StDev * 3),
@UpperBound = @Mean + (@StDev * 3)
-- Count the outliers
SELECT
(SELECT COUNT(Ind_Ret) FROM Samples
WHERE Ind_Ret < @LowerBound) as Outliers_Below,
(SELECT COUNT(Ind_Ret) FROM Samples
WHERE Ind_Ret > @UpperBound) as Outliers_Above
I think this lays out the steps that the query has to go through pretty thoroughly. Ilya also gets bonus points for using comments. This query generates three table scans, which isn't all that bad. I've got solutions with two table scans but I've also got solutions with five table scans.
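For readers who want to see the three steps outside of SQL Server, here's a minimal Python sketch of the same logic (the sample data is invented purely for illustration; `statistics.stdev` is the sample standard deviation, matching T-SQL's STDEV):

```python
import statistics

# Hypothetical data standing in for the Ind_Ret column of the Samples table
samples = [0.5] * 100 + [50.0]

# Step 1: get mean and standard deviation, like AVG() and STDEV()
mean = statistics.mean(samples)
stdev = statistics.stdev(samples)

# Step 2: get the upper and lower bounds of the range
lower_bound = mean - stdev * 3
upper_bound = mean + stdev * 3

# Step 3: count the outliers on each side
outliers_below = sum(1 for r in samples if r < lower_bound)
outliers_above = sum(1 for r in samples if r > upper_bound)
```

Note that each `sum(...)` is a separate pass over the data, just as each subquery in Ilya's version is a separate table scan.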
Returning the Values
John Nowak sent in the first solution that returned the actual values in a single SELECT statement. Here's his solution:
SELECT so.ind_ret AS OverYakLimit,
su.ind_ret AS UnderYakLimit
FROM samples s
LEFT JOIN samples so
ON s.ind_ret = so.ind_ret
AND so.ind_ret > (SELECT ISNULL((AVG(Ind_Ret) + STDEV(Ind_Ret) * 3), 0)
FROM samples)
LEFT JOIN samples su
ON s.ind_ret = su.ind_ret
AND su.ind_ret < (SELECT ISNULL((AVG(Ind_Ret) - STDEV(Ind_Ret) * 3), 0)
FROM samples)
WHERE (ISNULL(so.ind_ret, 0) + ISNULL(su.ind_ret, 0)) > 0
ORDER BY so.ind_ret, su.ind_ret
It generates five table scans but does return all the values. Plus he gets bonus points for a few Yak references.
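Returning the values rather than the counts is straightforward to mirror procedurally. A hedged Python sketch (data and variable names invented for illustration) that produces the outlying values themselves, split into over- and under-limit lists like John's two columns:

```python
import statistics

# Hypothetical data standing in for the Ind_Ret column of the Samples table
samples = [0.5] * 100 + [50.0]

mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
upper = mean + stdev * 3
lower = mean - stdev * 3

# Return the actual outlying values, sorted, like the OverYakLimit
# and UnderYakLimit columns in John's result set
over_limit = sorted(r for r in samples if r > upper)
under_limit = sorted(r for r in samples if r < lower)
```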
Two Table Scans
I don't think this query can be written in a single table scan. You need one full scan to determine the standard deviation and average. You need a second scan to identify records that are more than three standard deviations from the average. I got two solutions with two table scans from Michael Price and Ilya Zaltsman. Michael was first and writes (in his second email):
Well, when driving home last night I was thinking about this some more, and came up with a qry that only uses 2 table scans . . . Now maybe I can get this out of my head <g>, I don't think a single pass on the table is possible, but look forward to the other users responses to see if they came up with a way.
And here's the solution he came up with:
Select Sum( case
when Ind_Ret > NPLimitTop Then 1
Else 0 End ) As AboveNPLimit,
Sum( case
when Ind_Ret < NPLimitBottom Then 1
Else 0 End ) As BelowNPLimit
From Samples, (
Select Avg(Ind_Ret) + (Stdev(Ind_Ret) * 3) AS NPLimitTop,
Avg(Ind_Ret) - (Stdev(Ind_Ret) * 3) AS NPLimitBottom
From Samples ) As NPLimits
You'll notice his tricky use of a Cartesian join to pull NPLimitTop and NPLimitBottom into his CASE statement. Because the derived table returns exactly one row, the cross join simply attaches the two limits to every row of Samples. This is very similar to what he did in the first article I wrote. This solution only uses two table scans, which is as fast as I think it can be done.
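The shape of Michael's query can be sketched in Python (with invented sample data): compute the limits once, the way his one-row derived table does, then make a single counting pass that tallies both sides at the same time, mirroring the two SUM(CASE ...) expressions:

```python
import statistics

# Hypothetical data standing in for the Ind_Ret column of the Samples table
samples = [0.5] * 100 + [50.0]

# First "scan": the one-row derived table producing NPLimitTop / NPLimitBottom
mean = statistics.mean(samples)
stdev = statistics.stdev(samples)
np_limit_top = mean + stdev * 3
np_limit_bottom = mean - stdev * 3

# Second "scan": both SUM(CASE WHEN ... THEN 1 ELSE 0 END) tallies in one pass
above = below = 0
for r in samples:
    above += 1 if r > np_limit_top else 0
    below += 1 if r < np_limit_bottom else 0
```

The point of the trick is that both counts come out of the same pass, so the data is only read twice in total.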
Ilya's solution with two table scans was different and I thought I'd publish it too.
SELECT SUM(Below) as Outliers_Below, -- Count outliers
SUM(Above) as Outliers_Above
FROM
-- Break up high-end and low-end outliers into separate cols
(SELECT CASE
WHEN a.Ind_Ret > b.UpperBound THEN 1
END as Above,
CASE
WHEN a.Ind_Ret < b.LowerBound THEN 1
END as Below
FROM Samples as a,
-- Obtain range boundaries
(SELECT AVG(Ind_Ret)+(3*STDEV(Ind_Ret)) as UpperBound,
AVG(Ind_Ret)-(3*STDEV(Ind_Ret)) as LowerBound
FROM Samples) as b
-- Eliminate non-outlier values
WHERE (a.Ind_Ret > b.UpperBound OR a.Ind_Ret < b.LowerBound)) as Yaks
Stephen Morris submitted an interesting solution: he built an index on the table. Stephen writes:
got sucked in to your Reader Challenge. Added an index to get rid of the table scans & wrote a single select solution - see below
got to get back to work
Here's his statement:
CREATE CLUSTERED INDEX IX_Samples ON Samples
(
ind_ret
) ON [PRIMARY]

SELECT CASE
WHEN ind_ret - average > 0
THEN 'Above' ELSE 'Below'
END as Direction
, isnull(cast(ind_ret as varchar(20)), 'TOTAL') as Ind_ret
, count(*) as CountofSamples
FROM SAMPLES s
, (SELECT AVG(s.Ind_Ret) average
, STDEV(s.Ind_Ret) stddev
FROM SAMPLES s) as AggFnResults
WHERE abs(ind_ret - average) > stddev * 3
GROUP BY
CASE WHEN ind_ret - average > 0 THEN 'Above' ELSE 'Below' END
, ind_ret WITH ROLLUP
ORDER BY 1,2 DESC
His query only generated two table scans before adding the index so it was already pretty efficient. He does have some extra GROUP BY work that generates some additional overhead though. After adding the index the Table Scans were converted to Clustered Index Scans. They seemed to take the exact same amount of processing time that the table scans did, which is what I'd expected.
Obviously in a real environment there would have been multiple tests stored in the same table and indexes would make a huge difference. Of course, in a real environment proper indexes would have been built from the start and all the queries would have used them. So in this case using an index didn't really help or hurt this solution much. But it was interesting to see their effect on the query.
Thanks to everyone that entered. I think I'll do another one of these in a few weeks. I've got a couple of queries that would make good candidates. Thanks for "playing" and I hope you all give it a try next time.