## Reader Challenge #1 Solutions (Part II)

This article is the second summary of responses to the Reader Challenge. It might be helpful to review the original challenge and the first correct response I received. I had numerous people submit solutions that were pretty similar to what I'd already published. They are (in order of submission): sriram raghunathan, JP, Marc Yabroff, Jon Scott-Sheldon, Allison Anderson, Balachandar Ganesan, Justin Bigelow (who got me started on this), Eugene Okhrits and John Henry. I also had late entries from sqlnr, Pedro Cardoso and Tim Graham that weren't radically different from the other entries. Merkin also submitted a partial solution, but he's written an article for us, so he wasn't "officially" eligible.

## A Clean Solution

This is a pretty subjective category. I tried to identify a solution that was easy for a novice SQL developer to read and understand. I think in the future I'll present this one first and work my way up to the more complex entries. I thought the "cleanest" solution came from Ilya.
I can't comment on Ilya's gambling skills, but his SQL skills seem to be pretty good. Here's the query he submitted:

```sql
DECLARE @Mean float
DECLARE @StDev float
DECLARE @LowerBound float
DECLARE @UpperBound float

-- Get Mean and StDev
SELECT @Mean = AVG(Ind_Ret),
       @StDev = STDEV(Ind_Ret)
FROM Samples

-- Get Upper and Lower bounds of the range
SELECT @LowerBound = @Mean - (@StDev * 3),
       @UpperBound = @Mean + (@StDev * 3)

-- Count the outliers
SELECT (SELECT COUNT(Ind_Ret) FROM Samples WHERE Ind_Ret < @LowerBound) AS Outliers_Below,
       (SELECT COUNT(Ind_Ret) FROM Samples WHERE Ind_Ret > @UpperBound) AS Outliers_Above
```

I think this lays out the steps the query has to go through pretty thoroughly. Ilya also gets bonus points for using comments. This query generates three table scans, which isn't all that bad. I've got solutions with two table scans, but I've also got solutions with five.

## Returning the Values
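The next query returns the outlying values themselves rather than just counting them. As a point of comparison, here's that idea sketched in Python (the function and variable names are mine, not from any submission; `statistics.stdev` is the sample standard deviation, the same as T-SQL's STDEV):

```python
import statistics

def outlier_values(samples):
    """Return the outlying values themselves, split into those over
    and under the mean +/- three standard deviations."""
    mean = statistics.mean(samples)
    limit = 3 * statistics.stdev(samples)  # sample standard deviation, like T-SQL STDEV
    over = sorted(v for v in samples if v > mean + limit)
    under = sorted(v for v in samples if v < mean - limit)
    return over, under
```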
```sql
SELECT so.ind_ret AS OverYakLimit,
       su.ind_ret AS UnderYakLimit
FROM samples s
LEFT JOIN samples so
    ON s.ind_ret = so.ind_ret
    AND so.ind_ret > (SELECT ISNULL(AVG(Ind_Ret) + STDEV(Ind_Ret) * 3, 0) FROM Samples)
LEFT JOIN samples su
    ON s.ind_ret = su.ind_ret
    AND su.ind_ret < (SELECT ISNULL(AVG(Ind_Ret) - STDEV(Ind_Ret) * 3, 0) FROM Samples)
WHERE (ISNULL(so.ind_ret, 0) + ISNULL(su.ind_ret, 0)) > 0
ORDER BY so.ind_ret, su.ind_ret
```

It generates five table scans but does return all the values. Plus he gets bonus points for a few Yak references.

## Two Table Scans

I don't think this query can be written in a single table scan. You need one full scan to determine the standard deviation and average, and a second scan to identify records that are more than three standard deviations from the average. I got two solutions that used only two table scans.
And here's the solution he came up with:

```sql
SELECT SUM(CASE WHEN Ind_Ret > NPLimitTop THEN 1 ELSE 0 END) AS AboveNPLimit,
       SUM(CASE WHEN Ind_Ret < NPLimitBottom THEN 1 ELSE 0 END) AS BelowNPLimit
FROM Samples,
     (SELECT AVG(Ind_Ret) + (STDEV(Ind_Ret) * 3) AS NPLimitTop,
             AVG(Ind_Ret) - (STDEV(Ind_Ret) * 3) AS NPLimitBottom
      FROM Samples) RangeData
```

You'll notice his tricky use of a cartesian join to pull NPLimitTop and NPLimitBottom into his CASE statement. Since the derived table returns exactly one row, the cross join simply attaches the two limits to every row of Samples. This is very similar to what he did in the first article I wrote. This solution uses only two table scans, which is as fast as I think it can be done.
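The two-scan pattern is easy to see outside of SQL as well. Here's a minimal Python sketch of the same logic (the names are my own, not from any submission): compute the limits first, then do the conditional counts, mirroring the SUM(CASE ...) pattern above:

```python
import statistics

def count_outliers(samples):
    """Count values more than three standard deviations from the mean.
    Conceptually the same two scans as the SQL: one pass to compute
    the limits, one pass for the conditional counts."""
    mean = statistics.mean(samples)
    spread = 3 * statistics.stdev(samples)  # sample standard deviation, like T-SQL STDEV
    top, bottom = mean + spread, mean - spread
    above = sum(1 for v in samples if v > top)     # SUM(CASE WHEN ... THEN 1 ELSE 0 END)
    below = sum(1 for v in samples if v < bottom)
    return above, below
```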
```sql
SELECT SUM(Below) AS Outliers_Below,  -- Count outliers
       SUM(Above) AS Outliers_Above
FROM (SELECT -- Break up high-end and low-end outliers into separate cols
             CASE WHEN a.Ind_Ret > b.UpperBound THEN 1 ELSE 0 END AS Above,
             CASE WHEN a.Ind_Ret < b.LowerBound THEN 1 ELSE 0 END AS Below
      FROM Samples AS a,
           -- Obtain range boundaries
           (SELECT AVG(Ind_Ret) + (3 * STDEV(Ind_Ret)) AS UpperBound,
                   AVG(Ind_Ret) - (3 * STDEV(Ind_Ret)) AS LowerBound
            FROM Samples) AS b
      -- Eliminate non-outlier values
      WHERE (a.Ind_Ret > b.UpperBound OR a.Ind_Ret < b.LowerBound)) AS Yaks
```

## An Index?
Here's his statement:

```sql
CREATE CLUSTERED INDEX IX_Samples ON Samples (Ind_Ret) ON [PRIMARY]

SELECT CASE WHEN ind_ret - average > 0 THEN 'Above' ELSE 'Below' END AS Direction,
       ISNULL(CAST(ind_ret AS varchar(20)), 'TOTAL') AS Ind_ret,
       COUNT(*) AS CountOfSamples
FROM SAMPLES
CROSS JOIN (SELECT AVG(s.ind_ret) average,
                   STDEV(s.Ind_Ret) stddev
            FROM SAMPLES s) AS AggFnResults
WHERE ABS(ind_ret - average) > stddev * 3
GROUP BY CASE WHEN ind_ret - average > 0 THEN 'Above' ELSE 'Below' END,
         ind_ret WITH ROLLUP
ORDER BY 1, 2 DESC
```

His query only generated two table scans before adding the index, so it was already pretty efficient. He does have some extra GROUP BY work that generates additional overhead, though. After adding the index, the table scans were converted to clustered index scans. They took the same amount of processing time as the table scans did, which is what I'd expected. Obviously, in a real environment there would have been multiple tests stored in the same table, and indexes would make a huge difference. Of course, in a real environment proper indexes would have been built from the start and all the queries would have used them. So in this case the index didn't really help or hurt this solution much, but it was interesting to see its effect on the query.

Thanks to everyone that entered. I think I'll do another one of these in a few weeks; I've got a couple of queries that would make good candidates. Thanks for "playing" and I hope you all give it a try next time.