
 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Median Checking Process very slow


Sean_B
Posting Yak Master

111 Posts

Posted - 2009-01-09 : 06:52:30
Hi,

I've written the following SQL to median-check a number of trade values: each trade is compared to the median of the last 100 trades to see if it looks like a realistic value.

---start
declare @ThresholdPercent float; SET @ThresholdPercent = 100

SET @ThresholdPercent = (100 + @ThresholdPercent) /100

declare @Window INT;SET @window = 101
declare @median INT;
SET @window = 100;



SELECT tradeid,tdate,tclose, mclose
FROM
(
SELECT tradeid,tdate,tclose, mclose, RANK() OVER (PARTITION BY tradeid ORDER BY mclose asc) id
from
(
SELECT p.tradeid,p.tdate,p.tclose,x.tdate mdate,x.tclose mclose
FROM #temp p
CROSS apply (SELECT TOP 100 * FROM #temp t
where t.tDate < P.tDate
order BY t.tDate desc) x
) x
) y
WHERE y.id = @window / 2
AND (tclose > (mclose * @ThresholdPercent) OR tclose > (mclose * @ThresholdPercent))


---end

Here's a loop to insert some test data:

--start
IF OBJECT_ID('tempdb..#temp') IS NOT NULL DROP TABLE #temp

CREATE TABLE #temp
(TradeID INT IDENTITY(1,1),tDate DATETIME,tClose DECIMAL(18,6)
)
CREATE CLUSTERED INDEX ix_Date ON #temp(tDate)


DECLARE @c int;SET @c = 0
DECLARE @tDate DATETIME
DECLARE @tclose decimal(18,6)
SELECT @tDate = '01/01/2008'
SELECT @tclose = 10.5

WHILE @c < 10000
BEGIN
INSERT INTO #temp (tDate,tClose) SELECT @tDate,@tclose
SELECT @tDate = DATEADD(minute,5,@tDate)
SELECT @tclose = @tclose + 1

SET @c = @c + 1
END

-----end

This works OK on small numbers, but once I get to checking millions of rows it slows right down.

I've tried separating the inner queries into intermediate stages with indexed temporary tables, but it is still taking 20-plus minutes to check a million rows.

Any suggestions on speeding this up would be appreciated.









Sean

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-01-09 : 15:55:13
A running median of the previous 100 rows over 1M rows is going to be painful no matter how you do it. Is tradeid an unbroken sequence in the same order as tDate? If so, see if this is any better:

BTW, the median of a set with an even number of values is the average of the middle two, so I incorporated that.

declare @ThresholdPercent float
SET @ThresholdPercent = 100
SET @ThresholdPercent = (100 + @ThresholdPercent) / 100

select t.Tradeid
    ,t.tDate
    ,t.tClose
    ,avg(ca.tClose) mClose
from #temp t
cross apply (
    select tradeid
        ,tClose
        ,row_number() over (order by tClose) rn
    from #temp a
    where a.tradeid >= t.tradeid - 100   -- previous 100 trades, not the next 100
        and a.tradeid < t.tradeid
    ) ca
where ca.rn in (50,51)
group by t.Tradeid
    ,t.tDate
    ,t.tClose
having t.tclose > avg(ca.tclose) * @ThresholdPercent
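The even-window median used above (averaging the values at ranks 50 and 51 of a sorted 100-row window) can be sanity-checked with a small Python sketch; Python is used purely for illustration, and the variable names are made up:

```python
import statistics

# Last 100 closes (even count): the median is the average of the
# two middle values, i.e. the rows at 1-based ranks 50 and 51
# once the window is sorted ascending.
closes = [float(v) for v in range(1, 101)]   # 1.0 .. 100.0

ranked = sorted(closes)
middle_two = (ranked[49], ranked[50])        # 0-based indexes for ranks 50 and 51
median = sum(middle_two) / 2

assert median == statistics.median(closes)   # 50.5 for this data
print(median)  # -> 50.5
```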


Be One with the Optimizer
TG

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2009-01-10 : 07:58:25
Sean,

You definitely need to use ROW_NUMBER() not RANK() here: if you have duplicate values the "WHERE y.id = @window / 2" condition may fail with RANK().

Not an error, but there's a curious repeated condition "(tclose > (mclose * @ThresholdPercent) OR tclose > (mclose * @ThresholdPercent))". I'm not sure what you intended here.

I'd be inclined to use an odd window size to sidestep the question of taking the mean of the middle two values!

However, the main problem with the current query is that, for a #temp with 1,000,000 rows, the execution plan will include a sort of a 100,000,000-row intermediate table. By calculating the median within the CROSS APPLY you replace that one huge sort with 1,000,000 sorts of 100 rows each.

SELECT P.tradeid, P.tdate, P.tclose, T.mclose
FROM #temp AS P
CROSS APPLY (
    SELECT T.tradeid, T.tdate, T.tclose AS mclose
    FROM (
        SELECT T.*, ROW_NUMBER() OVER (ORDER BY T.tclose ASC) AS id
        FROM (
            SELECT TOP 101 T.*
            FROM #temp AS T
            WHERE T.tdate < P.tdate
            ORDER BY T.tdate DESC
        ) AS T
    ) AS T
    WHERE T.id = 51
) AS T
WHERE P.tclose > (T.mclose * @ThresholdPercent)
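The per-row window-sort idea in the CROSS APPLY above translates directly outside SQL. Here is a short Python sketch of the same logic (Python just for illustration; the function and variable names are invented): one small 101-row sort per trade, picking the 51st sorted value as the median, then flagging closes above the threshold:

```python
def flag_outliers(closes, window=101, threshold_pct=100.0):
    """Flag trades whose close exceeds the median of the previous
    `window` closes by more than threshold_pct percent.
    Illustrative sketch of the SQL logic above, not SQL Server itself."""
    factor = (100.0 + threshold_pct) / 100.0
    flagged = []
    for i in range(window, len(closes)):
        prev = sorted(closes[i - window:i])   # one small sort per row
        med = prev[window // 2]               # 51st value of 101 (odd window)
        if closes[i] > med * factor:
            flagged.append(i)
    return flagged

# A flat series with one spike: only the spike should be flagged.
series = [10.0] * 300
series[200] = 25.0
print(flag_outliers(series))  # -> [200]
```

Note that using an odd window (101) sidesteps the average-of-two-middle-values question, exactly as suggested above.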

For 1,000,000 rows I get 73 seconds, versus 372 seconds for your original (modified to use ROW_NUMBER(), have the same window size, remove the duplicate condition):

SELECT tradeid, tdate, tclose, mclose
FROM (
    SELECT tradeid, tdate, tclose, mclose,
        ROW_NUMBER() OVER (PARTITION BY tradeid ORDER BY mclose ASC) AS id
    FROM (
        SELECT p.tradeid, p.tdate, p.tclose, x.tdate AS mdate, x.tclose AS mclose
        FROM #temp p
        CROSS APPLY (
            SELECT TOP 101 *
            FROM #temp t
            WHERE t.tdate < p.tdate
            ORDER BY t.tdate DESC
        ) x
    ) x
) y
WHERE y.id = 51
    AND tclose > (mclose * @ThresholdPercent)


Sean_B
Posting Yak Master

111 Posts

Posted - 2009-01-12 : 04:37:50
Thanks, this is a big help; I wouldn't have thought of calculating the median within the CROSS APPLY.

Sean

Sean_B
Posting Yak Master

111 Posts

Posted - 2009-01-12 : 06:50:42
Is there any way to ignore entries where fewer than 100 rows are returned by the SELECT TOP 101? This is only a problem at the start of the dataset, where we need to ignore the first 100 records. Is there any way of doing this within the SELECT TOP 101 subquery?
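The idea being asked about here, skipping any trade that does not yet have a full window of history, can be sketched in Python (illustration only; names are invented, and this is not tested T-SQL; in SQL the equivalent would be something along the lines of only keeping rows where the windowed subquery actually yields all 101 rows):

```python
# Only emit a check when a full window of prior rows is available;
# rows at the start of the data set with a short history are skipped.
def windows_with_full_history(closes, window=101):
    for i, close in enumerate(closes):
        history = closes[max(0, i - window):i]
        if len(history) < window:      # start of the data set:
            continue                   # not enough prior rows, skip
        yield i, close, sorted(history)[window // 2]   # (index, close, median)

rows = list(windows_with_full_history([float(v) for v in range(200)]))
print(rows[0][0])  # -> 101 (first row with a complete 101-row history)
```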

Sean
   
