Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 New to SQL Server Programming
 Median absolute deviation with window ordering

Author  Topic 

leshka_uk
Starting Member

3 Posts

Posted - 2014-06-09 : 11:45:23
Dear all, I am trying to find the most efficient way (i.e. avoiding cursors and tmp tables) to calculate Median absolute deviation in a set with window ordering (i.e. breaking long dataset into groups of 10 rows. group 1 -> row 1 to 10; group 2 -> row 2 to 11; etc.).

background reading about Median absolute deviation
http://en.wikipedia.org/wiki/Median_absolute_deviation

but in few words Median absolute deviation is

Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations around 2 are (1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1.

Now, in the example above, I have used a small data set, however the real set has over 1 000 000 rows and I would like to calculate Median absolute deviation for every 10 rows ( i.e. using something like over (order by [id] asc
rows between 9 preceding and current row))..

any idea's? thanks

leshka_uk
Starting Member

3 Posts

Posted - 2014-06-09 : 13:50:08
this is quick and dirty solution working which needs to be improved


DECLARE @from int
DECLARE @step int
DECLARE @median decimal(8,5)
DECLARE @mad decimal(8,5)

SET @from=1
SET @step=7

DECLARE @Foo TABLE
(
id INT NOT NULL,
x INT NOT NULL
)

INSERT @Foo
VALUES (1,1),
(2,1),
(3,2),
(4,2),
(5,4),
(6,6),
(7,9)


DECLARE @counter int
SET @counter=(SELECT count(*) FROM @Foo)
select @counter

WHILE (@from < @counter)
BEGIN

DROP TABLE #tmp
SELECT id,x INTO #tmp
FROM (
SELECT id, x, ROW_NUMBER() OVER (ORDER BY id) AS RowNum
FROM @Foo) AS T
WHERE RowNum >= @from AND RowNum < @from+@step

SET @median=(SELECT AVG(1.0E * x)
FROM (
SELECT x,
2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y
FROM #tmp
) AS d
WHERE y BETWEEN 0 AND 2
)

DROP TABLE #tmp2
select id,x,
CASE
WHEN x-@median<0 THEN ( x-@median)*-1
ELSE x-@median
END AS [x_prime]
into #tmp2
from #tmp

SET @mad=(SELECT AVG(1.0E * [x_prime])
FROM (
SELECT [x_prime],
2 * ROW_NUMBER() OVER (ORDER BY x_prime) - COUNT(*) OVER () AS y
FROM #tmp2
) AS d
WHERE y BETWEEN 0 AND 2
)

select * from #tmp2
SELECT @median AS MEDIAN, @mad AS MAD


SET @from=@from+1
END

Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-06-10 : 14:03:45
A set-based approach
DECLARE	@Sequence INT = 2;

DECLARE @Sample TABLE
(
ID INT NOT NULL,
x FLOAT NOT NULL
);

INSERT @Sample
(
ID,
x
)
VALUES (0, 11),
(2, 12),
(4, 21),
(5, 22),
(7, 41),
(8, 61),
(9, 91);

WITH cteMedian
AS (
SELECT (ROW_NUMBER() OVER (ORDER BY ID) - 1) / @Sequence AS seq,
ID,
x
FROM @Sample
)
SELECT seq,
MIN(ID) AS minID,
MAX(ID) AS maxID,
AVG(x) AS Median
INTO #Median
FROM (
SELECT seq,
ID,
x,
2 * ROW_NUMBER() OVER (PARTITION BY seq ORDER BY x) - COUNT(*) OVER (PARTITION BY seq) AS y
FROM cteMedian
) AS d
WHERE y BETWEEN 0 AND 2
GROUP BY seq;

WITH cteDeviation
AS (
SELECT (ROW_NUMBER() OVER (ORDER BY ID) - 1) / @Sequence AS seq,
ID,
x
FROM @Sample
)
SELECT seq,
MIN(minID) AS minID,
MAX(maxID) AS maxID,
MIN(Median) AS Median,
AVG(ABS(x - Median)) AS Deviation
FROM (
SELECT d.seq,
d.ID,
d.x,
2 * ROW_NUMBER() OVER (PARTITION BY d.seq ORDER BY ABS(d.x - m.Median)) - COUNT(*) OVER (PARTITION BY d.seq) AS y,
m.Median,
m.minID,
m.maxID
FROM cteDeviation AS d
INNER JOIN #Median AS m ON m.seq = d.seq
) AS d
WHERE y BETWEEN 0 AND 2
GROUP BY seq
ORDER BY seq;

DROP TABLE #Median;



Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

leshka_uk
Starting Member

3 Posts

Posted - 2014-06-11 : 06:54:11
quote:
Originally posted by SwePeso

A set-based approach[code]DECLARE @Sequence INT = 2;





Great!! Liked your approach. but if we have group of 2 elements (i.e.; DECLARE @Sequence INT =2;), then you would need to group your data using the following approach.

G1
(0, 11),
(2, 12),

G2
(2, 12),
(4, 21)

G3
(4, 21),
(5, 22)

G4
(5, 22),
(7, 41)

G5
(7, 41),
(8, 61)

G6
(8, 61),
(9, 91)

also to expand the problem, it would be good to compare value of x with lead(x) & lag(x) ans whichever is smaller we compare with some discrepancy threshold and if above then display 'False' in the output set.

For example consider grouping by 2 elements (example above) and threshold is set to 4 THEN

subset 1
(0, 11)
(2, 12)
(4, 21)

- deviation statistics computed using only 2 elements: (0, 11) & (2, 12))
- comparison statistics is computed using 3 elements (lag, x,lead)

approach

lag(x)=11
x=12
lead(x)=21

check 1: lag(x)=11 < lead(x)=21 so we use lead for the next step
check 2: ABS(x-lead(x))> 4, so we display 'False'

so the result output for subset 1 is

id, x, deviation, x_prime
0, 11, 0, 'False'
2, 12, 0, 'False'
4, 21, 0, 'False'

subset 2
(5, 14)
(6, 12)
(7, 11)

- deviation statistics computed using only 2 elements: (5, 14) & (6, 12))
- comparison statistics is computed using 3 elements (lag, x,lead)

lag(x)=14
x=12
lead(x)=11

check 1: lag(x)=14 > lead(x)=11 so we use lag for the next step
check 2: ABS(x-lag(x))< 4, so we display actual x

so the result output for subset 2 is
id, x, deviation, x_prime
5, 14, 1, 14
6, 12, 1, 12
7, 11, 1, 11

Note: first row will not have lag and the last row will not have lead.

so the comparison statistics is made using the simplified approach
first element : ABS(x-lead(x))< [threshold]
last element : ABS(x-lag(x))< [threshold]

thanks for idea's!!!


Go to Top of Page
   

- Advertisement -