Author |
Topic |
leshka_uk
Starting Member
3 Posts |
Posted - 2014-06-09 : 11:45:23
|
Dear all, I am trying to find the most efficient way (i.e. avoiding cursors and tmp tables) to calculate Median absolute deviation in a set with window ordering (i.e. breaking long dataset into groups of 10 rows. group 1 -> row 1 to 10; group 2 -> row 2 to 11; etc.). background reading about Median absolute deviationhttp://en.wikipedia.org/wiki/Median_absolute_deviationbut in few words Median absolute deviation is Consider the data (1, 1, 2, 2, 4, 6, 9). It has a median value of 2. The absolute deviations around 2 are (1, 1, 0, 0, 2, 4, 7) which in turn have a median value of 1 (because the sorted absolute deviations are (0, 0, 1, 1, 2, 4, 7)). So the median absolute deviation for this data is 1.Now, in the example above, I have used a small data set, however the real set has over 1 000 000 rows and I would like to calculate Median absolute deviation for every 10 rows ( i.e. using something like over (order by [id] ascrows between 9 preceding and current row)).. any idea's? thanks |
|
leshka_uk
Starting Member
3 Posts |
Posted - 2014-06-09 : 13:50:08
|
this is quick and dirty solution working which needs to be improvedDECLARE @from intDECLARE @step intDECLARE @median decimal(8,5)DECLARE @mad decimal(8,5)SET @from=1 SET @step=7DECLARE @Foo TABLE ( id INT NOT NULL, x INT NOT NULL ) INSERT @Foo VALUES (1,1), (2,1), (3,2), (4,2), (5,4), (6,6), (7,9) DECLARE @counter int SET @counter=(SELECT count(*) FROM @Foo) select @counter WHILE (@from < @counter) BEGIN DROP TABLE #tmp SELECT id,x INTO #tmp FROM ( SELECT id, x, ROW_NUMBER() OVER (ORDER BY id) AS RowNum FROM @Foo) AS T WHERE RowNum >= @from AND RowNum < @from+@step SET @median=(SELECT AVG(1.0E * x) FROM ( SELECT x, 2 * ROW_NUMBER() OVER (ORDER BY x) - COUNT(*) OVER () AS y FROM #tmp ) AS d WHERE y BETWEEN 0 AND 2 ) DROP TABLE #tmp2 select id,x, CASE WHEN x-@median<0 THEN ( x-@median)*-1 ELSE x-@median END AS [x_prime] into #tmp2 from #tmp SET @mad=(SELECT AVG(1.0E * [x_prime]) FROM ( SELECT [x_prime], 2 * ROW_NUMBER() OVER (ORDER BY x_prime) - COUNT(*) OVER () AS y FROM #tmp2 ) AS d WHERE y BETWEEN 0 AND 2 ) select * from #tmp2 SELECT @median AS MEDIAN, @mad AS MAD SET @from=@from+1 END |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2014-06-10 : 14:03:45
|
A set-based approachDECLARE @Sequence INT = 2;DECLARE @Sample TABLE ( ID INT NOT NULL, x FLOAT NOT NULL );INSERT @Sample ( ID, x )VALUES (0, 11), (2, 12), (4, 21), (5, 22), (7, 41), (8, 61), (9, 91);WITH cteMedianAS ( SELECT (ROW_NUMBER() OVER (ORDER BY ID) - 1) / @Sequence AS seq, ID, x FROM @Sample)SELECT seq, MIN(ID) AS minID, MAX(ID) AS maxID, AVG(x) AS MedianINTO #MedianFROM ( SELECT seq, ID, x, 2 * ROW_NUMBER() OVER (PARTITION BY seq ORDER BY x) - COUNT(*) OVER (PARTITION BY seq) AS y FROM cteMedian ) AS dWHERE y BETWEEN 0 AND 2GROUP BY seq;WITH cteDeviationAS ( SELECT (ROW_NUMBER() OVER (ORDER BY ID) - 1) / @Sequence AS seq, ID, x FROM @Sample)SELECT seq, MIN(minID) AS minID, MAX(maxID) AS maxID, MIN(Median) AS Median, AVG(ABS(x - Median)) AS DeviationFROM ( SELECT d.seq, d.ID, d.x, 2 * ROW_NUMBER() OVER (PARTITION BY d.seq ORDER BY ABS(d.x - m.Median)) - COUNT(*) OVER (PARTITION BY d.seq) AS y, m.Median, m.minID, m.maxID FROM cteDeviation AS d INNER JOIN #Median AS m ON m.seq = d.seq ) AS dWHERE y BETWEEN 0 AND 2GROUP BY seqORDER BY seq;DROP TABLE #Median; Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA |
 |
|
leshka_uk
Starting Member
3 Posts |
Posted - 2014-06-11 : 06:54:11
|
quote: Originally posted by SwePeso A set-based approach[code]DECLARE @Sequence INT = 2;
Great!! Liked your approach. but if we have group of 2 elements (i.e.; DECLARE @Sequence INT =2;), then you would need to group your data using the following approach. G1 (0, 11), (2, 12), G2 (2, 12), (4, 21) G3 (4, 21), (5, 22) G4 (5, 22), (7, 41) G5 (7, 41), (8, 61) G6 (8, 61), (9, 91)also to expand the problem, it would be good to compare value of x with lead(x) & lag(x) ans whichever is smaller we compare with some discrepancy threshold and if above then display 'False' in the output set.For example consider grouping by 2 elements (example above) and threshold is set to 4 THENsubset 1 (0, 11)(2, 12)(4, 21)- deviation statistics computed using only 2 elements: (0, 11) & (2, 12))- comparison statistics is computed using 3 elements (lag, x,lead) approachlag(x)=11x=12lead(x)=21check 1: lag(x)=11 < lead(x)=21 so we use lead for the next stepcheck 2: ABS(x-lead(x))> 4, so we display 'False'so the result output for subset 1 isid, x, deviation, x_prime0, 11, 0, 'False'2, 12, 0, 'False'4, 21, 0, 'False'subset 2 (5, 14)(6, 12)(7, 11)- deviation statistics computed using only 2 elements: (5, 14) & (6, 12))- comparison statistics is computed using 3 elements (lag, x,lead) lag(x)=14x=12lead(x)=11check 1: lag(x)=14 > lead(x)=11 so we use lag for the next stepcheck 2: ABS(x-lag(x))< 4, so we display actual xso the result output for subset 2 isid, x, deviation, x_prime5, 14, 1, 146, 12, 1, 127, 11, 1, 11Note: first row will not have lag and the last row will not have lead.so the comparison statistics is made using the simplified approach first element : ABS(x-lead(x))< [threshold] last element : ABS(x-lag(x))< [threshold] thanks for idea's!!! |
 |
|
|
|
|