Author 
Topic 

towardabettercountry
Starting Member
26 Posts 
Posted  01/22/2013 : 12:24:06

Hi All,
I'm trying to calculate median, lower quartile, and upper quartile for a box plot chart using SQL.
Here is some sample data:
CREATE TABLE EStats ( PersonID VARCHAR(30) NOT NULL, Grade VARCHAR(25) NOT NULL, CourseDate Date NOT NULL ) ;
INSERT INTO EStats ( PersonID, Grade, CourseDate )
VALUES ('100', '91', '20100301'), ('101', '96', '20100301'), ('102', '88', '20100301'), ('103', '92', '20100301'), ('104', '81', '20100301'), ('105', '85', '20100301'), ('106', '91', '20100301'), ('107', '89', '20100301'), ('108', '99', '20100301'), ('109', '88', '20100301'), ('110', '81', '20110302'), ('111', '77', '20110302'), ('112', '88', '20110302'), ('113', '76', '20110302'), ('114', '69', '20110302'), ('115', '70', '20110302'), ('116', '75', '20110302'), ('117', '88', '20110302'), ('118', '76', '20110302'), ('119', '95', '20120301'), ('120', '96', '20120301'), ('121', '90', '20120301'), ('122', '80', '20120301'), ('123', '85', '20120301'), ('124', '94', '20120301'), ('125', '89', '20120301'), ('126', '97', '20120301'), ('127', '94', '20120301'), ('128', '72', '20120301'), ('129', '88', '20120301'), ('130', '91', '20120301')
This query calculates the median:
SELECT * FROM EStats
SELECT CourseDate, AVG(CAST(Grade AS Numeric)) AS Median FROM ( SELECT CourseDate, Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats )x WHERE RowAsc IN (RowDesc, RowDesc  1, RowDesc + 1) GROUP BY CourseDate ORDER BY CourseDate
However, I don't know how to calculate lower and upper quartiles (in order to put in a box plot or boxandwhisker chart). Can anyone help? 

James K
Flowing Fount of Yak Knowledge
3873 Posts 

towardabettercountry
Starting Member
26 Posts 
Posted  01/22/2013 : 14:19:14

Hi James,
Thanks for the reply. I've played around some with NTILE regarding this solution, and can't seem to get it to work. As I understand it the lower and upper quartiles for box plots are the median of the two halves. I wasn't sure how to get NTILE to deal with variables such as when there are an even amount of numbers.
Any help (especially an example!) would be great!! 


James K
Flowing Fount of Yak Knowledge
3873 Posts 
Posted  01/22/2013 : 15:08:18

If the groups are not divisible by 4, the distribution will not be even. The MSDN example is NTILE(5) when there are 53 rows. First 3 will get 11 and the next two will get 10.
In your example for 2010, where you have 10 rows, what do you want the first quartile boundary to be? 2, 3, 2.5 or something else? 


towardabettercountry
Starting Member
26 Posts 
Posted  01/22/2013 : 15:23:29

Okay, I think I am moving toward a solution. The below is ugly and cumbersome, but its the best I've got right now. However, I noticed my median in off in some places. When I researched the problem, I found that the row_number in my inner query doesn't handle "ties" very well. in other words, it may have something like this:
rn grade 1 88 2 89 4 90 3 90 5 95
You can run this to see what I'm talking about:
SELECT CourseDate, Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats
Notice for CourseDate 20100301 the row_number works as follows:
RowAsc 10 9 8 6 7 5 3 4 2 1
Would you mind suggesting how to fix this?
Additionally, here is what I have that calculates everything for a boxplot. Once I get this row_number issue fixed, it should work.
WITH Q3 AS ( SELECT CourseDate, AVG(CAST(Grade AS Numeric)) AS Median FROM ( SELECT CourseDate, Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats )x WHERE RowAsc IN (RowDesc, RowDesc  1, RowDesc + 1) GROUP BY CourseDate ORDER BY CourseDate ),
Q2 AS ( SELECT x.CourseDate, AVG(CAST(Grade AS Numeric)) AS LowerQuartile FROM ( SELECT Estats.CourseDate, Estats.Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(EStats.CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(Estats.CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate WHERE EStats.Grade < Q3.Median )x WHERE RowAsc IN (RowDesc, RowDesc  1, RowDesc + 1) GROUP BY x.CourseDate ),
Q4 AS ( SELECT x.CourseDate, AVG(CAST(Grade AS Numeric)) AS UpperQuartile FROM ( SELECT Estats.CourseDate, Estats.Grade, ROW_NUMBER() OVER ( PARTITION BY LEFT(EStats.CourseDate, 4) ORDER BY Grade ASC) AS RowAsc, ROW_NUMBER() OVER ( PARTITION BY LEFT(Estats.CourseDate, 4) ORDER BY Grade DESC) AS RowDesc FROM EStats JOIN Q3 on EStats.CourseDate = Q3.CourseDate WHERE EStats.Grade > Q3.Median )x WHERE RowAsc IN (RowDesc, RowDesc  1, RowDesc + 1) GROUP BY x.CourseDate )
SELECT Q3.CourseDate, Q3.Median AS Median, Q2.LowerQuartile, Q4.UpperQuartile, MIN(EStats.Grade) AS Min, MAX(EStats.Grade) AS Max FROM Q3 JOIN Q2 ON Q3.CourseDate = Q2.CourseDate JOIN Q4 ON Q3.CourseDate = Q4.CourseDate JOIN EStats ON Q3.CourseDate = EStats.CourseDate GROUP BY Q3.CourseDate, Q3.Median, Q2.LowerQuartile, Q4.UpperQuartile ORDER BY Q3.CourseDate 


James K
Flowing Fount of Yak Knowledge
3873 Posts 
Posted  01/22/2013 : 17:18:41

Instead of row_number function can you try RANK function or DENSE_RANK function? They work the same way as row_number function, except, would assign the same rank for ties. DENSE_RANK would not leave any gaps, RANK would. Not quite sure which of these would work correctly for you when the value around the median has ties  you will have to experiment. 


towardabettercountry
Starting Member
26 Posts 
Posted  01/22/2013 : 17:54:11

Hi James,
Thanks again for hanging in there with me. Rank and Dense_Rank don't work because the numbering isn't sequential in the case of ties. 


James K
Flowing Fount of Yak Knowledge
3873 Posts 
Posted  01/22/2013 : 19:11:45

To make it consistent for desc and ascending orders, there has to be some other tiebreaker, so the result is determinate. Alternatively, one could do the following  here I am modifying the query that you posted in your last reply: Original query
SELECT
CourseDate,
Grade,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY LEFT(CourseDate, 4)
ORDER BY Grade DESC) AS RowDesc
FROM EStats
 Revised query
;WITH cte AS
(
SELECT *, ROW_NUMBER() OVER (PARTITION BY LEFT(CourseDate, 4) ORDER BY Grade ASC) AS RN
FROM EStats
)
SELECT CourseDate,
Grade,
ROW_NUMBER() OVER(PARTITION BY LEFT(CourseDate, 4) ORDER BY RN ASC) AS
RowAsc,
ROW_NUMBER() OVER(PARTITION BY LEFT(CourseDate, 4) ORDER BY RN DESC) AS
RowDesc
FROM cte



towardabettercountry
Starting Member
26 Posts 
Posted  01/23/2013 : 10:43:12

James,
That is fantastic! Works perfectly, thanks a TON! 


James K
Flowing Fount of Yak Knowledge
3873 Posts 
Posted  01/23/2013 : 13:38:09

You are very welcome  glad to be of help. 



Topic 


