Splitting Ordered Groups into Ordered Subgroups

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Transact-SQL (2008)

Splitting Ordered Groups into Ordered Subgroups

Author

Topic

LaurieCox

158 Posts

Posted - 2012-08-17 : 13:01:53

I discovered an anomaly in the solution given by sunitabeck when I ran his solution against actual data rather than the test data. See my new post below.

I have the following table:


CREATE TABLE TestTable(
	ODE_Id int NOT NULL,
	GroupId int NULL,
	OrderNumber int NULL,
	event varchar(10) NULL,
	DrugName varchar(20) NULL,
	NumDosages int NULL  
   )

(See end of post for data statements)

A group denotes a series of events. The first event in each group is Active followed by a series of Change and Order events. The order of events within a group is denoted by OrderNumber.

My problem is that all groups must only reference a single drug name and I have found groups where the drug name changes. I therefore must split these groups into sub groups when the drug name changes.

So given this data for three groups:


ODE_Id      GroupId     OrderNumber event      DrugName             NumDosages  
----------- ----------- ----------- ---------- -------------------- ----------- 
29199       3191        1           Active     Trazodone (Bulk)     1           
29201       3191        2           Change     Trazodone            1           
29203       3191        3           Reorder    Trazodone            1           
29207       3191        4           Reorder    Trazodone            1           
29208       3191        5           Change     Trazodone            1           

20156       23523       1           Active     Wellbutrin SR        1           
20158       23523       2           Reorder    Wellbutrin SR        1           
20159       23523       3           Change     Wellbutrin           1           
20157       23523       4           Change     Wellbutrin SR        1           

24138       8615        1           Active     Hydroxyzine HCl      2           
24139       8615        1           Active     Hydroxyzine HCl      2           
24136       8615        2           Change     Vistaril             2           
24137       8615        2           Change     Vistaril             2           
24134       8615        3           Reorder    Vistaril             2           
24135       8615        3           Reorder    Vistaril             2

Group 3191 needs to be split into 2 groups.
Group 23523 needs to be split into 3 groups.
Group 8615 needs to be split into 2 groups. This group also has multi dosages orders. This means that for a given OrderNumber there is more than one record (number of records will equal NumDosages).

So what I would like to do is add a column to the table called SubGroup and then write a query that would assign SubGroup numbers to split the groups.

So the result for the example given above could be:


ODE_Id      GroupId     OrderNumber event      DrugName             NumDosages  SubGroup
----------- ----------- ----------- ---------- -------------------- ----------- -----------
29199       3191        1           Active     Trazodone (Bulk)     1           1
29201       3191        2           Change     Trazodone            1           2
29203       3191        3           Reorder    Trazodone            1           2
29207       3191        4           Reorder    Trazodone            1           2
29208       3191        5           Change     Trazodone            1           2

20156       23523       1           Active     Wellbutrin SR        1           3
20158       23523       2           Reorder    Wellbutrin SR        1           3
20159       23523       3           Change     Wellbutrin           1           4
20157       23523       4           Change     Wellbutrin SR        1           5

24138       8615        1           Active     Hydroxyzine HCl      2           6
24139       8615        1           Active     Hydroxyzine HCl      2           6
24136       8615        2           Change     Vistaril             2           7
24137       8615        2           Change     Vistaril             2           7
24134       8615        3           Reorder    Vistaril             2           7
24135       8615        3           Reorder    Vistaril             2           7

Notes on the values in SubGroup column:

Within a Group the SubGroup numbers should be ordered so that if I sort the group by SubGroup, OrderNumber the records will order correctly.

SubGroup should be unique across groups. So from the example above if Group 3191 has SubGroups 1 and 2 no other group should have a SubGroup of 1 or 2.

Other than the preceding two rules I don't care what SubGroup values any given group is assigned.

At the moment I am not really sure where to start. I will of course be working on the problem and will post any attempts that I come up with.

Laurie

Test Data (it includes the examples given above plus more groups)


insert into TestTable
Select 23989,1985,1,'Active','Hydroxyzine HCl',1 Union All
Select 23990,1985,2,'Change','Vistaril',1 Union All
Select 23991,1985,3,'Discontinu','Vistaril',1 Union All
Select 29199,3191,1,'Active','Trazodone (Bulk)',1 Union All
Select 29201,3191,2,'Change','Trazodone',1 Union All
Select 29203,3191,3,'Reorder','Trazodone',1 Union All
Select 29207,3191,4,'Reorder','Trazodone',1 Union All
Select 29208,3191,5,'Change','Trazodone',1 Union All
Select 3820,5423,1,'Active','Hydroxyzine HCl',1 Union All
Select 3954,5423,2,'Reorder','Hydroxyzine HCl',1 Union All
Select 3979,5423,3,'Reorder','Hydroxyzine HCl',1 Union All
Select 4054,5423,4,'Change','Vistaril',2 Union All
Select 4055,5423,4,'Change','Vistaril',2 Union All
Select 4114,5423,5,'Reorder','Vistaril',2 Union All
Select 4115,5423,5,'Reorder','Vistaril',2 Union All
Select 24138,8615,1,'Active','Hydroxyzine HCl',2 Union All
Select 24139,8615,1,'Active','Hydroxyzine HCl',2 Union All
Select 24136,8615,2,'Change','Vistaril',2 Union All
Select 24137,8615,2,'Change','Vistaril',2 Union All
Select 24134,8615,3,'Reorder','Vistaril',2 Union All
Select 24135,8615,3,'Reorder','Vistaril',2 Union All
Select 3927,10231,1,'Active','Hydroxyzine HCl',2 Union All
Select 3928,10231,1,'Active','Hydroxyzine HCl',2 Union All
Select 3962,10231,2,'Change','Vistaril',2 Union All
Select 3963,10231,2,'Change','Vistaril',2 Union All
Select 4024,10231,3,'Reorder','Vistaril',2 Union All
Select 4025,10231,3,'Reorder','Vistaril',2 Union All
Select 26016,12911,1,'Active','Hydroxyzine HCl',1 Union All
Select 26014,12911,2,'Reorder','Hydroxyzine HCl',1 Union All
Select 26015,12911,3,'Change','Vistaril',1 Union All
Select 26011,12911,4,'Reorder','Vistaril',1 Union All
Select 26012,12911,5,'Change','Hydroxyzine HCl',1 Union All
Select 26013,12911,6,'Change','Hydroxyzine HCl',1 Union All
Select 26017,12911,7,'Reorder','Hydroxyzine HCl',1 Union All
Select 3843,15582,1,'Active','Hydroxyzine HCl',1 Union All
Select 3812,15582,2,'Change','Vistaril',1 Union All
Select 3813,15582,3,'Discontinu','Vistaril',1 Union All
Select 20156,23523,1,'Active','Wellbutrin SR',1 Union All
Select 20158,23523,2,'Reorder','Wellbutrin SR',1 Union All
Select 20159,23523,3,'Change','Wellbutrin',1 Union All
Select 20157,23523,4,'Change','Wellbutrin SR',1

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2012-08-17 : 13:25:59

[code]
select *, dense_rank() over (ORDER BY GroupId,DrugName) AS SubGroup
FROM TestTable
[/code]

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-17 : 14:03:16

May be this - but it may not order correctly in your second example (GroupId = 23523) where the DrugName changes to something and changes back to the original again. But if you order by GroupId and OrderNumber it would/should order correctly.

SELECT
	*,
	ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
	ROW_NUMBER() OVER (PARTITION BY drugname,groupId ORDER BY groupid,ordernumber) AS SubGroup
FROM
	testtable
ORDER BY
	groupid, ordernumber;

LaurieCox

158 Posts

Posted - 2012-08-17 : 14:47:35

Hi visakh16, sunitabeck

Thanks for the replies. Unfortunately visakh16, your solution does not work for where the drug name changes and then changes back and also does not necessarily order the SubGroups correctly

Sunitabeck yours does divide the groups into the right number of groups but does not (as you pointed out) order SubGroups correctly.

This may or may not be a problem. Because this was only step one in the process. After I have computed the SubGroup I need to do the following:

Renumber the OrderNumber column so that it starts at one for each SubGroup

Change the first event in each SubGroup to Active

Change the last event in each SubGroup (except for the last Subgroup within a group) to Discontinu

Update a comment column (I didn't include it in my TestTable as it wasn't relevant for this initial step) for the first record (determined by OrderNumber) within a given SubGroup that follows a SubGroup (within a given group) that says that this is a continuation of the preceding SubGroup

Step 3 and Step 4 both require that I know the proper order of the SubGroups within a group. I even have code that does this for a similar problem and it depends on the SubGroups being ordered correctly. I think I will work on this code and post it here as somebody maybe able to modify it so it does not depend on SubGroup order (though I am not sure how that would be possible).

Plus I will read up on both dense_rank and partitions.

Again thanks for the help,

Laurie

Edited because when I added a new step 1 to the list above I didn't change the references to the steps in the following text.

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-17 : 15:34:00

The four updates you described seem to be easily doable based on the ordering of the current OrderNumber; so you won't need to rely on the ordering of the SubGroup.

If you really want to, you can get the SubGroup to be ordered as you described - for example like shown below, but as you can see, that requires a whole another sub-query (actually OUTER APPLY) and resulting performance hit.

SELECT
	t1.*,
	ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
	ROW_NUMBER() OVER (PARTITION BY drugname,groupId,COALESCE(N,0) ORDER BY groupid,ordernumber) AS SubGroup
FROM
	testtable t1
	OUTER APPLY
	(
		SELECT TOP 1 
			OrderNumber AS N
		FROM 
			testtable t2 
		WHERE
			t2.GroupId = t1.GroupId
			AND t2.OrderNumber < t1.OrderNumber
			AND t2.DrugName <> t1.DrugName
		ORDER BY 
			t2.orderNumber DESC 
	) t2
ORDER BY
	groupid, ordernumber;

LaurieCox

158 Posts

Posted - 2012-08-17 : 15:52:24

Hi sunitabeck,

Thanks for your reply. That looks like it works perfectly. I will probably go with it as this is a one time thing. I am converting data from one medication prescribing system to another, so I am more interested in saving my development time than processing time. As I said I already have the script written that assumes that the SubGroups are ordered for a similar problem.

Plus it has the added benefit of adding to the list of problems that I posted here that turned out to have the cross and/or outer apply clause as part of the solution.

Again thank you,

Laurie

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-17 : 15:58:37

Glad it worked out - I can relate very well to "saving developer time rather than computer time" :)

LaurieCox

158 Posts

Posted - 2012-10-12 : 15:07:14

I discovered an anomaly in the solution given by sunitabeck above when I ran it against actual data rather than the test data given in the op. I have figured out where the problem was occurring and how to fix, but not exactly why. I am interested in the why.

The problem is when running the solution above against the actual data (30000 rows/55 columns) I get a different result for the SubGroup value if my select statement selects on all columns (t1.* - which gives me the 'correct' values) than if it selects on a subset of the columns (which gives me incorrect values).

So in order to figure out what was happening I modified sunitabeck's solution to give me separate row_number values that make up the SubGroup field (GroupIdRN - MedNameRN):


SELECT t1.*
	, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber) as GroupIdRN
	, ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
	     ORDER BY groupid,ordernumber) as MedNameRn
	, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
 	    ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
 	     ORDER BY groupid,ordernumber) AS SubGroup
INTO ISC_SplitResultsAllColumns 
FROM
	ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
	OUTER APPLY
	(
		SELECT TOP 1 
			OrderNumber AS N
		FROM 
			ISC_ODE_NotProcessedBeforeSplitChainUpdate t2 
		WHERE
			t2.GroupId = t1.GroupId
			AND t2.OrderNumber < t1.OrderNumber
			AND t2.MedicationNameId <> t1.MedicationNameId
		ORDER BY 
			t2.orderNumber DESC 
	) t2
ORDER BY
	groupid, ordernumber;

When I select from the solution table for just the interesting columns I get this result:


GroupId OrderNumber event      NumDosages  MedicationNameId ODE_Id GroupIdRN  MedNameRn SubGroup
------- ----------- ---------- ----------- ---------------- ------ ---------- --------- --------
8615    1           Active     2           3721             24138  6981       1         6980
8615    1           Active     2           3721             24139  6982       2         6980

8615    2           Change     2           53               24136  6983       1         6982
8615    2           Change     2           53               24137  6984       2         6982
8615    3           Reorder    2           53               24134  6985       3         6982
8615    3           Reorder    2           53               24135  6986       4         6982
8615    4           Reorder    2           53               24130  6987       5         6982
8615    4           Reorder    2           53               24132  6988       6         6982
8615    5           Discontinu 2           53               24133  6989       7         6982
8615    5           Discontinu 2           53               24131  6990       8         6982

This successfully split the group into two groups; splitting them when the MedicationNameId changed.

But when I run this query (selecting only on a few of the columns):


SELECT t1.GroupId
	, t1.OrderNumber
	, t1.event
	, t1.NumDosages
	, t1.MedicationNameId
	, t1.ODE_Id
	, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber) as GroupIdRN
	, ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
	          ORDER BY groupid,ordernumber) as MedNameRN
	, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
 	     ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
 	          ORDER BY groupid,ordernumber) AS SubGroup
into ISC_SplitResultsSelectedColumns
FROM ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
	OUTER APPLY
	(
		SELECT TOP 1 
			OrderNumber AS N
		FROM 
			ISC_ODE_NotProcessedBeforeSplitChainUpdate t2 
		WHERE
			t2.GroupId = t1.GroupId
			AND t2.OrderNumber < t1.OrderNumber
			AND t2.MedicationNameId <> t1.MedicationNameId
		ORDER BY 
			t2.orderNumber DESC 
	) t2
ORDER BY
	groupid, ordernumber;

I get this result:


GroupId OrderNumber event      NumDosages  MedicationNameId ODE_Id GroupIdRN MedNameRN SubGroup
------- ----------- ---------- ----------- ---------------- ------ --------- --------- --------
8615    1           Active     2           3721             24139  6982      2         6981

8615    1           Active     2           3721             24138  6981      1         6979

8615    2           Change     2           53               24136  6983      2         6982
8615    2           Change     2           53               24137  6984      1         6982
8615    3           Reorder    2           53               24135  6986      4         6982
8615    3           Reorder    2           53               24134  6985      3         6982
8615    4           Reorder    2           53               24132  6987      6         6982
8615    4           Reorder    2           53               24130  6988      5         6982
8615    5           Discontinu 2           53               24133  6990      8         6982
8615    5           Discontinu 2           53               24131  6989      7         6982

Splitting it into three groups. The difference is found in the GroupIdRN and MedNameRN columns:


All columns selected:
ODE_Id GroupIdRN  MedNameRn SubGroup
------ ---------- --------- --------
24138  6981       1         6980
24139  6982       2         6980


Columns restricted:
ODE_Id GroupIdRN MedNameRN SubGroup
------ --------- --------- --------
24139  6982      2         6981
24138  6981      1         6979

SubGroup should be calculated by GroupIdRN – MedNameRN. But when I select on the restricted columns the calculations, don't seem to be correct (6982 – 2 does not equal 6981). This problem only happened when there was more than one record for a given GroupId/OrderNumber pairing; which occurs if the NumDosages column is greater than 1 (one record for each dosage).

My solution was to change row_number to dense_rank:


SELECT t1.MyIdentity
    ,  t1.GroupId
	, t1.OrderNumber
	, t1.event
	, t1.NumDosages
	, t1.MedicationNameId
	, t1.ODE_Id
	, DENSE_RANK() OVER (ORDER BY groupid,ordernumber) as GroupIdRN
	, DENSE_RANK() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
	     ORDER BY groupid,ordernumber) as MedNameRN
	, DENSE_RANK() OVER (ORDER BY groupid,ordernumber)-
 	    DENSE_RANK() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
 	     ORDER BY groupid,ordernumber) AS SubGroup
INTO ISC_SplitResultsDenseRank
FROM
	ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
	OUTER APPLY
	(
		SELECT TOP 1 
			OrderNumber AS N
		FROM 
			ISC_ODE_NotProcessedBeforeSplitChainUpdate t2 
		WHERE
			t2.GroupId = t1.GroupId
			AND t2.OrderNumber < t1.OrderNumber
			AND t2.MedicationNameId <> t1.MedicationNameId
		ORDER BY 
			t2.orderNumber DESC 
	) t2
ORDER BY
	groupid, ordernumber;

Which gives this result:


GroupId OrderNumber event      NumDosages  MedicationNameId ODE_Id      GroupIdRN MedNameRN SubGroup
------- ----------- ---------- ----------- ---------------- ----------- --------- --------- --------
8615    1           Active     2           3721             24139       5934      1         5933
8615    1           Active     2           3721             24138       5934      1         5933

8615    2           Change     2           53               24136       5935      1         5934
8615    2           Change     2           53               24137       5935      1         5934
8615    3           Reorder    2           53               24135       5936      2         5934
8615    3           Reorder    2           53               24134       5936      2         5934
8615    4           Reorder    2           53               24132       5937      3         5934
8615    4           Reorder    2           53               24130       5937      3         5934
8615    5           Discontinu 2           53               24133       5938      4         5934
8615    5           Discontinu 2           53               24131       5938      4         5934

Which had the added advantage of having MedNameRN be the valid OrderNumber for the new groups.

So my question is: Why is the calculation wrong when I select on a subset of the columns? I think having an understanding of the why would help in my quest to become more proficient/knowledgeable about sql and the how the ranking functions work.

I cannot give you test data as the anomaly only occurs if I ran the solution on the whole table (30000 rows and 55 columns).

p.s.

When I discovered the problem instead of adding to this topic I opened a second thread Columns selected changes results huh? about the anomaly. This netted me only one (rightfully confused) response. I never got back to either thread because I was very busy and the solution mostly worked.

Edit: Linked to wrong thread.

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2012-10-12 : 17:11:23

Hi there AlgaeSea!

This isn't a thorough, super detailed response here, but I think I see the problem and it's a universal truth about RDBMSes: SQL Server can only guarantee EXPLICIT ordering. Since T-SQL is a DECLARED language, the optimizer is free to structure the execution plan and data access however it sees best in order to satisfy your requirements. You in fact told it that you really didn't care how it ordered matching records, so you left it free to do so in whatever manner made the most sense performance wise.

Let's take a look at the windowing function you use for GroupIdRN:

ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)

When it hits your example (GroupID 8615, OrderNumber 1), it sees two rows that look the same. Since the expression is only evaluating these two columns, it makes no difference which receives 1 and which receives 2 to the engine. In most cases, this will simply be assigned according to the internal order of the recordset. I'm throwing out a hypothesis here, but when you select the whole recordset it might be using the clustered index versus a the subset which may be taking advantage of a nonclustered index. There could be other reasons for this as well, such as if any sorting steps that are required in the execution plan to satisfy the OUTER APPLY.

Also, keep in mind that this function is evaluated and operating independently of the same code used to calculate SubGroup (which is why you see differences between the two). It is certainly unexpected to see them differ, but it's not out of the realm of possibility.

A way you could correct this behavior is if you could use a unique tie breaker -- let's say a surrogate primary key such as an IDENTITY:

ROW_NUMBER() OVER (ORDER BY groupid,ordernumber,ODE_Id)

That will ensure the result is repeatable since there will always be an explicit order even in the case of ties.

One other thing: there are many times you'll solutions on these boards that don't take this principal into consideration. I am certainly guilty of this as well (see my string concatenation method for "SQL Script to compare Database"), but in many scenarios the internal order can be consistent if not guaranteed. Just something to be aware of.

LaurieCox

158 Posts

Posted - 2012-10-15 : 13:41:49

Hi lazerath,

From my reading about the row_number function I figured it was non-deterministic for equal rows and you are right breaking the tie by adding ODE_id to the order by clauses works. It is still not as good a fix as using the dense_rank function (especially since the MedNameRN column gives me the correct order_number for the new split groups).

I still find it interesting that limiting the number of extra columns selected would change how it works (I guess that's what's meant by non-deterministic).

It is also obvious that pulling the parts of the equation out to try to debug did not really help; as it obviously 'used' different rows when getting the parts for calculating SubGroup as which rows it 'used' for the MedNameRN and GroupIdRN columns

Anyway adding the ODE_Id:


SELECT t1.GroupId
	, t1.OrderNumber
	, t1.event
	, t1.NumDosages
	, t1.MedicationNameId
	, t1.ODE_id
	, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber,ODE_id) as GroupIdRN
	, ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
	          ORDER BY groupid,ordernumber,ODE_id) as MedNameRN
	, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber,ODE_id)-
 	     ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0) 
 	          ORDER BY groupid,ordernumber,ODE_id) AS SubGroup
FROM ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
	OUTER APPLY
	(
		SELECT TOP 1 
			OrderNumber AS N
		FROM 
			ISC_ODE_NotProcessedBeforeSplitChainUpdate t2 
		WHERE
			t2.GroupId = t1.GroupId
			AND t2.OrderNumber < t1.OrderNumber
			AND t2.MedicationNameId <> t1.MedicationNameId
		ORDER BY 
			t2.orderNumber DESC 
	) t2
ORDER BY
	groupid, ordernumber;

Netted me this:


GroupId OrderNumber event      NumDosages  MedicationNameId ODE_Id      GroupIdRN MedNameRN SubGroup
------- ----------- ---------- ----------- ---------------- ----------- --------- --------- --------
8615    1           Active     2           3721             24138       6981      1         6980
8615    1           Active     2           3721             24139       6982      2         6980

8615    2           Change     2           53               24136       6983      1         6982
8615    2           Change     2           53               24137       6984      2         6982
8615    3           Reorder    2           53               24134       6985      3         6982
8615    3           Reorder    2           53               24135       6986      4         6982
8615    4           Reorder    2           53               24130       6987      5         6982
8615    4           Reorder    2           53               24132       6988      6         6982
8615    5           Discontinu 2           53               24131       6989      7         6982
8615    5           Discontinu 2           53               24133       6990      8         6982

But I still prefer the dense_rank solution.

But it has been fun figuring things out and I learned more about the window functions.

Thank you for your response,

Laurie

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources