Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Splitting Ordered Groups into Ordered Subgroups

Author  Topic 

LaurieCox

158 Posts

Posted - 2012-08-17 : 13:01:53
I discovered an anomaly in the solution given by sunitabeck when I ran his solution against actual data rather than the test data. See my new post below.

I have the following table:

CREATE TABLE TestTable(
ODE_Id int NOT NULL,
GroupId int NULL,
OrderNumber int NULL,
event varchar(10) NULL,
DrugName varchar(20) NULL,
NumDosages int NULL
)

(See end of post for data statements)

A group denotes a series of events. The first event in each group is Active followed by a series of Change and Order events. The order of events within a group is denoted by OrderNumber.

My problem is that all groups must only reference a single drug name and I have found groups where the drug name changes. I therefore must split these groups into sub groups when the drug name changes.

So given this data for three groups:

ODE_Id GroupId OrderNumber event DrugName NumDosages
----------- ----------- ----------- ---------- -------------------- -----------
29199 3191 1 Active Trazodone (Bulk) 1
29201 3191 2 Change Trazodone 1
29203 3191 3 Reorder Trazodone 1
29207 3191 4 Reorder Trazodone 1
29208 3191 5 Change Trazodone 1

20156 23523 1 Active Wellbutrin SR 1
20158 23523 2 Reorder Wellbutrin SR 1
20159 23523 3 Change Wellbutrin 1
20157 23523 4 Change Wellbutrin SR 1

24138 8615 1 Active Hydroxyzine HCl 2
24139 8615 1 Active Hydroxyzine HCl 2
24136 8615 2 Change Vistaril 2
24137 8615 2 Change Vistaril 2
24134 8615 3 Reorder Vistaril 2
24135 8615 3 Reorder Vistaril 2

Group 3191 needs to be split into 2 groups.
Group 23523 needs to be split into 3 groups.
Group 8615 needs to be split into 2 groups. This group also has multi dosages orders. This means that for a given OrderNumber there is more than one record (number of records will equal NumDosages).

So what I would like to do is add a column to the table called SubGroup and then write a query that would assign SubGroup numbers to split the groups.

So the result for the example given above could be:

ODE_Id GroupId OrderNumber event DrugName NumDosages SubGroup
----------- ----------- ----------- ---------- -------------------- ----------- -----------
29199 3191 1 Active Trazodone (Bulk) 1 1
29201 3191 2 Change Trazodone 1 2
29203 3191 3 Reorder Trazodone 1 2
29207 3191 4 Reorder Trazodone 1 2
29208 3191 5 Change Trazodone 1 2

20156 23523 1 Active Wellbutrin SR 1 3
20158 23523 2 Reorder Wellbutrin SR 1 3
20159 23523 3 Change Wellbutrin 1 4
20157 23523 4 Change Wellbutrin SR 1 5

24138 8615 1 Active Hydroxyzine HCl 2 6
24139 8615 1 Active Hydroxyzine HCl 2 6
24136 8615 2 Change Vistaril 2 7
24137 8615 2 Change Vistaril 2 7
24134 8615 3 Reorder Vistaril 2 7
24135 8615 3 Reorder Vistaril 2 7

Notes on the values in SubGroup column:

  • Within a Group the SubGroup numbers should be ordered so that if I sort the group by SubGroup, OrderNumber the records will order correctly.

  • SubGroup should be unique across groups. So from the example above if Group 3191 has SubGroups 1 and 2 no other group should have a SubGroup of 1 or 2.

  • Other than the preceding two rules I don't care what SubGroup values any given group is assigned.


At the moment I am not really sure where to start. I will of course be working on the problem and will post any attempts that I come up with.

Laurie

Test Data (it includes the examples given above plus more groups)

insert into TestTable
Select 23989,1985,1,'Active','Hydroxyzine HCl',1 Union All
Select 23990,1985,2,'Change','Vistaril',1 Union All
Select 23991,1985,3,'Discontinu','Vistaril',1 Union All
Select 29199,3191,1,'Active','Trazodone (Bulk)',1 Union All
Select 29201,3191,2,'Change','Trazodone',1 Union All
Select 29203,3191,3,'Reorder','Trazodone',1 Union All
Select 29207,3191,4,'Reorder','Trazodone',1 Union All
Select 29208,3191,5,'Change','Trazodone',1 Union All
Select 3820,5423,1,'Active','Hydroxyzine HCl',1 Union All
Select 3954,5423,2,'Reorder','Hydroxyzine HCl',1 Union All
Select 3979,5423,3,'Reorder','Hydroxyzine HCl',1 Union All
Select 4054,5423,4,'Change','Vistaril',2 Union All
Select 4055,5423,4,'Change','Vistaril',2 Union All
Select 4114,5423,5,'Reorder','Vistaril',2 Union All
Select 4115,5423,5,'Reorder','Vistaril',2 Union All
Select 24138,8615,1,'Active','Hydroxyzine HCl',2 Union All
Select 24139,8615,1,'Active','Hydroxyzine HCl',2 Union All
Select 24136,8615,2,'Change','Vistaril',2 Union All
Select 24137,8615,2,'Change','Vistaril',2 Union All
Select 24134,8615,3,'Reorder','Vistaril',2 Union All
Select 24135,8615,3,'Reorder','Vistaril',2 Union All
Select 3927,10231,1,'Active','Hydroxyzine HCl',2 Union All
Select 3928,10231,1,'Active','Hydroxyzine HCl',2 Union All
Select 3962,10231,2,'Change','Vistaril',2 Union All
Select 3963,10231,2,'Change','Vistaril',2 Union All
Select 4024,10231,3,'Reorder','Vistaril',2 Union All
Select 4025,10231,3,'Reorder','Vistaril',2 Union All
Select 26016,12911,1,'Active','Hydroxyzine HCl',1 Union All
Select 26014,12911,2,'Reorder','Hydroxyzine HCl',1 Union All
Select 26015,12911,3,'Change','Vistaril',1 Union All
Select 26011,12911,4,'Reorder','Vistaril',1 Union All
Select 26012,12911,5,'Change','Hydroxyzine HCl',1 Union All
Select 26013,12911,6,'Change','Hydroxyzine HCl',1 Union All
Select 26017,12911,7,'Reorder','Hydroxyzine HCl',1 Union All
Select 3843,15582,1,'Active','Hydroxyzine HCl',1 Union All
Select 3812,15582,2,'Change','Vistaril',1 Union All
Select 3813,15582,3,'Discontinu','Vistaril',1 Union All
Select 20156,23523,1,'Active','Wellbutrin SR',1 Union All
Select 20158,23523,2,'Reorder','Wellbutrin SR',1 Union All
Select 20159,23523,3,'Change','Wellbutrin',1 Union All
Select 20157,23523,4,'Change','Wellbutrin SR',1

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2012-08-17 : 13:25:59
[code]
select *, dense_rank() over (ORDER BY GroupId,DrugName) AS SubGroup
FROM TestTable
[/code]

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/

Go to Top of Page

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-17 : 14:03:16
May be this - but it may not order correctly in your second example (GroupId = 23523) where the DrugName changes to something and changes back to the original again. But if you order by GroupId and OrderNumber it would/should order correctly.
SELECT
*,
ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
ROW_NUMBER() OVER (PARTITION BY drugname,groupId ORDER BY groupid,ordernumber) AS SubGroup
FROM
testtable
ORDER BY
groupid, ordernumber;
Go to Top of Page

LaurieCox

158 Posts

Posted - 2012-08-17 : 14:47:35
Hi visakh16, sunitabeck

Thanks for the replies. Unfortunately visakh16, your solution does not work for where the drug name changes and then changes back and also does not necessarily order the SubGroups correctly

Sunitabeck yours does divide the groups into the right number of groups but does not (as you pointed out) order SubGroups correctly.

This may or may not be a problem. Because this was only step one in the process. After I have computed the SubGroup I need to do the following:

  1. Renumber the OrderNumber column so that it starts at one for each SubGroup

  2. Change the first event in each SubGroup to Active

  3. Change the last event in each SubGroup (except for the last Subgroup within a group) to Discontinu

  4. Update a comment column (I didn't include it in my TestTable as it wasn't relevant for this initial step) for the first record (determined by OrderNumber) within a given SubGroup that follows a SubGroup (within a given group) that says that this is a continuation of the preceding SubGroup


Step 3 and Step 4 both require that I know the proper order of the SubGroups within a group. I even have code that does this for a similar problem and it depends on the SubGroups being ordered correctly. I think I will work on this code and post it here as somebody maybe able to modify it so it does not depend on SubGroup order (though I am not sure how that would be possible).

Plus I will read up on both dense_rank and partitions.

Again thanks for the help,

Laurie

Edited because when I added a new step 1 to the list above I didn't change the references to the steps in the following text.
Go to Top of Page

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-17 : 15:34:00
The four updates you described seem to be easily doable based on the ordering of the current OrderNumber; so you won't need to rely on the ordering of the SubGroup.

If you really want to, you can get the SubGroup to be ordered as you described - for example like shown below, but as you can see, that requires a whole another sub-query (actually OUTER APPLY) and resulting performance hit.
SELECT
t1.*,
ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
ROW_NUMBER() OVER (PARTITION BY drugname,groupId,COALESCE(N,0) ORDER BY groupid,ordernumber) AS SubGroup
FROM
testtable t1
OUTER APPLY
(
SELECT TOP 1
OrderNumber AS N
FROM
testtable t2
WHERE
t2.GroupId = t1.GroupId
AND t2.OrderNumber < t1.OrderNumber
AND t2.DrugName <> t1.DrugName
ORDER BY
t2.orderNumber DESC
) t2

ORDER BY
groupid, ordernumber;
Go to Top of Page

LaurieCox

158 Posts

Posted - 2012-08-17 : 15:52:24
Hi sunitabeck,

Thanks for your reply. That looks like it works perfectly. I will probably go with it as this is a one time thing. I am converting data from one medication prescribing system to another, so I am more interested in saving my development time than processing time. As I said I already have the script written that assumes that the SubGroups are ordered for a similar problem.

Plus it has the added benefit of adding to the list of problems that I posted here that turned out to have the cross and/or outer apply clause as part of the solution.

Again thank you,

Laurie
Go to Top of Page

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-17 : 15:58:37
Glad it worked out - I can relate very well to "saving developer time rather than computer time" :)
Go to Top of Page

LaurieCox

158 Posts

Posted - 2012-10-12 : 15:07:14
I discovered an anomaly in the solution given by sunitabeck above when I ran it against actual data rather than the test data given in the op. I have figured out where the problem was occurring and how to fix, but not exactly why. I am interested in the why.

The problem is when running the solution above against the actual data (30000 rows/55 columns) I get a different result for the SubGroup value if my select statement selects on all columns (t1.* - which gives me the 'correct' values) than if it selects on a subset of the columns (which gives me incorrect values).

So in order to figure out what was happening I modified sunitabeck's solution to give me separate row_number values that make up the SubGroup field (GroupIdRN - MedNameRN):

SELECT t1.*
, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber) as GroupIdRN
, ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber) as MedNameRn
, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber) AS SubGroup
INTO ISC_SplitResultsAllColumns
FROM
ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
OUTER APPLY
(
SELECT TOP 1
OrderNumber AS N
FROM
ISC_ODE_NotProcessedBeforeSplitChainUpdate t2
WHERE
t2.GroupId = t1.GroupId
AND t2.OrderNumber < t1.OrderNumber
AND t2.MedicationNameId <> t1.MedicationNameId
ORDER BY
t2.orderNumber DESC
) t2
ORDER BY
groupid, ordernumber;

When I select from the solution table for just the interesting columns I get this result:

GroupId OrderNumber event NumDosages MedicationNameId ODE_Id GroupIdRN MedNameRn SubGroup
------- ----------- ---------- ----------- ---------------- ------ ---------- --------- --------
8615 1 Active 2 3721 24138 6981 1 6980
8615 1 Active 2 3721 24139 6982 2 6980

8615 2 Change 2 53 24136 6983 1 6982
8615 2 Change 2 53 24137 6984 2 6982
8615 3 Reorder 2 53 24134 6985 3 6982
8615 3 Reorder 2 53 24135 6986 4 6982
8615 4 Reorder 2 53 24130 6987 5 6982
8615 4 Reorder 2 53 24132 6988 6 6982
8615 5 Discontinu 2 53 24133 6989 7 6982
8615 5 Discontinu 2 53 24131 6990 8 6982

This successfully split the group into two groups; splitting them when the MedicationNameId changed.

But when I run this query (selecting only on a few of the columns):

SELECT t1.GroupId
, t1.OrderNumber
, t1.event
, t1.NumDosages
, t1.MedicationNameId
, t1.ODE_Id
, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber) as GroupIdRN
, ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber) as MedNameRN
, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)-
ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber) AS SubGroup
into ISC_SplitResultsSelectedColumns
FROM ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
OUTER APPLY
(
SELECT TOP 1
OrderNumber AS N
FROM
ISC_ODE_NotProcessedBeforeSplitChainUpdate t2
WHERE
t2.GroupId = t1.GroupId
AND t2.OrderNumber < t1.OrderNumber
AND t2.MedicationNameId <> t1.MedicationNameId
ORDER BY
t2.orderNumber DESC
) t2
ORDER BY
groupid, ordernumber;

I get this result:

GroupId OrderNumber event NumDosages MedicationNameId ODE_Id GroupIdRN MedNameRN SubGroup
------- ----------- ---------- ----------- ---------------- ------ --------- --------- --------
8615 1 Active 2 3721 24139 6982 2 6981

8615 1 Active 2 3721 24138 6981 1 6979

8615 2 Change 2 53 24136 6983 2 6982
8615 2 Change 2 53 24137 6984 1 6982
8615 3 Reorder 2 53 24135 6986 4 6982
8615 3 Reorder 2 53 24134 6985 3 6982
8615 4 Reorder 2 53 24132 6987 6 6982
8615 4 Reorder 2 53 24130 6988 5 6982
8615 5 Discontinu 2 53 24133 6990 8 6982
8615 5 Discontinu 2 53 24131 6989 7 6982

Splitting it into three groups. The difference is found in the GroupIdRN and MedNameRN columns:

All columns selected:
ODE_Id GroupIdRN MedNameRn SubGroup
------ ---------- --------- --------
24138 6981 1 6980
24139 6982 2 6980


Columns restricted:
ODE_Id GroupIdRN MedNameRN SubGroup
------ --------- --------- --------
24139 6982 2 6981
24138 6981 1 6979

SubGroup should be calculated by GroupIdRN – MedNameRN. But when I select on the restricted columns the calculations, don't seem to be correct (6982 – 2 does not equal 6981). This problem only happened when there was more than one record for a given GroupId/OrderNumber pairing; which occurs if the NumDosages column is greater than 1 (one record for each dosage).

My solution was to change row_number to dense_rank:

SELECT t1.MyIdentity
, t1.GroupId
, t1.OrderNumber
, t1.event
, t1.NumDosages
, t1.MedicationNameId
, t1.ODE_Id
, DENSE_RANK() OVER (ORDER BY groupid,ordernumber) as GroupIdRN
, DENSE_RANK() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber) as MedNameRN
, DENSE_RANK() OVER (ORDER BY groupid,ordernumber)-
DENSE_RANK() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber) AS SubGroup
INTO ISC_SplitResultsDenseRank
FROM
ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
OUTER APPLY
(
SELECT TOP 1
OrderNumber AS N
FROM
ISC_ODE_NotProcessedBeforeSplitChainUpdate t2
WHERE
t2.GroupId = t1.GroupId
AND t2.OrderNumber < t1.OrderNumber
AND t2.MedicationNameId <> t1.MedicationNameId
ORDER BY
t2.orderNumber DESC
) t2
ORDER BY
groupid, ordernumber;

Which gives this result:

GroupId OrderNumber event NumDosages MedicationNameId ODE_Id GroupIdRN MedNameRN SubGroup
------- ----------- ---------- ----------- ---------------- ----------- --------- --------- --------
8615 1 Active 2 3721 24139 5934 1 5933
8615 1 Active 2 3721 24138 5934 1 5933

8615 2 Change 2 53 24136 5935 1 5934
8615 2 Change 2 53 24137 5935 1 5934
8615 3 Reorder 2 53 24135 5936 2 5934
8615 3 Reorder 2 53 24134 5936 2 5934
8615 4 Reorder 2 53 24132 5937 3 5934
8615 4 Reorder 2 53 24130 5937 3 5934
8615 5 Discontinu 2 53 24133 5938 4 5934
8615 5 Discontinu 2 53 24131 5938 4 5934

Which had the added advantage of having MedNameRN be the valid OrderNumber for the new groups.

So my question is: Why is the calculation wrong when I select on a subset of the columns? I think having an understanding of the why would help in my quest to become more proficient/knowledgeable about sql and the how the ranking functions work.

I cannot give you test data as the anomaly only occurs if I ran the solution on the whole table (30000 rows and 55 columns).

p.s.

When I discovered the problem instead of adding to this topic I opened a second thread Columns selected changes results huh? about the anomaly. This netted me only one (rightfully confused) response. I never got back to either thread because I was very busy and the solution mostly worked.

Edit: Linked to wrong thread.
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2012-10-12 : 17:11:23
Hi there AlgaeSea!

This isn't a thorough, super detailed response here, but I think I see the problem and it's a universal truth about RDBMSes: SQL Server can only guarantee EXPLICIT ordering. Since T-SQL is a DECLARED language, the optimizer is free to structure the execution plan and data access however it sees best in order to satisfy your requirements. You in fact told it that you really didn't care how it ordered matching records, so you left it free to do so in whatever manner made the most sense performance wise.

Let's take a look at the windowing function you use for GroupIdRN:

ROW_NUMBER() OVER (ORDER BY groupid,ordernumber)


When it hits your example (GroupID 8615, OrderNumber 1), it sees two rows that look the same. Since the expression is only evaluating these two columns, it makes no difference which receives 1 and which receives 2 to the engine. In most cases, this will simply be assigned according to the internal order of the recordset. I'm throwing out a hypothesis here, but when you select the whole recordset it might be using the clustered index versus a the subset which may be taking advantage of a nonclustered index. There could be other reasons for this as well, such as if any sorting steps that are required in the execution plan to satisfy the OUTER APPLY.

Also, keep in mind that this function is evaluated and operating independently of the same code used to calculate SubGroup (which is why you see differences between the two). It is certainly unexpected to see them differ, but it's not out of the realm of possibility.

A way you could correct this behavior is if you could use a unique tie breaker -- let's say a surrogate primary key such as an IDENTITY:

ROW_NUMBER() OVER (ORDER BY groupid,ordernumber,ODE_Id)


That will ensure the result is repeatable since there will always be an explicit order even in the case of ties.

One other thing: there are many times you'll solutions on these boards that don't take this principal into consideration. I am certainly guilty of this as well (see my string concatenation method for "SQL Script to compare Database"), but in many scenarios the internal order can be consistent if not guaranteed. Just something to be aware of.
Go to Top of Page

LaurieCox

158 Posts

Posted - 2012-10-15 : 13:41:49
Hi lazerath,

From my reading about the row_number function I figured it was non-deterministic for equal rows and you are right breaking the tie by adding ODE_id to the order by clauses works. It is still not as good a fix as using the dense_rank function (especially since the MedNameRN column gives me the correct order_number for the new split groups).

I still find it interesting that limiting the number of extra columns selected would change how it works (I guess that's what's meant by non-deterministic).

It is also obvious that pulling the parts of the equation out to try to debug did not really help; as it obviously 'used' different rows when getting the parts for calculating SubGroup as which rows it 'used' for the MedNameRN and GroupIdRN columns

Anyway adding the ODE_Id:

SELECT t1.GroupId
, t1.OrderNumber
, t1.event
, t1.NumDosages
, t1.MedicationNameId
, t1.ODE_id
, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber,ODE_id) as GroupIdRN
, ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber,ODE_id) as MedNameRN
, ROW_NUMBER() OVER (ORDER BY groupid,ordernumber,ODE_id)-
ROW_NUMBER() OVER (PARTITION BY MedicationNameId,groupId,COALESCE(N,0)
ORDER BY groupid,ordernumber,ODE_id) AS SubGroup
FROM ISC_ODE_NotProcessedBeforeSplitChainUpdate t1
OUTER APPLY
(
SELECT TOP 1
OrderNumber AS N
FROM
ISC_ODE_NotProcessedBeforeSplitChainUpdate t2
WHERE
t2.GroupId = t1.GroupId
AND t2.OrderNumber < t1.OrderNumber
AND t2.MedicationNameId <> t1.MedicationNameId
ORDER BY
t2.orderNumber DESC
) t2
ORDER BY
groupid, ordernumber;

Netted me this:

GroupId OrderNumber event NumDosages MedicationNameId ODE_Id GroupIdRN MedNameRN SubGroup
------- ----------- ---------- ----------- ---------------- ----------- --------- --------- --------
8615 1 Active 2 3721 24138 6981 1 6980
8615 1 Active 2 3721 24139 6982 2 6980

8615 2 Change 2 53 24136 6983 1 6982
8615 2 Change 2 53 24137 6984 2 6982
8615 3 Reorder 2 53 24134 6985 3 6982
8615 3 Reorder 2 53 24135 6986 4 6982
8615 4 Reorder 2 53 24130 6987 5 6982
8615 4 Reorder 2 53 24132 6988 6 6982
8615 5 Discontinu 2 53 24131 6989 7 6982
8615 5 Discontinu 2 53 24133 6990 8 6982

But I still prefer the dense_rank solution.

But it has been fun figuring things out and I learned more about the window functions.

Thank you for your response,

Laurie
Go to Top of Page
   

- Advertisement -