query help: similar to group by - concat

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

query help: similar to group by - concat

Author

Topic

natejax
Starting Member

9 Posts

Posted - 2009-08-05 : 14:31:15

I have a Many-Many (or bridge) table:

THISGUY_ID INT, SANDWHICH INT

example data:

THISGUY | SANDWHICH
2001 1
2001 4
202 1
34 1
25 1
25 5
29 5
29 4
36 5
36 4

What i would like to do is select out the distinct groupings of sandwhiches along with some generic unique identifer, so from above, i would get:

SANDWHICH_BAG_ID, SANDWHICH_ID
1 1
1 4
2 1
3 1
3 5
4 5
4 4

can you please help me figure this one out without cursors?

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-08-05 : 14:44:11

I don't understand why your expected output is seven rows. I would think either eleven (one row for each in example data) or three rows (one for each sandwhich_id.

Be One with the Optimizer
TG

natejax
Starting Member

9 Posts

Posted - 2009-08-05 : 14:52:14

basically, it could be broken down into a 2 step process:

first i want all the distinct group of all guys and the sandwhiches they like. But, then i want a distinct group of the sandwhich groupings.

for example, both Jim and Ted will only eat tunafish and PB&J sandwhiches. I need a query that will identify the fact that these 2 sandwhiches are preferred by the same individual, but that will only select the sandwhiches grouping one time. obviously, i need the surrogate key/id to keep the relationship between the sandwhiches intact.

to put it in a more relevant form. you are a healthcare analyst and want to see what distinct combinations of drugs people are taking together.

cursors would be a big problem because of the volume of data and processing time.

this would be exactly like using "GROUPBY-CONCAT" but then running a "select distinct" on top of that to get only the distinct results. the trick is that i need to keep the values in their own rows.. and again, no cursors :( .

does that make sense?

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-08-05 : 15:25:55

about how many distinct drugs (sandwiches) do you have total?

Be One with the Optimizer
TG

natejax
Starting Member

9 Posts

Posted - 2009-08-05 : 15:32:32

i really appreciate any help that you could give me. i am hoping to better illustrate what i am trying to do. this is the SQL i used to get the same thing but all the drugs/sandwhiches as you prefer are in the same field. at the end, you get a table with distinct list of drugs/sandwhiches taken/eaten together.. now i could go through that with a cursor and generate insert statements for each delimited value, row by row, but that would be very slow. so i am going back to the source table and wondering if i can pull them out directly into another table while keeping the relationship intact?

--source table
CREATE TABLE [dbo].[Z_SCRAP_DRG_MM]
(PERS_KEY int, DRG_KEY int)

-- function to group by concat
create function dbo.DRG_CONCAT(@s_key int)
returns varchar(5000)
as
begin
declare @out varchar(5000)
select @out = coalesce(@out + '|' + convert(varchar,DRG_KEY), convert(varchar,DRG_KEY))
from Z_SCRAP_DRG_MM
where S_KEY = @s_key

return @out
end

--create table for persons with their clusters of drugs
CREATE TABLE [dbo].[DRG_MM](
[SCRAP_KEY] [int] IDENTITY(1,1) NOT NULL,
[PERS_KEY] [int] not null,
[CCOUNT] int not null,
[DRG_CLUSTER] varchar (5000) not null,

CONSTRAINT [PK_DRG_MM] PRIMARY KEY CLUSTERED
(
[SCRAP_KEY] ASC
)
)

GO

--populate table for persons with clusters of drugs
INSERT INTO DRG_MM
(PERS_KEY, CCOUNT, DRG_CLUSTER)
SELECT DISTINCT PERS_KEY, CCOUNT, '|' + dbo.DRG_CONCAT(PERS_KEY) + '|' DRG_CLUSTER
FROM (SELECT PERS_KEY, COUNT(DRG_KEY) CCOUNT FROM Z_SCRAP_DRG_MM GROUP BY S_KEY) X
ORDER BY PERS_KEY

GO

--create table for distinct clusters of drugs with key
CREATE TABLE [dbo].[DRG_C](
[DRG_C_KEY] [int] IDENTITY(1,1) NOT NULL,
[D_CLUSTER] varchar (5000) not null,
[D_COUNT] int,
CONSTRAINT [PK_DRG_C] PRIMARY KEY CLUSTERED
(
[DRG_C_KEY] ASC
)
)

GO

--populate distint clusters of drugs with key
INSERT INTO [DRG_C]
(D_CLUSTER, D_COUNT)
SELECT DISTINCT DRG_CLUSTER, CCOUNT FROM [DRG_MM] ORDER BY DRG_CLUSTER

GO

natejax
Starting Member

9 Posts

Posted - 2009-08-05 : 15:34:04

this same issue applies to a variety of situations i am looking at. some of them might have 10,000 or so while others are up in the several hundreds of thousands.

natejax
Starting Member

9 Posts

Posted - 2009-08-05 : 16:03:24

i can't shake the feeling that there is an easy way to do this that involves a technique i just have not seen or recently used.

but i do have an idea. i am thinking i can use a max statement in my existing process to get a single person_ID along with the cluster. then i can join back and populate a table with unique clusters (but in the desired one value per row form) and their associated max person_Ids . then i can figure out a way to swap the person-id for a surrogate key.

might not be pretty but it will be fast.

edit: actually, i can just use the surrogate key from the cluster table. i appreciate the follow-up on this. i'd still be interested in learning if there is a more efficient way to do this. thanks for the input. i am not sure i would have figured it out without posting it, which made me think it through more.

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-08-05 : 16:11:20

Here is what I'm thinking. Of course this won't work with 100000 different sandwiches but I think the idea would if we used a binary value instead of an int. The idea is to create a single value that would represent a specific combination of sandwiches. In this case it would be the sum of: 2 raised to the power of sandwich_id.



declare @t table (THISGUY int,  SANDWHICH int)
insert @t
select 2001, 1 union all
select 2001, 4 union all
select 202, 1 union all
select 34, 1 union all
select 25, 1 union all
select 25, 5 union all
select 29, 5 union all
select 29, 4 union all
select 36, 5 union all
select 36, 4


              
select sandwich_bag_id
       ,n.number as sandwich
       --,sandwiches
       --,sandwiches & power(2, n.number)
from   (
       select sandwiches
              ,row_Number() over (order by sandwiches) as sandwich_bag_id
       from   (
              select sum(power(2, sandwhich)) as sandwiches
              from @t
              group by thisguy
              ) d
       group by sandwiches
       ) d
cross join (select distinct sandwhich as number from @t) n
where  sandwiches & power(2, n.number) > 0


output:
sandwich_bag_id      sandwich
-------------------- -----------
1                    1
2                    1
2                    4
3                    1
3                    5
4                    4
4                    5

Be One with the Optimizer
TG

natejax
Starting Member

9 Posts

Posted - 2009-08-05 : 16:24:45

thats pretty impressive. i would have to spend some time to figure out exactly what you are doing because i have never written anything remotely like that (and i think i am pretty decent with SQL). it may well be more efficient than what i have but ill be sticking with the "workaround" for now. either way, ill be looking that over to see if i can follow it. i think i may get the general concept if it is rooted in the maximum possible combinations of sandwhiches and then you join back to get ones that are in use.

the one thing with yours that would be a problem for me is that you use ROW_COUNT and I have to support SQL2K... this has been a real source of frustration for me.

thanks for taking the time to work that out. ill post back what i am doing in case someone else runs into something like this. again, thanks TG!

edit - i am not sure i see you have row_number() function. not sure if that is the ROW COUNT feature that came with 2K5 that i was thinking or if i am confusing it wwith something else...

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-08-05 : 16:40:46

the row_number is a 2005 thing but it isn't really necessary here. It was just the surrogate id to match your desired output. You could the the sandwiches value instead:


declare @t table (THISGUY int,  SANDWHICH int)
insert @t
select 2001, 1 union all
select 2001, 4 union all
select 202, 1 union all
select 34, 1 union all
select 25, 1 union all
select 25, 5 union all
select 29, 5 union all
select 29, 4 union all
select 36, 5 union all
select 36, 4

select sandwiches as sandwich_bag_id
       ,n.number as sandwich
from   (
       select sandwiches
       from   (
              select sum(power(2, sandwhich)) as sandwiches
              from @t
              group by thisguy
              ) d
       group by sandwiches
       ) d
cross join (select distinct sandwhich as number from @t) n
where  sandwiches & power(2, n.number) > 0

output:
sandwich_bag_id sandwich
--------------- -----------
2               1
18              1
18              4
34              1
34              5
48              4
48              5

Be One with the Optimizer
TG

natejax
Starting Member

9 Posts

Posted - 2009-08-06 : 11:00:48

thanks TG. i wanted to post back what i wound up doing - quid pro quo. i've used this board a lot in the past but rarely post because i generally find what i need just using the search function.

i do need the other tables so its not a bad way to go for me. i had to cut out a lot of extra logic and make it generic enough so im not posting right out of my actual work, and i woke up with a migraine... and i took some medicine so the migraine is gone but now i might as well be in outer space....

so it would be suprising if there are not some minor errors in the code, but the real code it is based on works so the concept is solid. i still need to look over your code, TG, you bring a mathematical approach that i have never seen before and i can see it coming in handy in a variety of situations.

--source table
CREATE TABLE [dbo].[Z_SCRAP_DRG_MM]
(PERS_KEY int, DRG_KEY int)

-- function to group by concat
create function dbo.DRG_CONCAT(@s_key int)
returns varchar(5000)
as
begin
declare @out varchar(5000)
select @out = coalesce(@out + '|' + convert(varchar,DRG_KEY), convert(varchar,DRG_KEY))
from Z_SCRAP_DRG_MM
where S_KEY = @s_key

return @out
end

--create table for persons with their clusters of drugs
CREATE TABLE [dbo].[DRG_MM](
[SCRAP_KEY] [int] IDENTITY(1,1) NOT NULL,
[PERS_KEY] [int] not null,
[CCOUNT] int not null,
[DRG_CLUSTER] varchar (5000) not null,

CONSTRAINT [PK_DRG_MM] PRIMARY KEY CLUSTERED
(
[SCRAP_KEY] ASC
)
)

GO

--populate table for persons with clusters of drugs
INSERT INTO DRG_MM
(PERS_KEY, CCOUNT, DRG_CLUSTER)
SELECT DISTINCT PERS_KEY, CCOUNT, '|' + dbo.DRG_CONCAT(PERS_KEY) + '|' DRG_CLUSTER
FROM (SELECT PERS_KEY, COUNT(DRG_KEY) CCOUNT FROM Z_SCRAP_DRG_MM GROUP BY S_KEY) X
ORDER BY PERS_KEY

GO

--create table for distinct clusters of drugs with key
CREATE TABLE [dbo].[DRG_C](
[DRG_C_KEY] [int] IDENTITY(1,1) NOT NULL,
[D_CLUSTER] varchar (5000) not null,
[D_COUNT] int,
CONSTRAINT [PK_DRG_C] PRIMARY KEY CLUSTERED
(
[DRG_C_KEY] ASC
)
)

GO
--populate distinct drug clusters
INSERT INTO [DRG_C]
(D_CLUSTER, D_COUNT)
SELECT DISTINCT CLUSTER, CCOUNT
FROM [DRG_MM]
ORDER BY CLUSTER

GO
--table for clusters in row format
CREATE TABLE [hedisuser1].[DRG_GROUP](
[DG_KEY] [int] IDENTITY NOT NULL,
[DRG_C_KEY] [int] NOT NULL,
[DRG_KEY] int not null
--END INSERT COLUMNS
CONSTRAINT [PK_DRG_GROUP] PRIMARY KEY CLUSTERED
(
[DG_KEY] ASC
)
)

GO
-- query to bring in clusters with each drug on its own row
--sorry i have other logic in here that i cut out so the subqueries are probably no longer necessary
INSERT INTO [DRG_GROUP]
(DRG_C_KEY, DRG_KEY)
SELECT DISTINCT DRG_C_KEY, DRG_KEY FROM
(SELECT DC.DRG_C_KEY , MIN(MM.PERS_KEY) PERS_KEY
FROM
DRG_MM MM INNER JOIN DRG_C DC
ON MM.CLUSTER = DC.D_CLUSTER
GROUP BY DRG_C_KEY
) XX
JOIN
(
SELECT DISTINCT PERS_KEY, DRG_KEY
FROM Z_SCRAP_DRG_MM) HC
ON HC.PERS_KEY = XX.PERS_KEY
ORDER BY DRG_C_KEY, DRG_KEY

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2009-08-06 : 11:24:02

Thanks for updating us!

Our approaches are logically quite similar. The only differences are that I compressed the steps into a single statement and secondly, rather than concatenating the drugs into a "drug cluster" I used a (not too uncommon) technique of storing a single value comprised of the SUM of (2) raised to the power of the individual IDs.
So for instance given these IDs: (0, 1, 2, 3)
translates to: power(2, 0) + power(2,1) + power(2,2) + power(2,3)
which resolves to: 1 + 2 + 4 + 8
so in this case "15" represents these 4 IDs.

It is a handy trick but only works for a relatively small number of IDs. In the post with my first solution I suggested that with thousands of Drugs we would need to use a binary value rather than an int. This is similar to the way sql server stores user's [Roles] in the sysUsers table. But you could use your drug_cluster function to achieve the same thing.

Any way, thanks again for wrapping things up for us - glad it's working and keep posting :)

Be One with the Optimizer
TG

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources