SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Recursive query without a hierarchy - .5 mill recs
 New Topic  Reply to Topic
 Printer Friendly
Author Previous Topic Topic Next Topic  

LearningAsIGo
Starting Member

9 Posts

Posted - 08/23/2012 :  06:56:39  Show Profile  Reply with Quote
Hi,

I've inherited a table with data that needs to be grouped.

In each row, there are 2 IDs that are related - Id1 and Id2. They are both integers, and, in each row, Id1 is *always* less than Id2.

It is possible for IDs in different rows to be related to each other, which can be tracked by either Row1.Id1 or Row1.Id2 appearing in one of the Id fields of Row2, Row3, etc.

I need to track all those relations and group all related IDs under the same common GroupId. For example, Row1 might have Id1=20 and Id2=30. Row2 might have Id1=30 and Id2= 40. Since "30" shows up in both rows, then all the IDs are related, and I'd like to group them together in a new table that has a new GroupId (an identity column) and the ID itself. Please see below for a SQL sample.

I was thinking of using a CTE, but there is no hierarchy to the data, and I'm not sure what to try next.

If anybody has the time to take a look, I would love to hear some ideas. Thank you!


-- Set up sample data
CREATE TABLE #Associations
(Id1 INT,
Id2 INT)

INSERT INTO #Associations 
SELECT 100,200 UNION ALL
SELECT 100,300 UNION ALL
SELECT 100,400 UNION ALL 
SELECT 200,500 UNION ALL
SELECT 200,600 UNION ALL
SELECT 300,700 UNION ALL
SELECT 600,800 UNION ALL
SELECT 900,1000 UNION ALL
SELECT 1000,1100 UNION ALL
SELECT 1100,1200 UNION ALL
SELECT 1200,1300 UNION ALL
SELECT 1400,1500 


-- Show mocked-up version of the expected result

CREATE TABLE #ExpectedResult
(GroupId INT,
Id INT) 

INSERT INTO #ExpectedResult
SELECT 1,100 UNION ALL
SELECT 1,200 UNION ALL 
SELECT 1,300 UNION ALL
SELECT 1,400 UNION ALL
SELECT 1,500 UNION ALL
SELECT 1,600 UNION ALL 
SELECT 1,700 UNION ALL
SELECT 1,800 UNION ALL
SELECT 2,900     UNION ALL
SELECT 2,1000	   UNION ALL
SELECT 2,1100	   UNION ALL
SELECT 2,1200	   UNION ALL
SELECT 2,1300	   UNION ALL
SELECT 3,1400	   UNION ALL
SELECT 3,1500   

-- See both tables

SELECT * FROM #Associations
SELECT * FROM #ExpectedResult

-- Clean up

--DROP TABLE #Associations 
--DROP TABLE #ExpectedResult



Edited by - LearningAsIGo on 08/23/2012 18:05:57

sunitabeck
Flowing Fount of Yak Knowledge

5155 Posts

Posted - 08/23/2012 :  07:48:00  Show Profile  Reply with Quote
Very nice posting!! So easy to copy the DDL's and see what you are trying to accomplish. Would this do what you are trying to accomplish?
;WITH cte AS
(
	SELECT id1,id2, DENSE_RANK() OVER (ORDER BY id1) AS RN FROM #Associations a
	WHERE NOT EXISTS 
	(SELECT * FROM #Associations b WHERE a.id1 = b.id2)
	
	UNION ALL
	
	SELECT a.id1,a.id2, RN FROM #Associations a
	INNER JOIN cte c ON c.id2 = a.Id1
)
SELECT id1,rn FROM cte UNION SELECT id2,rn FROM cte order by rn;

Edited by - sunitabeck on 08/23/2012 07:48:25
Go to Top of Page

LearningAsIGo
Starting Member

9 Posts

Posted - 08/23/2012 :  16:05:26  Show Profile  Reply with Quote
Hi Sunita,

Thank you for your time and for the compliment. I am really grateful to all the experts here in the forum and try my best not to waste their/your time with badly formatted questions :).

The query definitely works with the small sample set and is what I was looking for, but it doesn't scale well with the 500,000 records in my table. So far, it has taken 24 minutes and is still running. I'm going to let it run a little longer and see what happens.

Thanks again, Sunita.
Go to Top of Page

LearningAsIGo
Starting Member

9 Posts

Posted - 08/24/2012 :  09:05:08  Show Profile  Reply with Quote
Update: I let the query run to about an hour and then stopped it because it would take too long for the requirements under which the query has to be used.

If there are any suggestions on how to approach this for a large record set, I would be very grateful.

And, thank you, Sunita, for showing how to do a recursive CTE without a hierarchy. I didn't even know that was possible.
Go to Top of Page

sunitabeck
Flowing Fount of Yak Knowledge

5155 Posts

Posted - 08/24/2012 :  09:14:07  Show Profile  Reply with Quote
Do you have any indexes on the table? If you don't and if you are able to add indexes on ID1 and/or ID2, that would help, I think.
Go to Top of Page

LearningAsIGo
Starting Member

9 Posts

Posted - 08/24/2012 :  09:15:40  Show Profile  Reply with Quote
Yep, there are indexes on Id1 and Id2.
Go to Top of Page

sunitabeck
Flowing Fount of Yak Knowledge

5155 Posts

Posted - 08/24/2012 :  09:27:33  Show Profile  Reply with Quote
I don't think this will be any faster - but another thought. On smaller sets, this would certainly perform slower, but there is a possibility that it may be better for larger sets. Also, it is destructive, so save your original table.
CREATE TABLE #tmp (id1 INT, id2 INT, RN INT);

DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,NULL
INTO #tmp (id1,id2,RN)
FROM
	#Associations a
WHERE
	NOT EXISTS 
	( SELECT * FROM #Associations b WHERE a.id1 = b.id2 ); 
	
;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) AS RN2 FROM #tmp)
UPDATE cte SET RN = RN2;

DECLARE @rows INT = 1;
WHILE (@rows > 0)
BEGIN
	DELETE FROM a
	OUTPUT DELETED.id1,DELETED.id2,c.RN
	INTO #tmp (id1,id2,RN)
	FROM
		#Associations a
		INNER JOIN  #tmp c ON c.id2 = a.Id1
	SET @rows = @@ROWCOUNT;
END

SELECT id1,rn FROM #tmp UNION SELECT id2,rn FROM #tmp ORDER BY 2,1
Go to Top of Page

LearningAsIGo
Starting Member

9 Posts

Posted - 08/25/2012 :  08:36:54  Show Profile  Reply with Quote
Wow! That took 10 seconds. I am flabbergasted and very humbled & impressed. My jaw literally dropped. Thank you so much.

There is one issue, though, that came up with 8% of the IDs because my sample data above didn't take into account one scenario.

The issue occurs when one row's Id2 exists elsewhere, but the row's Id1 doesn't exist anywhere else. In this instance, the existing Id's existing RN doesn't get reused, but an entirely new RN is calculated.

Here's an example to clarify. I've added two records to the insert - rows D and E - and also added the provided solution for easier copying.

Let's start with Row D.

In Row D, I added Id1 = 150 and Id2 = 200. Since the id value "200" already exists in Row A, the RN from Row A should be used here. Instead, a new RN gets calculated. The same thing happens in Row E - a new RN is calculated. This new RN is then applied to all subsequent IDs that are linked to "200".

Is there a way to address this? I know I've already taken up a lot of your time and I apologize for the incomplete sample data.


CREATE TABLE #Associations
(Id1 INT,
Id2 INT)

INSERT INTO #Associations 
/*A*/ SELECT 100,200 UNION ALL
/*B*/ SELECT 100,300 UNION ALL
/*C*/ SELECT 100,400 UNION ALL 

/* New records below */
/*D*/ SELECT 150,200 UNION ALL 
/*E*/ SELECT 160,200 UNION ALL
/* New records above */

/*F*/ SELECT 200,500 UNION ALL
/*G*/ SELECT 200,600 UNION ALL
/*H*/ SELECT 300,700 UNION ALL
/*I*/ SELECT 600,800 UNION ALL
/*J*/ SELECT 900,1000 UNION ALL
/*K*/ SELECT 1000,1100 UNION ALL
/*L*/ SELECT 1100,1200 UNION ALL
/*M*/ SELECT 1200,1300 UNION ALL
/*N*/ SELECT 1400,1500 

 

---- Table backup (quick & dirty)

CREATE TABLE #Associations_Backup (id1 INT, id2 INT)
INSERT INTO #Associations_Backup (id1, id2)
SELECT id1, id2 FROM #Associations

---- Sunita's speedy solution 

CREATE TABLE #tmp (id1 INT, id2 INT, RN INT);

DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,NULL
INTO #tmp (id1,id2,RN)
FROM
	#Associations a
WHERE
	NOT EXISTS 
	( SELECT * FROM #Associations b WHERE a.id1 = b.id2 ); 
	
;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) AS RN2 FROM #tmp)
UPDATE cte SET RN = RN2;

DECLARE @rows INT = 1;
WHILE (@rows > 0)
BEGIN
	DELETE FROM a
	OUTPUT DELETED.id1,DELETED.id2,c.RN
	INTO #tmp (id1,id2,RN)
	FROM
		#Associations a
		INNER JOIN  #tmp c ON c.id2 = a.Id1
	SET @rows = @@ROWCOUNT;
END

SELECT id1,rn FROM #tmp UNION SELECT id2,rn FROM #tmp ORDER BY 2,1

---- Clean up

 --DROP TABLE #Associations
 --DROP TABLE #Associations_Backup
 --DROP TABLE #tmp








Go to Top of Page

joe8079
Posting Yak Master

USA
127 Posts

Posted - 08/25/2012 :  15:50:57  Show Profile  Reply with Quote
Question:

I always thought while loops were considered evil on this board so I'm kind of confused as to how this would work or why one should use a while loop here:


WHILE (@rows > 0)
BEGIN
DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,c.RN
INTO #tmp (id1,id2,RN)
FROM
#Associations a
INNER JOIN #tmp c ON c.id2 = a.Id1
SET @rows = @@ROWCOUNT;
END
Go to Top of Page

sunitabeck
Flowing Fount of Yak Knowledge

5155 Posts

Posted - 08/26/2012 :  09:52:35  Show Profile  Reply with Quote
I would not have expected that much of a change in performance either. Now I am suspecting whether there is something in the data that causes circular references or other things that cause the CTE to go into more recursions than we are expecting it to.

In any case, following through with the example that you posted, is it possible that there are multiple linkages - for example, if you were to add another row with id1= 160, id2=1000 (and continuing on for example by adding id1=1400, id2=1000)? Assuming that that is not the case, if you change the group updating as shown below, it should group the data correctly.
;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1)
	- ROW_NUMBER() OVER(PARTITION BY id2 ORDER BY id1) AS RN2 FROM #tmp)
UPDATE cte SET RN = RN2;
This would assign unique group ids, but with gaps. If you do need the group id's to be sequential, then:
;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1)
	- ROW_NUMBER() OVER(PARTITION BY id2 ORDER BY id1) AS RN2 FROM #tmp),
cte2 AS 
(SELECT *,DENSE_RANK() OVER (ORDER BY RN2) AS RN3 FROM cte)
UPDATE cte2 SET RN = RN3;
Go to Top of Page

sunitabeck
Flowing Fount of Yak Knowledge

5155 Posts

Posted - 08/26/2012 :  09:56:36  Show Profile  Reply with Quote
quote:
Originally posted by joe8079

Question:

I always thought while loops were considered evil on this board so I'm kind of confused as to how this would work or why one should use a while loop here:


WHILE (@rows > 0)
BEGIN
DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,c.RN
INTO #tmp (id1,id2,RN)
FROM
#Associations a
INNER JOIN #tmp c ON c.id2 = a.Id1
SET @rows = @@ROWCOUNT;
END

WHILE loops are indeed evil in most cases but it is possible that one or more of the following contribute to better performance in this case:

1. There is something in the data that causes the recursive query to go into infinite recursions or collecting more data than we are expecting in each level such that the number of rows in successive recursions are growing exponentially.

2. In the while loop query, because we are deleting the data from the table, the size progressively decreases.

3. The while loop query is multiple statements, but the recursive query is a single statement. So many smaller transactions vs one large transaction making the smaller transactions run faster.

I am just guessing - don't really know which if any of these are causing the better performance.
Go to Top of Page

LearningAsIGo
Starting Member

9 Posts

Posted - 08/26/2012 :  17:08:45  Show Profile  Reply with Quote
I was wondering about the miracle loop, too, so thank you, joe, for asking and Sunita for taking the time to explain.

quote:
Originally posted by sunitabeck


In any case, following through with the example that you posted, is it possible that there are multiple linkages - for example, if you were to add another row with id1= 160, id2=1000 (and continuing on for example by adding id1=1400, id2=1000)?



I just did some analysis on the data and there is a lot of that happening, with the caveat that Id1 is always less than Id2.

The following happens quite frequently:
If RowA(Id1,Id2) = 2,3 and RowB(Id1,Id2) = 2,4, then RowJ(Id1,Id2) = 3,4. For instances where any given Id1 has a bunch of Id2s, many of the Id2s pair up by themselves later on.

--Edited to add: After a bit of Googling, I think this is what was done to the original data: http://stackoverflow.com/questions/4378698/return-all-possible-combinations-of-values-on-columns-in-sql

Is there a way to undo that?
--End of edit

At this point, would it be better just go back to the client and request better data or is there any way to beat the circular references? Knowing this client, though, I'll probably end up right back here in a couple of weeks with more questions about how to make sense of the new stuff :).

Edited by - LearningAsIGo on 08/27/2012 04:45:33
Go to Top of Page

sunitabeck
Flowing Fount of Yak Knowledge

5155 Posts

Posted - 08/27/2012 :  07:06:08  Show Profile  Reply with Quote
quote:
--Edited to add: After a bit of Googling, I think this is what was done to the original data: http://stackoverflow.com/questions/4378698/return-all-possible-combinations-of-values-on-columns-in-sql
I don't see how the data can be recovered. For example, consider these two cases, both of which should generate the same results if you were to run through the query posted in the link. So then given only the results of the query, it seems to me that the original information is lost forever and unrecoverable.
CREATE TABLE #tmp1 (col1 INT, col2 INT);
INSERT INTO #tmp1 VALUES (1,2),(3,4);

CREATE TABLE #tmp2 (col1 INT, col2 INT);
INSERT INTO #tmp2 VALUES (2,3),(1,4);
Go to Top of Page

LearningAsIGo
Starting Member

9 Posts

Posted - 08/28/2012 :  15:56:39  Show Profile  Reply with Quote
Thank you for all your time, Sunita.

I've requested better data from the client and your explanations helped me formulate the problem more clearly than I might otherwise have done.
Go to Top of Page

ejm
Starting Member

Ireland
1 Posts

Posted - 03/07/2014 :  11:57:40  Show Profile  Reply with Quote
Hi
I tried Sunita's code fragment with the following set:

(1,2)
(2,3)
(2,4)
(1,4)
(4,8)
(1,9)
(10,11)
(11,32)
(14,32)

Results are :
rn id
1 1
1 2
1 3
1 4
1 8
1 9
2 10
2 11
2 32
3 14
3 32

This does not show all the relationships.
The above results do not show that 10,11,32 and 14 are all related in a single group.

I want to see results
1,2,3,4,8,9
10,11,14,32
and if i added another association such as 32,1 everything should appear in a single related set.

Can you help with how I might achieve this? This is what I thought the original post from LearningAsIGo was asking...

Thanks in advance !


quote:
Originally posted by sunitabeck

Very nice posting!! So easy to copy the DDL's and see what you are trying to accomplish. Would this do what you are trying to accomplish?
;WITH cte AS
(
	SELECT id1,id2, DENSE_RANK() OVER (ORDER BY id1) AS RN FROM #Associations a
	WHERE NOT EXISTS 
	(SELECT * FROM #Associations b WHERE a.id1 = b.id2)
	
	UNION ALL
	
	SELECT a.id1,a.id2, RN FROM #Associations a
	INNER JOIN cte c ON c.id2 = a.Id1
)
SELECT id1,rn FROM cte UNION SELECT id2,rn FROM cte order by rn;


Go to Top of Page
  Previous Topic Topic Next Topic  
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.11 seconds. Powered By: Snitz Forums 2000