Recursive query without a hierarchy - .5 mill recs

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Transact-SQL (2008)

Recursive query without a hierarchy - .5 mill recs

Author

Topic

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-23 : 06:56:39

Hi,

I've inherited a table with data that needs to be grouped.

In each row, there are 2 IDs that are related - Id1 and Id2. They are both integers, and, in each row, Id1 is *always* less than Id2.

It is possible for IDs in different rows to be related to each other, which can be tracked by either Row1.Id1 or Row1.Id2 appearing in one of the Id fields of Row2, Row3, etc.

I need to track all those relations and group all related IDs under the same common GroupId. For example, Row1 might have Id1=20 and Id2=30. Row2 might have Id1=30 and Id2= 40. Since "30" shows up in both rows, then all the IDs are related, and I'd like to group them together in a new table that has a new GroupId (an identity column) and the ID itself. Please see below for a SQL sample.

I was thinking of using a CTE, but there is no hierarchy to the data, and I'm not sure what to try next.

If anybody has the time to take a look, I would love to hear some ideas. Thank you!


-- Set up sample data
CREATE TABLE #Associations
(Id1 INT,
Id2 INT)

INSERT INTO #Associations 
SELECT 100,200 UNION ALL
SELECT 100,300 UNION ALL
SELECT 100,400 UNION ALL 
SELECT 200,500 UNION ALL
SELECT 200,600 UNION ALL
SELECT 300,700 UNION ALL
SELECT 600,800 UNION ALL
SELECT 900,1000 UNION ALL
SELECT 1000,1100 UNION ALL
SELECT 1100,1200 UNION ALL
SELECT 1200,1300 UNION ALL
SELECT 1400,1500 


-- Show mocked-up version of the expected result

CREATE TABLE #ExpectedResult
(GroupId INT,
Id INT) 

INSERT INTO #ExpectedResult
SELECT 1,100 UNION ALL
SELECT 1,200 UNION ALL 
SELECT 1,300 UNION ALL
SELECT 1,400 UNION ALL
SELECT 1,500 UNION ALL
SELECT 1,600 UNION ALL 
SELECT 1,700 UNION ALL
SELECT 1,800 UNION ALL
SELECT 2,900     UNION ALL
SELECT 2,1000	   UNION ALL
SELECT 2,1100	   UNION ALL
SELECT 2,1200	   UNION ALL
SELECT 2,1300	   UNION ALL
SELECT 3,1400	   UNION ALL
SELECT 3,1500   

-- See both tables

SELECT * FROM #Associations
SELECT * FROM #ExpectedResult

-- Clean up

--DROP TABLE #Associations 
--DROP TABLE #ExpectedResult

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-23 : 07:48:00

Very nice posting!! So easy to copy the DDL's and see what you are trying to accomplish. Would this do what you are trying to accomplish?

;WITH cte AS
(
	SELECT id1,id2, DENSE_RANK() OVER (ORDER BY id1) AS RN FROM #Associations a
	WHERE NOT EXISTS 
	(SELECT * FROM #Associations b WHERE a.id1 = b.id2)
	
	UNION ALL
	
	SELECT a.id1,a.id2, RN FROM #Associations a
	INNER JOIN cte c ON c.id2 = a.Id1
)
SELECT id1,rn FROM cte UNION SELECT id2,rn FROM cte order by rn;

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-23 : 16:05:26

Hi Sunita,

Thank you for your time and for the compliment. I am really grateful to all the experts here in the forum and try my best not to waste their/your time with badly formatted questions :).

The query definitely works with the small sample set and is what I was looking for, but it doesn't scale well with the 500,000 records in my table. So far, it has taken 24 minutes and is still running. I'm going to let it run a little longer and see what happens.

Thanks again, Sunita.

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-24 : 09:05:08

Update: I let the query run to about an hour and then stopped it because it would take too long for the requirements under which the query has to be used.

If there are any suggestions on how to approach this for a large record set, I would be very grateful.

And, thank you, Sunita, for showing how to do a recursive CTE without a hierarchy. I didn't even know that was possible.

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-24 : 09:14:07

Do you have any indexes on the table? If you don't and if you are able to add indexes on ID1 and/or ID2, that would help, I think.

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-24 : 09:15:40

Yep, there are indexes on Id1 and Id2.

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-24 : 09:27:33

I don't think this will be any faster - but another thought. On smaller sets, this would certainly perform slower, but there is a possibility that it may be better for larger sets. Also, it is destructive, so save your original table.

CREATE TABLE #tmp (id1 INT, id2 INT, RN INT);

DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,NULL
INTO #tmp (id1,id2,RN)
FROM
	#Associations a
WHERE
	NOT EXISTS 
	( SELECT * FROM #Associations b WHERE a.id1 = b.id2 ); 
	
;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) AS RN2 FROM #tmp)
UPDATE cte SET RN = RN2;

DECLARE @rows INT = 1;
WHILE (@rows > 0)
BEGIN
	DELETE FROM a
	OUTPUT DELETED.id1,DELETED.id2,c.RN
	INTO #tmp (id1,id2,RN)
	FROM
		#Associations a
		INNER JOIN  #tmp c ON c.id2 = a.Id1
	SET @rows = @@ROWCOUNT;
END

SELECT id1,rn FROM #tmp UNION SELECT id2,rn FROM #tmp ORDER BY 2,1

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-25 : 08:36:54

Wow! That took 10 seconds. I am flabbergasted and very humbled & impressed. My jaw literally dropped. Thank you so much.

There is one issue, though, that came up with 8% of the IDs because my sample data above didn't take into account one scenario.

The issue occurs when one row's Id2 exists elsewhere, but the row's Id1 doesn't exist anywhere else. In this instance, the existing Id's existing RN doesn't get reused, but an entirely new RN is calculated.

Here's an example to clarify. I've added two records to the insert - rows D and E - and also added the provided solution for easier copying.

Let's start with Row D.

In Row D, I added Id1 = 150 and Id2 = 200. Since the id value "200" already exists in Row A, the RN from Row A should be used here. Instead, a new RN gets calculated. The same thing happens in Row E - a new RN is calculated. This new RN is then applied to all subsequent IDs that are linked to "200".

Is there a way to address this? I know I've already taken up a lot of your time and I apologize for the incomplete sample data.


CREATE TABLE #Associations
(Id1 INT,
Id2 INT)

INSERT INTO #Associations 
/*A*/ SELECT 100,200 UNION ALL
/*B*/ SELECT 100,300 UNION ALL
/*C*/ SELECT 100,400 UNION ALL 

/* New records below */
/*D*/ SELECT 150,200 UNION ALL 
/*E*/ SELECT 160,200 UNION ALL
/* New records above */

/*F*/ SELECT 200,500 UNION ALL
/*G*/ SELECT 200,600 UNION ALL
/*H*/ SELECT 300,700 UNION ALL
/*I*/ SELECT 600,800 UNION ALL
/*J*/ SELECT 900,1000 UNION ALL
/*K*/ SELECT 1000,1100 UNION ALL
/*L*/ SELECT 1100,1200 UNION ALL
/*M*/ SELECT 1200,1300 UNION ALL
/*N*/ SELECT 1400,1500 

 

---- Table backup (quick & dirty)

CREATE TABLE #Associations_Backup (id1 INT, id2 INT)
INSERT INTO #Associations_Backup (id1, id2)
SELECT id1, id2 FROM #Associations

---- Sunita's speedy solution 

CREATE TABLE #tmp (id1 INT, id2 INT, RN INT);

DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,NULL
INTO #tmp (id1,id2,RN)
FROM
	#Associations a
WHERE
	NOT EXISTS 
	( SELECT * FROM #Associations b WHERE a.id1 = b.id2 ); 
	
;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) AS RN2 FROM #tmp)
UPDATE cte SET RN = RN2;

DECLARE @rows INT = 1;
WHILE (@rows > 0)
BEGIN
	DELETE FROM a
	OUTPUT DELETED.id1,DELETED.id2,c.RN
	INTO #tmp (id1,id2,RN)
	FROM
		#Associations a
		INNER JOIN  #tmp c ON c.id2 = a.Id1
	SET @rows = @@ROWCOUNT;
END

SELECT id1,rn FROM #tmp UNION SELECT id2,rn FROM #tmp ORDER BY 2,1

---- Clean up

 --DROP TABLE #Associations
 --DROP TABLE #Associations_Backup
 --DROP TABLE #tmp

joe8079
Posting Yak Master

127 Posts

Posted - 2012-08-25 : 15:50:57

Question:

I always thought while loops were considered evil on this board so I'm kind of confused as to how this would work or why one should use a while loop here:

WHILE (@rows > 0)
BEGIN
DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,c.RN
INTO #tmp (id1,id2,RN)
FROM
#Associations a
INNER JOIN #tmp c ON c.id2 = a.Id1
SET @rows = @@ROWCOUNT;
END

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-26 : 09:52:35

I would not have expected that much of a change in performance either. Now I am suspecting whether there is something in the data that causes circular references or other things that cause the CTE to go into more recursions than we are expecting it to.

In any case, following through with the example that you posted, is it possible that there are multiple linkages - for example, if you were to add another row with id1= 160, id2=1000 (and continuing on for example by adding id1=1400, id2=1000)? Assuming that that is not the case, if you change the group updating as shown below, it should group the data correctly.

;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1)
	- ROW_NUMBER() OVER(PARTITION BY id2 ORDER BY id1) AS RN2 FROM #tmp)
UPDATE cte SET RN = RN2;

This would assign unique group ids, but with gaps. If you do need the group id's to be sequential, then:

;WITH cte AS
(SELECT RN, DENSE_RANK() OVER (ORDER BY id1)
	- ROW_NUMBER() OVER(PARTITION BY id2 ORDER BY id1) AS RN2 FROM #tmp),
cte2 AS 
(SELECT *,DENSE_RANK() OVER (ORDER BY RN2) AS RN3 FROM cte)
UPDATE cte2 SET RN = RN3;

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-26 : 09:56:36

quote:
Originally posted by joe8079

Question:

I always thought while loops were considered evil on this board so I'm kind of confused as to how this would work or why one should use a while loop here:

WHILE (@rows > 0)
BEGIN
DELETE FROM a
OUTPUT DELETED.id1,DELETED.id2,c.RN
INTO #tmp (id1,id2,RN)
FROM
#Associations a
INNER JOIN #tmp c ON c.id2 = a.Id1
SET @rows = @@ROWCOUNT;
END

WHILE loops are indeed evil in most cases but it is possible that one or more of the following contribute to better performance in this case:

1. There is something in the data that causes the recursive query to go into infinite recursions or collecting more data than we are expecting in each level such that the number of rows in successive recursions are growing exponentially.

2. In the while loop query, because we are deleting the data from the table, the size progressively decreases.

3. The while loop query is multiple statements, but the recursive query is a single statement. So many smaller transactions vs one large transaction making the smaller transactions run faster.

I am just guessing - don't really know which if any of these are causing the better performance.

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-26 : 17:08:45

I was wondering about the miracle loop, too, so thank you, joe, for asking and Sunita for taking the time to explain.

quote:
Originally posted by sunitabeck

In any case, following through with the example that you posted, is it possible that there are multiple linkages - for example, if you were to add another row with id1= 160, id2=1000 (and continuing on for example by adding id1=1400, id2=1000)?

I just did some analysis on the data and there is a lot of that happening, with the caveat that Id1 is always less than Id2.

The following happens quite frequently:
If RowA(Id1,Id2) = 2,3 and RowB(Id1,Id2) = 2,4, then RowJ(Id1,Id2) = 3,4. For instances where any given Id1 has a bunch of Id2s, many of the Id2s pair up by themselves later on.

--Edited to add: After a bit of Googling, I think this is what was done to the original data: http://stackoverflow.com/questions/4378698/return-all-possible-combinations-of-values-on-columns-in-sql

Is there a way to undo that?
--End of edit

At this point, would it be better just go back to the client and request better data or is there any way to beat the circular references? Knowing this client, though, I'll probably end up right back here in a couple of weeks with more questions about how to make sense of the new stuff :).

sunitabeck
Master Smack Fu Yak Hacker

5155 Posts

Posted - 2012-08-27 : 07:06:08

quote:
--Edited to add: After a bit of Googling, I think this is what was done to the original data: http://stackoverflow.com/questions/4378698/return-all-possible-combinations-of-values-on-columns-in-sql

I don't see how the data can be recovered. For example, consider these two cases, both of which should generate the same results if you were to run through the query posted in the link. So then given only the results of the query, it seems to me that the original information is lost forever and unrecoverable.

CREATE TABLE #tmp1 (col1 INT, col2 INT);
INSERT INTO #tmp1 VALUES (1,2),(3,4);

CREATE TABLE #tmp2 (col1 INT, col2 INT);
INSERT INTO #tmp2 VALUES (2,3),(1,4);

LearningAsIGo
Starting Member

9 Posts

Posted - 2012-08-28 : 15:56:39

Thank you for all your time, Sunita.

I've requested better data from the client and your explanations helped me formulate the problem more clearly than I might otherwise have done.

ejm
Starting Member

1 Post

Posted - 2014-03-07 : 11:57:40

Hi
I tried Sunita's code fragment with the following set:

(1,2)
(2,3)
(2,4)
(1,4)
(4,8)
(1,9)
(10,11)
(11,32)
(14,32)

Results are :
rn id
1 1
1 2
1 3
1 4
1 8
1 9
2 10
2 11
2 32
3 14
3 32

This does not show all the relationships.
The above results do not show that 10,11,32 and 14 are all related in a single group.

I want to see results
1,2,3,4,8,9
10,11,14,32
and if i added another association such as 32,1 everything should appear in a single related set.

Can you help with how I might achieve this? This is what I thought the original post from LearningAsIGo was asking...

Thanks in advance !

quote:
Originally posted by sunitabeck

Very nice posting!! So easy to copy the DDL's and see what you are trying to accomplish. Would this do what you are trying to accomplish?
;WITH cte AS
(
SELECT id1,id2, DENSE_RANK() OVER (ORDER BY id1) AS RN FROM #Associations a
WHERE NOT EXISTS
(SELECT * FROM #Associations b WHERE a.id1 = b.id2)

UNION ALL

SELECT a.id1,a.id2, RN FROM #Associations a
INNER JOIN cte c ON c.id2 = a.Id1
)
SELECT id1,rn FROM cte UNION SELECT id2,rn FROM cte order by rn;

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources