Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-23 : 06:56:39
|
Hi, I've inherited a table with data that needs to be grouped. In each row, there are 2 IDs that are related - Id1 and Id2. They are both integers, and, in each row, Id1 is *always* less than Id2. It is possible for IDs in different rows to be related to each other, which can be tracked by either Row1.Id1 or Row1.Id2 appearing in one of the Id fields of Row2, Row3, etc. I need to track all those relations and group all related IDs under the same common GroupId. For example, Row1 might have Id1=20 and Id2=30. Row2 might have Id1=30 and Id2= 40. Since "30" shows up in both rows, then all the IDs are related, and I'd like to group them together in a new table that has a new GroupId (an identity column) and the ID itself. Please see below for a SQL sample.I was thinking of using a CTE, but there is no hierarchy to the data, and I'm not sure what to try next.If anybody has the time to take a look, I would love to hear some ideas. Thank you! -- Set up sample dataCREATE TABLE #Associations(Id1 INT,Id2 INT)INSERT INTO #Associations SELECT 100,200 UNION ALLSELECT 100,300 UNION ALLSELECT 100,400 UNION ALL SELECT 200,500 UNION ALLSELECT 200,600 UNION ALLSELECT 300,700 UNION ALLSELECT 600,800 UNION ALLSELECT 900,1000 UNION ALLSELECT 1000,1100 UNION ALLSELECT 1100,1200 UNION ALLSELECT 1200,1300 UNION ALLSELECT 1400,1500 -- Show mocked-up version of the expected resultCREATE TABLE #ExpectedResult(GroupId INT,Id INT) INSERT INTO #ExpectedResultSELECT 1,100 UNION ALLSELECT 1,200 UNION ALL SELECT 1,300 UNION ALLSELECT 1,400 UNION ALLSELECT 1,500 UNION ALLSELECT 1,600 UNION ALL SELECT 1,700 UNION ALLSELECT 1,800 UNION ALLSELECT 2,900 UNION ALLSELECT 2,1000 UNION ALLSELECT 2,1100 UNION ALLSELECT 2,1200 UNION ALLSELECT 2,1300 UNION ALLSELECT 3,1400 UNION ALLSELECT 3,1500 -- See both tablesSELECT * FROM #AssociationsSELECT * FROM #ExpectedResult-- Clean up--DROP TABLE #Associations --DROP TABLE #ExpectedResult |
|
sunitabeck
Master Smack Fu Yak Hacker
5155 Posts |
Posted - 2012-08-23 : 07:48:00
|
Very nice posting!! So easy to copy the DDL's and see what you are trying to accomplish. Would this do what you are trying to accomplish?;WITH cte AS( SELECT id1,id2, DENSE_RANK() OVER (ORDER BY id1) AS RN FROM #Associations a WHERE NOT EXISTS (SELECT * FROM #Associations b WHERE a.id1 = b.id2) UNION ALL SELECT a.id1,a.id2, RN FROM #Associations a INNER JOIN cte c ON c.id2 = a.Id1)SELECT id1,rn FROM cte UNION SELECT id2,rn FROM cte order by rn; |
|
|
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-23 : 16:05:26
|
Hi Sunita,Thank you for your time and for the compliment. I am really grateful to all the experts here in the forum and try my best not to waste their/your time with badly formatted questions :).The query definitely works with the small sample set and is what I was looking for, but it doesn't scale well with the 500,000 records in my table. So far, it has taken 24 minutes and is still running. I'm going to let it run a little longer and see what happens. Thanks again, Sunita. |
|
|
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-24 : 09:05:08
|
Update: I let the query run to about an hour and then stopped it because it would take too long for the requirements under which the query has to be used. If there are any suggestions on how to approach this for a large record set, I would be very grateful. And, thank you, Sunita, for showing how to do a recursive CTE without a hierarchy. I didn't even know that was possible. |
|
|
sunitabeck
Master Smack Fu Yak Hacker
5155 Posts |
Posted - 2012-08-24 : 09:14:07
|
Do you have any indexes on the table? If you don't and if you are able to add indexes on ID1 and/or ID2, that would help, I think. |
|
|
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-24 : 09:15:40
|
Yep, there are indexes on Id1 and Id2. |
|
|
sunitabeck
Master Smack Fu Yak Hacker
5155 Posts |
Posted - 2012-08-24 : 09:27:33
|
I don't think this will be any faster - but another thought. On smaller sets, this would certainly perform slower, but there is a possibility that it may be better for larger sets. Also, it is destructive, so save your original table.CREATE TABLE #tmp (id1 INT, id2 INT, RN INT);DELETE FROM aOUTPUT DELETED.id1,DELETED.id2,NULLINTO #tmp (id1,id2,RN)FROM #Associations aWHERE NOT EXISTS ( SELECT * FROM #Associations b WHERE a.id1 = b.id2 ); ;WITH cte AS(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) AS RN2 FROM #tmp)UPDATE cte SET RN = RN2;DECLARE @rows INT = 1;WHILE (@rows > 0)BEGIN DELETE FROM a OUTPUT DELETED.id1,DELETED.id2,c.RN INTO #tmp (id1,id2,RN) FROM #Associations a INNER JOIN #tmp c ON c.id2 = a.Id1 SET @rows = @@ROWCOUNT;ENDSELECT id1,rn FROM #tmp UNION SELECT id2,rn FROM #tmp ORDER BY 2,1 |
|
|
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-25 : 08:36:54
|
Wow! That took 10 seconds. I am flabbergasted and very humbled & impressed. My jaw literally dropped. Thank you so much. There is one issue, though, that came up with 8% of the IDs because my sample data above didn't take into account one scenario. The issue occurs when one row's Id2 exists elsewhere, but the row's Id1 doesn't exist anywhere else. In this instance, the existing Id's existing RN doesn't get reused, but an entirely new RN is calculated. Here's an example to clarify. I've added two records to the insert - rows D and E - and also added the provided solution for easier copying.Let's start with Row D.In Row D, I added Id1 = 150 and Id2 = 200. Since the id value "200" already exists in Row A, the RN from Row A should be used here. Instead, a new RN gets calculated. The same thing happens in Row E - a new RN is calculated. This new RN is then applied to all subsequent IDs that are linked to "200".Is there a way to address this? I know I've already taken up a lot of your time and I apologize for the incomplete sample data. CREATE TABLE #Associations(Id1 INT,Id2 INT)INSERT INTO #Associations /*A*/ SELECT 100,200 UNION ALL/*B*/ SELECT 100,300 UNION ALL/*C*/ SELECT 100,400 UNION ALL /* New records below *//*D*/ SELECT 150,200 UNION ALL /*E*/ SELECT 160,200 UNION ALL/* New records above *//*F*/ SELECT 200,500 UNION ALL/*G*/ SELECT 200,600 UNION ALL/*H*/ SELECT 300,700 UNION ALL/*I*/ SELECT 600,800 UNION ALL/*J*/ SELECT 900,1000 UNION ALL/*K*/ SELECT 1000,1100 UNION ALL/*L*/ SELECT 1100,1200 UNION ALL/*M*/ SELECT 1200,1300 UNION ALL/*N*/ SELECT 1400,1500 ---- Table backup (quick & dirty)CREATE TABLE #Associations_Backup (id1 INT, id2 INT)INSERT INTO #Associations_Backup (id1, id2)SELECT id1, id2 FROM #Associations---- Sunita's speedy solution CREATE TABLE #tmp (id1 INT, id2 INT, RN INT);DELETE FROM aOUTPUT DELETED.id1,DELETED.id2,NULLINTO #tmp (id1,id2,RN)FROM #Associations aWHERE NOT EXISTS ( SELECT * FROM #Associations b WHERE a.id1 = b.id2 ); ;WITH cte AS(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) AS RN2 FROM #tmp)UPDATE cte SET RN = RN2;DECLARE @rows INT = 1;WHILE (@rows > 0)BEGIN DELETE FROM a OUTPUT DELETED.id1,DELETED.id2,c.RN INTO #tmp (id1,id2,RN) FROM #Associations a INNER JOIN #tmp c ON c.id2 = a.Id1 SET @rows = @@ROWCOUNT;ENDSELECT id1,rn FROM #tmp UNION SELECT id2,rn FROM #tmp ORDER BY 2,1---- Clean up --DROP TABLE #Associations --DROP TABLE #Associations_Backup --DROP TABLE #tmp |
|
|
joe8079
Posting Yak Master
127 Posts |
Posted - 2012-08-25 : 15:50:57
|
Question: I always thought while loops were considered evil on this board so I'm kind of confused as to how this would work or why one should use a while loop here: WHILE (@rows > 0)BEGIN DELETE FROM a OUTPUT DELETED.id1,DELETED.id2,c.RN INTO #tmp (id1,id2,RN) FROM #Associations a INNER JOIN #tmp c ON c.id2 = a.Id1 SET @rows = @@ROWCOUNT;END |
|
|
sunitabeck
Master Smack Fu Yak Hacker
5155 Posts |
Posted - 2012-08-26 : 09:52:35
|
I would not have expected that much of a change in performance either. Now I am suspecting whether there is something in the data that causes circular references or other things that cause the CTE to go into more recursions than we are expecting it to.In any case, following through with the example that you posted, is it possible that there are multiple linkages - for example, if you were to add another row with id1= 160, id2=1000 (and continuing on for example by adding id1=1400, id2=1000)? Assuming that that is not the case, if you change the group updating as shown below, it should group the data correctly.;WITH cte AS(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) - ROW_NUMBER() OVER(PARTITION BY id2 ORDER BY id1) AS RN2 FROM #tmp)UPDATE cte SET RN = RN2; This would assign unique group ids, but with gaps. If you do need the group id's to be sequential, then:;WITH cte AS(SELECT RN, DENSE_RANK() OVER (ORDER BY id1) - ROW_NUMBER() OVER(PARTITION BY id2 ORDER BY id1) AS RN2 FROM #tmp),cte2 AS (SELECT *,DENSE_RANK() OVER (ORDER BY RN2) AS RN3 FROM cte)UPDATE cte2 SET RN = RN3; |
|
|
sunitabeck
Master Smack Fu Yak Hacker
5155 Posts |
Posted - 2012-08-26 : 09:56:36
|
quote: Originally posted by joe8079 Question: I always thought while loops were considered evil on this board so I'm kind of confused as to how this would work or why one should use a while loop here: WHILE (@rows > 0)BEGIN DELETE FROM a OUTPUT DELETED.id1,DELETED.id2,c.RN INTO #tmp (id1,id2,RN) FROM #Associations a INNER JOIN #tmp c ON c.id2 = a.Id1 SET @rows = @@ROWCOUNT;END
WHILE loops are indeed evil in most cases but it is possible that one or more of the following contribute to better performance in this case:1. There is something in the data that causes the recursive query to go into infinite recursions or collecting more data than we are expecting in each level such that the number of rows in successive recursions are growing exponentially.2. In the while loop query, because we are deleting the data from the table, the size progressively decreases.3. The while loop query is multiple statements, but the recursive query is a single statement. So many smaller transactions vs one large transaction making the smaller transactions run faster.I am just guessing - don't really know which if any of these are causing the better performance. |
|
|
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-26 : 17:08:45
|
I was wondering about the miracle loop, too, so thank you, joe, for asking and Sunita for taking the time to explain.quote: Originally posted by sunitabeck In any case, following through with the example that you posted, is it possible that there are multiple linkages - for example, if you were to add another row with id1= 160, id2=1000 (and continuing on for example by adding id1=1400, id2=1000)?
I just did some analysis on the data and there is a lot of that happening, with the caveat that Id1 is always less than Id2. The following happens quite frequently: If RowA(Id1,Id2) = 2,3 and RowB(Id1,Id2) = 2,4, then RowJ(Id1,Id2) = 3,4. For instances where any given Id1 has a bunch of Id2s, many of the Id2s pair up by themselves later on. --Edited to add: After a bit of Googling, I think this is what was done to the original data: http://stackoverflow.com/questions/4378698/return-all-possible-combinations-of-values-on-columns-in-sqlIs there a way to undo that?--End of edit At this point, would it be better just go back to the client and request better data or is there any way to beat the circular references? Knowing this client, though, I'll probably end up right back here in a couple of weeks with more questions about how to make sense of the new stuff :). |
|
|
sunitabeck
Master Smack Fu Yak Hacker
5155 Posts |
Posted - 2012-08-27 : 07:06:08
|
quote: --Edited to add: After a bit of Googling, I think this is what was done to the original data: http://stackoverflow.com/questions/4378698/return-all-possible-combinations-of-values-on-columns-in-sql
I don't see how the data can be recovered. For example, consider these two cases, both of which should generate the same results if you were to run through the query posted in the link. So then given only the results of the query, it seems to me that the original information is lost forever and unrecoverable.CREATE TABLE #tmp1 (col1 INT, col2 INT);INSERT INTO #tmp1 VALUES (1,2),(3,4);CREATE TABLE #tmp2 (col1 INT, col2 INT);INSERT INTO #tmp2 VALUES (2,3),(1,4); |
|
|
LearningAsIGo
Starting Member
9 Posts |
Posted - 2012-08-28 : 15:56:39
|
Thank you for all your time, Sunita. I've requested better data from the client and your explanations helped me formulate the problem more clearly than I might otherwise have done. |
|
|
ejm
Starting Member
1 Post |
Posted - 2014-03-07 : 11:57:40
|
Hi I tried Sunita's code fragment with the following set: (1,2)(2,3)(2,4)(1,4)(4,8)(1,9)(10,11)(11,32)(14,32)Results are :rn id 1 11 21 31 41 81 92 102 112 323 143 32This does not show all the relationships. The above results do not show that 10,11,32 and 14 are all related in a single group. I want to see results 1,2,3,4,8,9 10,11,14,32 and if i added another association such as 32,1 everything should appear in a single related set.Can you help with how I might achieve this? This is what I thought the original post from LearningAsIGo was asking...Thanks in advance ! quote: Originally posted by sunitabeck Very nice posting!! So easy to copy the DDL's and see what you are trying to accomplish. Would this do what you are trying to accomplish?;WITH cte AS( SELECT id1,id2, DENSE_RANK() OVER (ORDER BY id1) AS RN FROM #Associations a WHERE NOT EXISTS (SELECT * FROM #Associations b WHERE a.id1 = b.id2) UNION ALL SELECT a.id1,a.id2, RN FROM #Associations a INNER JOIN cte c ON c.id2 = a.Id1)SELECT id1,rn FROM cte UNION SELECT id2,rn FROM cte order by rn;
|
|
|
|
|
|
|
|