Generate a Cluster ID based on two columns

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Transact-SQL (2008)

Generate a Cluster ID based on two columns

Author

Topic

andreibrdn
Starting Member

3 Posts

Posted - 2011-07-25 : 10:28:17

Hello guys,

I have the following table with the following data:

create table tbl1
(
ClusterID VARCHAR(20) null,
ID1 VARCHAR(20) null,
ID2 VARCHAR(20) null
)
GO

insert into tbl1 tbl1 values
(null, '1', '2');
insert into tbl1 tbl1 values
(null, '2', '3');
insert into tbl1 tbl1 values
(null, '1', '3');
insert into tbl1 tbl1 values
(null, '4', '5');
insert into tbl1 tbl1 values
(null, '5', '6');
insert into tbl1 tbl1 values
(null, '6', '7');
GO

Is there a way to generate some sort of a Cluster ID based on the columns ID1 and ID2?

The result should be something like this:

ClusterID ID1 ID2
-------------------------
CL1 1 2
CL1 2 3
CL1 1 3
CL2 4 5
CL2 5 6
CL2 6 7

Any help is much appreciated.

Thank you!
Andrei

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-07-25 : 10:51:50

What is your criteria for generating the ClusterID?

visakh16
Very Important crosS Applying yaK Herder

52326 Posts

Posted - 2011-07-25 : 11:13:42

is the rule like one new ID per 3 rows? I cant spot any other grouping criteria

------------------------------------------------------------------------------------------------------
SQL Server MVP
http://visakhm.blogspot.com/

andreibrdn
Starting Member

3 Posts

Posted - 2011-07-25 : 11:56:37

Hi robvolk and visakh16,

Let's take the first row from the generated output.

ID1 = 1, ID2 = 2 so the CLuster should be CL1.

Second row:

ID1 = 2, ID2 = 3, but 2 has been already found in row 1 (ID2 = 2), so the cluster should be the same i.e. CL1.

Third row:

ID1 = 1, ID2 = 3 , 1 is found in row 1, 3 is found in row2, so this means that the cluster should be CL1 as well.

Next row::

ID1 = 4, ID2 = 5. Neither can be found among the first three rows, so I should generate a new ID => CL2

and so on..

Thanks for your help. I hope I was clear.

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-07-25 : 13:17:18

The problem with that scheme is that it's based on physical order, which is immaterial in a relational database (only values matter). By the way, what happens if you delete rows that would break a sequence? Do the remaining cluster IDs get recalculated?

Even if you could automate a way to do this, performance will suffer as the data set gets larger.

andreibrdn
Starting Member

3 Posts

Posted - 2011-07-26 : 02:06:01

Well, this is a table generated based on two other tables. It is actually a result after matching two tables with company information. I have to consolidate these two tables and remove any duplicates. Also, I have to generate unique IDs for each pair of duplicates. The table generated will not be used on the production, so there is no risk - I will not delete records so there is no further process to recalculate the Cluster ID.

I will not give you an example with data from these two tables.

Table 1:

Company name1
----------------
Alfa
Alfa Inc
Beta 1
Beta Co
Beta New York

Table 2:

Company name 2
Alfa Inc
Alfa Incorporated
Beta Co
Beta New York
Beta Inc

Using some fuzzy algorithms, I have matches the company names of these two tables and got the following table:

Company name1 Company name 2
-----------------------------------
Alfa Alfa Inc
Alfa Inc Alfa Incorporated
Alfa Alfa Incorporated
Beta 1 Beta Co
Beta Co Beta New York
Beta New York Beta Inc

Now, what I would like to do is give some sort of unique ID. You will see that the three rows of the last table contain actually the same company, so all three records should have the same Cluster ID. It does not matter if the format is CL1 or 1 or 11. Then, I should look at the next set of duplicated company names and generate another unique ID.

Thanks!
Andrei

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2011-07-26 : 07:40:23

Sounds like a job for row_number() or rank() partitioned by something. I'm not entirely sure though.

robvolk
Most Valuable Yak

15732 Posts

Posted - 2011-07-26 : 09:34:28

My suggestion is to combine the company names into a single list. That way you're not comparing 2 or more of the same company in 2 different tables to generate the same matches. This should get you started:

SELECT Company 
INTO UniqueCompanies
FROM Table1
UNION
SELECT Company FROM Table2

Then use fuzzy matching to find duplicates between rows of the UniqueCompanies table. You can add a FuzzyMatch column, or something similar, to identify the unique names and use that as your ClusterID.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources