Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
andreibrdn
Starting Member
3 Posts |
Posted - 2011-07-25 : 10:28:17
|
| Hello guys,I have the following table with the following data:create table tbl1(ClusterID VARCHAR(20) null,ID1 VARCHAR(20) null,ID2 VARCHAR(20) null)GOinsert into tbl1 tbl1 values(null, '1', '2');insert into tbl1 tbl1 values(null, '2', '3');insert into tbl1 tbl1 values(null, '1', '3');insert into tbl1 tbl1 values(null, '4', '5');insert into tbl1 tbl1 values(null, '5', '6');insert into tbl1 tbl1 values(null, '6', '7');GOIs there a way to generate some sort of a Cluster ID based on the columns ID1 and ID2?The result should be something like this:ClusterID ID1 ID2-------------------------CL1 1 2CL1 2 3CL1 1 3CL2 4 5CL2 5 6CL2 6 7Any help is much appreciated.Thank you!Andrei |
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2011-07-25 : 10:51:50
|
| What is your criteria for generating the ClusterID? |
 |
|
|
visakh16
Very Important crosS Applying yaK Herder
52326 Posts |
Posted - 2011-07-25 : 11:13:42
|
| is the rule like one new ID per 3 rows? I cant spot any other grouping criteria------------------------------------------------------------------------------------------------------SQL Server MVPhttp://visakhm.blogspot.com/ |
 |
|
|
andreibrdn
Starting Member
3 Posts |
Posted - 2011-07-25 : 11:56:37
|
| Hi robvolk and visakh16,Let's take the first row from the generated output.ID1 = 1, ID2 = 2 so the CLuster should be CL1.Second row:ID1 = 2, ID2 = 3, but 2 has been already found in row 1 (ID2 = 2), so the cluster should be the same i.e. CL1.Third row:ID1 = 1, ID2 = 3 , 1 is found in row 1, 3 is found in row2, so this means that the cluster should be CL1 as well.Next row::ID1 = 4, ID2 = 5. Neither can be found among the first three rows, so I should generate a new ID => CL2and so on..Thanks for your help. I hope I was clear. |
 |
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2011-07-25 : 13:17:18
|
| The problem with that scheme is that it's based on physical order, which is immaterial in a relational database (only values matter). By the way, what happens if you delete rows that would break a sequence? Do the remaining cluster IDs get recalculated?Even if you could automate a way to do this, performance will suffer as the data set gets larger. |
 |
|
|
andreibrdn
Starting Member
3 Posts |
Posted - 2011-07-26 : 02:06:01
|
| Well, this is a table generated based on two other tables. It is actually a result after matching two tables with company information. I have to consolidate these two tables and remove any duplicates. Also, I have to generate unique IDs for each pair of duplicates. The table generated will not be used on the production, so there is no risk - I will not delete records so there is no further process to recalculate the Cluster ID.I will not give you an example with data from these two tables.Table 1:Company name1----------------AlfaAlfa IncBeta 1Beta CoBeta New YorkTable 2:Company name 2Alfa IncAlfa IncorporatedBeta CoBeta New YorkBeta IncUsing some fuzzy algorithms, I have matches the company names of these two tables and got the following table:Company name1 Company name 2-----------------------------------Alfa Alfa IncAlfa Inc Alfa IncorporatedAlfa Alfa IncorporatedBeta 1 Beta CoBeta Co Beta New YorkBeta New York Beta IncNow, what I would like to do is give some sort of unique ID. You will see that the three rows of the last table contain actually the same company, so all three records should have the same Cluster ID. It does not matter if the format is CL1 or 1 or 11. Then, I should look at the next set of duplicated company names and generate another unique ID.Thanks!Andrei |
 |
|
|
LoztInSpace
Aged Yak Warrior
940 Posts |
Posted - 2011-07-26 : 07:40:23
|
| Sounds like a job for row_number() or rank() partitioned by something. I'm not entirely sure though. |
 |
|
|
robvolk
Most Valuable Yak
15732 Posts |
Posted - 2011-07-26 : 09:34:28
|
My suggestion is to combine the company names into a single list. That way you're not comparing 2 or more of the same company in 2 different tables to generate the same matches. This should get you started:SELECT Company INTO UniqueCompaniesFROM Table1UNIONSELECT Company FROM Table2 Then use fuzzy matching to find duplicates between rows of the UniqueCompanies table. You can add a FuzzyMatch column, or something similar, to identify the unique names and use that as your ClusterID. |
 |
|
|
|
|
|
|
|