Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 extracting unique row from a massive database

Author  Topic 

c.vaibhav
Starting Member

26 Posts

Posted - 2009-07-24 : 09:57:07
I have a table with 200 rows and 100 columns..

I want to extract unique rows from this table such that no two row are more than 60% similar.. this 60% is a variable and would be user defined..

I have done it in excel.. my logic goes as follows:

1st I compare Row 2 with Row 1 and find out its similarity with it.. If it is more than what is desired I discard this row else I add it to an array.. I move on to the next row and compare it with all the rows present in the array.. if the existing row is similar to any of the row present in the array then it is ruled out else it is added into the array..

I actually have to perform this test on a data containing more than 65000 rows and 200 columns so I need an SQL query that will make my work faster..

Logic of comparison: Column value in each row is compared with the corresponding column value in the relevant rows (added into the array) and is given a relevancy percentage based on that. For example if 100 out of the 200 columns are similar between any two rows then they are 50% similar.

I have posted the same query in daniweb, sqlservercentral forums.. I'll keep you updated if I get to know the solution..

[url]http://www.sqlservercentral.com/Forums/FindPost758641.aspx[/url]

russell
Pyro-ma-ni-yak

5072 Posts

Posted - 2009-07-24 : 10:46:10
I don't think anyone can answer that without seeing the table definition and the exact criteria used.

might want to look at soundex(), difference(), checksum_agg() functions
Go to Top of Page
   

- Advertisement -