Hideous string difference query

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Hideous string difference query

Author

Topic

Jaeden
Starting Member

2 Posts

Posted - 2008-01-17 : 19:52:39

Greetings, SQL gurus.

I have a table (creatively named Testing at the moment) which I am querying via PHP. The table contains several columns, but there is one more important than the rest, an nvarchar description ("description"). I'm loading rows up into a Javascript text fader to show random samples of the rows. However, many of the "description" values are VERY close together - a few characters difference over 300 or 400 characters. It looks silly to fade one out and replace it with another, almost-the-same description, because unless the page visitor is looking very carefully, he will likely think it just showed the same description twice.

I'm trying to write a query which filters out the "similar" ones. Let me post a simplified example:

ID Desc
---------------------------
1 This is a neat widget!
2 This thingamabob is great!
3 I love the sexy doodad!
4 This is a cool widget!

If I were to run the desired query and give it, say, a rowcount of 2, I would want it to return rows 2 and 3, since 1 and 4 have other, more similar rows present. I've done about 2 hours of google fu, and determined that the DIFFERENCE() and/or SOUNDEX() functions in MSSQL are my salvation.

However, that seems only to compare two strings I provide. I'd thought to use the Group By or Order By (in conjunction with a rowcount) to select only the X number of "most unique" strings (say, the ones having a minimum DIFFERENCE() of 3 from ALL other rows in the column).

I've been beating my head against a blank query window for hours, and every join, group, order, case I write gets me no closer - I'm hesitant to post any queries because none of them were close enough to go, "Hey, I'm onto something here!".

Because it's a web-based app, I'd like to avoid creating a temp table if I can (for speed) but will do it if it's the only way.

Please advise, and thank you in advance for your assistance.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2008-01-18 : 02:09:26

Have you tried to use the CROSS APPLY together with a split function?

E 12°55'05.25"
N 56°04'39.16"

Jaeden
Starting Member

2 Posts

Posted - 2008-01-18 : 09:27:01

quote:
Originally posted by Peso

Have you tried to use the CROSS APPLY together with a split function?

Actually, I've never heard of such a thing. Could you give me an example of how I might use it?

harsh_athalye
Master Smack Fu Yak Hacker

5581 Posts

Posted - 2008-01-18 : 09:32:15

See here:
http://weblogs.sqlteam.com/jeffs/archive/2007/10/18/sql-server-cross-apply.aspx
http://www.sqlteam.com/article/using-cross-apply-in-sql-server-2005

Harsh Athalye
India.
"The IMPOSSIBLE is often UNTRIED"

blindman
Master Smack Fu Yak Hacker

2365 Posts

Posted - 2008-01-18 : 10:44:19

Jaeden, this fuzzy-search algorithm measures the similarity between different strings:
http://sqlblindman.googlepages.com/fuzzysearchalgorithm

Example:


select	*
into #Samples
from	(select	1 as ID, 'This is a neat widget!' as SampleString
union   select 2, 'This thingamabob is great!'
union   select 3, 'I love the sexy doodad!'
union   select 4, 'This is a cool widget!') Samples

select	A.SampleString,
		B.SampleString,
		dbo.CompareText(A.SampleString, B.SampleString) as Similarity
from	#Samples A,
		#Samples B
where	A.ID <> B.ID
order by dbo.CompareText(A.SampleString, B.SampleString) asc

drop table #Samples

--Output:
SampleString               SampleString               Similarity
-------------------------- -------------------------- -----------
I love the sexy doodad!    This is a neat widget!     4
This is a neat widget!     I love the sexy doodad!    4
This is a cool widget!     I love the sexy doodad!    9
I love the sexy doodad!    This is a cool widget!     9
This thingamabob is great! I love the sexy doodad!    10
I love the sexy doodad!    This thingamabob is great! 10
This is a cool widget!     This thingamabob is great! 39
This thingamabob is great! This is a cool widget!     39
This thingamabob is great! This is a neat widget!     47
This is a neat widget!     This thingamabob is great! 47
This is a cool widget!     This is a neat widget!     76
This is a neat widget!     This is a cool widget!     76

e4 d5 xd5 Nf6

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources