Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Finding a substring within a text

Author  Topic 

kaduri
Starting Member

7 Posts

Posted - 2007-11-07 : 06:01:54
Hello everyone,

I have 2 identical tables each containing 3 fields (ID, Date, Text). Each table holds news announcements from different publishers using slightly different presentations of identical announcements. I am trying to identify duplicate anouncements.

Example:

Table 1:
ID: 872
Date: 3/20/1999
Text: SomeTown, MA -- YYY, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, today announced the 1234 Family of stand-alone workstations xxx is a registered trademark of yyy

Table 2
ID: 297
Date:3/20/1999
Text: March 3, 1999YYY Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the 1234 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments.


Important: The length and contents of the text in red is NOT constant. Only the green substring may appear in both entries (if the two are in fact duplicate anouncements).


I am using SQL Server 2005 Express on Windows XP

Any help/suggestion will be greatly appreciated.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 06:05:39
First filter by Date.
Second filter by DATALENGTH. If datalength is different, the records are not identical.

SELECT Date, DATALENGTH(Text) FRMO Table1 GROUP BY Date, DATALENGTH(Text) HAVING COUNT(*) > 1



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

kaduri
Starting Member

7 Posts

Posted - 2007-11-07 : 06:10:48
Hi Peso,

The datalength is not a valid indicator since the text in red is not constant, unknown, and different for each record. Only the green substring embedded within each announcement can be usd to determine whether the to records are in deed identical
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 06:25:01
Really?
If you want to go for IDENTICAL records, start out with those that are of same length and date.
Can two records of different length be identical?
Can two records of different dates be identical?

After finding records with same date and length, then it's time to see if the CONTENT of TEXT column is identical. I think you can do that with LIKE operator.


E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 06:32:51
I think I got what you are after, now.

How do you know that it is the complete green part that is identical?
Why not just the shorter "computer graphics workstations" part?



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 06:33:29
Also, is there a special reason to use TEXT in favor of VARCHAR(8000)?
Or even VARCHAR(MAX)?


E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

kaduri
Starting Member

7 Posts

Posted - 2007-11-07 : 06:42:23
The green part will exist in both records if they are to be considered identical. However, I do not now the text of the green part as this is to be identified. Once I now and can extract the the green part, I will be able to add additional criteria to making a decision (e.g. if the green part is more than 90% of the entire text, I will consider them to be identical.

In the Db, I am using nvarchar(max)
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 07:03:48
90% based on what?

Character matching? Word matching?
Must they be in same order?

Is
Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, today announced the 1234 Family of stand-alone workstations
the same as
Inc., manufacturer and marketer of interactivc, high-performance computer graphics workstations, today announced the 1234 Family of stand-alone workstations


E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

kaduri
Starting Member

7 Posts

Posted - 2007-11-07 : 07:26:50
Peso,
Thanks for your effort in trying to help. I truly appreciate it!

Here are two records that for my purposes are considered identical: Note the following (which makes this thing so complicated)
1. Both annoucnements have different introductions and endings
2. Within the text, sometimes spaces or other specil characters are dfferent. This however, should not be a problem as I can strip the text off any spaces and characters if neccessary)

The 90% thing was only an example of what I can do once I determine that entries such as these are identical (please ignore it for now)

Announcemenet 1:
Adage Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the Adage 6500 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin in December. OCEAN is a trademark of Adage Inc. MicroVAX and MicroVMS are trademarks of Digital Equipment Corp. PATRAN is a registered trademark of PDA Engineering. ANSYS is a registered trademark of Swanson Analysis Systems Inc. NASTRAN is a registered trademark of National Aeronautics and Space Administration. CADAM and IPC are registered trademarks of CADAM Inc. CATIA is a registered trademark of Dassault Systemes CAEDS is a trademark of SDRC. MSC/NASTRAN is a product of MacNeal-Schwendler Corp. CONTACT: Adage Inc., Billerica Linda Standford, 212/123-7070

-----------------------------------------------------------------
Announcemenet 2:
Contact: Linda Standford, (212) 123-7070 September 10, 1985ADAGE, INC. INTRODUCES 6500 WORKSTATION FAMILYOF STAND-ALONE WORKSTATIONSBILLERICA, MA -- Adage, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, todayannounced the Adage 6500 Family of stand-alone workstations thatallows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin inDecember. OCEAN is a trademark of Adage, Inc.MicroVAX and MicroVMS are trademarks of Digital EquipmentCorporation. PATRAN is a registered trademark of PDA Engineering.ANSYS is a registered trademark of Swanson Analysis Systems, Inc.NASTRAN is a registered trademark of National Aeronautics and SpaceAdministration. CADAM and IPC are registered trademarks of CADAM INC.CATIA is a registered trademark of Dassault Systems.CAEDS is a trademark of SDRC.MSC/NASTRAN is a product of MacNeal-Schwendler Corporation. Linda Standford, 212/123-7070
Go to Top of Page

kaduri
Starting Member

7 Posts

Posted - 2007-11-07 : 07:40:01
My two tables contain thoussands of announcements and I need to run a query/procedure that will match entries on the Date fields between the two tables and then compare announcemenet swith the same date to SOMEHOW PROGRAMMATICALLY decide if they are identical.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 08:35:01
[code]DECLARE @Table1 TABLE (RecID INT PRIMARY KEY, Announcement VARCHAR(MAX))

INSERT @Table1
SELECT 1, 'Adage Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the Adage 6500 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin in December. OCEAN is a trademark of Adage Inc. MicroVAX and MicroVMS are trademarks of Digital Equipment Corp. PATRAN is a registered trademark of PDA Engineering. ANSYS is a registered trademark of Swanson Analysis Systems Inc. NASTRAN is a registered trademark of National Aeronautics and Space Administration. CADAM and IPC are registered trademarks of CADAM Inc. CATIA is a registered trademark of Dassault Systemes CAEDS is a trademark of SDRC. MSC/NASTRAN is a product of MacNeal-Schwendler Corp. CONTACT: Adage Inc., Billerica Linda Standford, 212/123-7070' UNION ALL
SELECT 2, 'Adage Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the Adage 6500 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin in December. OCEAN is a trademark of Adage Inc. MicroVAX and MicroVMS are trademarks of Digital Equipment Corp. PATRAN is a registered trademark of PDA Engineering. ANSYS is a registered trademark of Swanson Analysis Systems Inc. NASTRAN is a registered trademark of National Aeronautics and Space Administration. CADAM and IPC are registered trademarks of CADAM Inc. CATIA is a registered trademark of Dassault Systemes CAEDS is a trademark of SDRC. MSC/NASTRAN is a product of MacNeal-Schwendler Corp.'

DECLARE @Table2 TABLE (RecID INT PRIMARY KEY, Announcement VARCHAR(MAX))

INSERT @Table2
SELECT 2, 'Contact: Linda Standford, (212) 123-7070 September 10, 1985ADAGE, INC. INTRODUCES 6500 WORKSTATION FAMILYOF STAND-ALONE WORKSTATIONSBILLERICA, MA -- Adage, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, todayannounced the Adage 6500 Family of stand-alone workstations thatallows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin inDecember. OCEAN is a trademark of Adage, Inc.MicroVAX and MicroVMS are trademarks of Digital EquipmentCorporation. PATRAN is a registered trademark of PDA Engineering.ANSYS is a registered trademark of Swanson Analysis Systems, Inc.NASTRAN is a registered trademark of National Aeronautics and SpaceAdministration. CADAM and IPC are registered trademarks of CADAM INC.CATIA is a registered trademark of Dassault Systems.CAEDS is a trademark of SDRC.MSC/NASTRAN is a product of MacNeal-Schwendler Corporation. Linda Standford, 212/123-7070'

SELECT t1a.RecID AS [Table1 RecID],
t2a.RecID AS [Table2 RecID],
1.0 * COUNT(*) / MAX(t1b.Items) AS [Table1 %],
1.0 * COUNT(*) / MAX(t2b.Items) AS [Table2 %]
FROM (
SELECT DISTINCT t1.RecID,
REPLACE(REPLACE(a.Data, ',', ''), '.', '') AS Data
FROM @Table1 AS t1
CROSS APPLY dbo.fnParseList(' ', t1.Announcement) AS a
) AS t1a
INNER JOIN (
SELECT DISTINCT t2.RecID,
REPLACE(REPLACE(a.Data, ',', ''), '.', '') AS Data
FROM @Table2 AS t2
CROSS APPLY dbo.fnParseList(' ', t2.Announcement) AS a
) AS t2a ON t2a.Data = t1a.Data
INNER JOIN (
SELECT t1.RecID,
COUNT(DISTINCT a.Data) AS Items
FROM @Table1 AS t1
CROSS APPLY dbo.fnParseList(' ', t1.Announcement) AS a
GROUP BY t1.RecID
) AS t1b ON t1b.RecID = t1a.RecID
INNER JOIN (
SELECT t2.RecID,
COUNT(DISTINCT a.Data) AS Items
FROM @Table2 AS t2
CROSS APPLY dbo.fnParseList(' ', t2.Announcement) AS a
GROUP BY t2.RecID
) AS t2b ON t2b.RecID = t2a.RecID
GROUP BY t1a.RecID,
t2a.RecID[/code]


E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 08:36:13
Have you tried using FULLTEXT INDEXING SERVICE?



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

kaduri
Starting Member

7 Posts

Posted - 2007-11-07 : 09:14:32
You seem to be using a custom function "fnParseList". Can you provide the source for it?

I haven't thought about FULLTEXT INDEXING SERVICE. I will look into it.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-11-07 : 10:03:08
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=76033



E 12°55'05.25"
N 56°04'39.16"
Go to Top of Page

kaduri
Starting Member

7 Posts

Posted - 2007-11-08 : 01:54:35
Very nice functions! I must admit that I do not understnd the eactly how the query works, but it seems to work great.

I had posted the same question also here http://www.sqlservercentral.com/Forums/Topic419526-338-1.aspx and got a suggestion to use an LCS algorithm.

Thank you very much!!!
Go to Top of Page
   

- Advertisement -