Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
Author |
Topic |
kaduri
Starting Member
7 Posts |
Posted - 2007-11-07 : 06:01:54
|
Hello everyone, I have 2 identical tables each containing 3 fields (ID, Date, Text). Each table holds news announcements from different publishers using slightly different presentations of identical announcements. I am trying to identify duplicate anouncements. Example: Table 1:ID: 872 Date: 3/20/1999 Text: SomeTown, MA -- YYY, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, today announced the 1234 Family of stand-alone workstations xxx is a registered trademark of yyyTable 2ID: 297Date:3/20/1999Text: March 3, 1999YYY Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the 1234 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Important: The length and contents of the text in red is NOT constant. Only the green substring may appear in both entries (if the two are in fact duplicate anouncements).I am using SQL Server 2005 Express on Windows XPAny help/suggestion will be greatly appreciated. |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 06:05:39
|
First filter by Date.Second filter by DATALENGTH. If datalength is different, the records are not identical.SELECT Date, DATALENGTH(Text) FRMO Table1 GROUP BY Date, DATALENGTH(Text) HAVING COUNT(*) > 1 E 12°55'05.25"N 56°04'39.16" |
 |
|
kaduri
Starting Member
7 Posts |
Posted - 2007-11-07 : 06:10:48
|
Hi Peso, The datalength is not a valid indicator since the text in red is not constant, unknown, and different for each record. Only the green substring embedded within each announcement can be usd to determine whether the to records are in deed identical |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 06:25:01
|
Really?If you want to go for IDENTICAL records, start out with those that are of same length and date.Can two records of different length be identical?Can two records of different dates be identical?After finding records with same date and length, then it's time to see if the CONTENT of TEXT column is identical. I think you can do that with LIKE operator. E 12°55'05.25"N 56°04'39.16" |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 06:32:51
|
I think I got what you are after, now.How do you know that it is the complete green part that is identical?Why not just the shorter "computer graphics workstations" part? E 12°55'05.25"N 56°04'39.16" |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 06:33:29
|
Also, is there a special reason to use TEXT in favor of VARCHAR(8000)?Or even VARCHAR(MAX)? E 12°55'05.25"N 56°04'39.16" |
 |
|
kaduri
Starting Member
7 Posts |
Posted - 2007-11-07 : 06:42:23
|
The green part will exist in both records if they are to be considered identical. However, I do not now the text of the green part as this is to be identified. Once I now and can extract the the green part, I will be able to add additional criteria to making a decision (e.g. if the green part is more than 90% of the entire text, I will consider them to be identical.In the Db, I am using nvarchar(max) |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 07:03:48
|
90% based on what? Character matching? Word matching?Must they be in same order?IsInc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, today announced the 1234 Family of stand-alone workstationsthe same asInc., manufacturer and marketer of interactivc, high-performance computer graphics workstations, today announced the 1234 Family of stand-alone workstations E 12°55'05.25"N 56°04'39.16" |
 |
|
kaduri
Starting Member
7 Posts |
Posted - 2007-11-07 : 07:26:50
|
Peso, Thanks for your effort in trying to help. I truly appreciate it!Here are two records that for my purposes are considered identical: Note the following (which makes this thing so complicated)1. Both annoucnements have different introductions and endings2. Within the text, sometimes spaces or other specil characters are dfferent. This however, should not be a problem as I can strip the text off any spaces and characters if neccessary)The 90% thing was only an example of what I can do once I determine that entries such as these are identical (please ignore it for now)Announcemenet 1:Adage Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the Adage 6500 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin in December. OCEAN is a trademark of Adage Inc. MicroVAX and MicroVMS are trademarks of Digital Equipment Corp. PATRAN is a registered trademark of PDA Engineering. ANSYS is a registered trademark of Swanson Analysis Systems Inc. NASTRAN is a registered trademark of National Aeronautics and Space Administration. CADAM and IPC are registered trademarks of CADAM Inc. CATIA is a registered trademark of Dassault Systemes CAEDS is a trademark of SDRC. MSC/NASTRAN is a product of MacNeal-Schwendler Corp. CONTACT: Adage Inc., Billerica Linda Standford, 212/123-7070-----------------------------------------------------------------Announcemenet 2:Contact: Linda Standford, (212) 123-7070 September 10, 1985ADAGE, INC. INTRODUCES 6500 WORKSTATION FAMILYOF STAND-ALONE WORKSTATIONSBILLERICA, MA -- Adage, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, todayannounced the Adage 6500 Family of stand-alone workstations thatallows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin inDecember. OCEAN is a trademark of Adage, Inc.MicroVAX and MicroVMS are trademarks of Digital EquipmentCorporation. PATRAN is a registered trademark of PDA Engineering.ANSYS is a registered trademark of Swanson Analysis Systems, Inc.NASTRAN is a registered trademark of National Aeronautics and SpaceAdministration. CADAM and IPC are registered trademarks of CADAM INC.CATIA is a registered trademark of Dassault Systems.CAEDS is a trademark of SDRC.MSC/NASTRAN is a product of MacNeal-Schwendler Corporation. Linda Standford, 212/123-7070 |
 |
|
kaduri
Starting Member
7 Posts |
Posted - 2007-11-07 : 07:40:01
|
My two tables contain thoussands of announcements and I need to run a query/procedure that will match entries on the Date fields between the two tables and then compare announcemenet swith the same date to SOMEHOW PROGRAMMATICALLY decide if they are identical. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 08:35:01
|
[code]DECLARE @Table1 TABLE (RecID INT PRIMARY KEY, Announcement VARCHAR(MAX))INSERT @Table1SELECT 1, 'Adage Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the Adage 6500 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin in December. OCEAN is a trademark of Adage Inc. MicroVAX and MicroVMS are trademarks of Digital Equipment Corp. PATRAN is a registered trademark of PDA Engineering. ANSYS is a registered trademark of Swanson Analysis Systems Inc. NASTRAN is a registered trademark of National Aeronautics and Space Administration. CADAM and IPC are registered trademarks of CADAM Inc. CATIA is a registered trademark of Dassault Systemes CAEDS is a trademark of SDRC. MSC/NASTRAN is a product of MacNeal-Schwendler Corp. CONTACT: Adage Inc., Billerica Linda Standford, 212/123-7070' UNION ALLSELECT 2, 'Adage Inc., manufacturer and marketer of high-performance, interactive computer graphics workstations, Tuesday announced the Adage 6500 Family of stand-alone workstations that allows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin in December. OCEAN is a trademark of Adage Inc. MicroVAX and MicroVMS are trademarks of Digital Equipment Corp. PATRAN is a registered trademark of PDA Engineering. ANSYS is a registered trademark of Swanson Analysis Systems Inc. NASTRAN is a registered trademark of National Aeronautics and Space Administration. CADAM and IPC are registered trademarks of CADAM Inc. CATIA is a registered trademark of Dassault Systemes CAEDS is a trademark of SDRC. MSC/NASTRAN is a product of MacNeal-Schwendler Corp.'DECLARE @Table2 TABLE (RecID INT PRIMARY KEY, Announcement VARCHAR(MAX))INSERT @Table2SELECT 2, 'Contact: Linda Standford, (212) 123-7070 September 10, 1985ADAGE, INC. INTRODUCES 6500 WORKSTATION FAMILYOF STAND-ALONE WORKSTATIONSBILLERICA, MA -- Adage, Inc., manufacturer and marketer of high-performance, interactivc computer graphics workstations, todayannounced the Adage 6500 Family of stand-alone workstations thatallows users to operate in both DEC and IBM computing environments. Shipments of the new workstations will begin inDecember. OCEAN is a trademark of Adage, Inc.MicroVAX and MicroVMS are trademarks of Digital EquipmentCorporation. PATRAN is a registered trademark of PDA Engineering.ANSYS is a registered trademark of Swanson Analysis Systems, Inc.NASTRAN is a registered trademark of National Aeronautics and SpaceAdministration. CADAM and IPC are registered trademarks of CADAM INC.CATIA is a registered trademark of Dassault Systems.CAEDS is a trademark of SDRC.MSC/NASTRAN is a product of MacNeal-Schwendler Corporation. Linda Standford, 212/123-7070'SELECT t1a.RecID AS [Table1 RecID], t2a.RecID AS [Table2 RecID], 1.0 * COUNT(*) / MAX(t1b.Items) AS [Table1 %], 1.0 * COUNT(*) / MAX(t2b.Items) AS [Table2 %]FROM ( SELECT DISTINCT t1.RecID, REPLACE(REPLACE(a.Data, ',', ''), '.', '') AS Data FROM @Table1 AS t1 CROSS APPLY dbo.fnParseList(' ', t1.Announcement) AS a ) AS t1aINNER JOIN ( SELECT DISTINCT t2.RecID, REPLACE(REPLACE(a.Data, ',', ''), '.', '') AS Data FROM @Table2 AS t2 CROSS APPLY dbo.fnParseList(' ', t2.Announcement) AS a ) AS t2a ON t2a.Data = t1a.DataINNER JOIN ( SELECT t1.RecID, COUNT(DISTINCT a.Data) AS Items FROM @Table1 AS t1 CROSS APPLY dbo.fnParseList(' ', t1.Announcement) AS a GROUP BY t1.RecID ) AS t1b ON t1b.RecID = t1a.RecIDINNER JOIN ( SELECT t2.RecID, COUNT(DISTINCT a.Data) AS Items FROM @Table2 AS t2 CROSS APPLY dbo.fnParseList(' ', t2.Announcement) AS a GROUP BY t2.RecID ) AS t2b ON t2b.RecID = t2a.RecIDGROUP BY t1a.RecID, t2a.RecID[/code] E 12°55'05.25"N 56°04'39.16" |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 08:36:13
|
Have you tried using FULLTEXT INDEXING SERVICE? E 12°55'05.25"N 56°04'39.16" |
 |
|
kaduri
Starting Member
7 Posts |
Posted - 2007-11-07 : 09:14:32
|
You seem to be using a custom function "fnParseList". Can you provide the source for it?I haven't thought about FULLTEXT INDEXING SERVICE. I will look into it. |
 |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2007-11-07 : 10:03:08
|
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=76033 E 12°55'05.25"N 56°04'39.16" |
 |
|
kaduri
Starting Member
7 Posts |
Posted - 2007-11-08 : 01:54:35
|
Very nice functions! I must admit that I do not understnd the eactly how the query works, but it seems to work great.I had posted the same question also here http://www.sqlservercentral.com/Forums/Topic419526-338-1.aspx and got a suggestion to use an LCS algorithm.Thank you very much!!! |
 |
|
|
|
|
|
|