Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Script Library
 String Compare/Difference - Common Subsequences

Author  Topic 

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 11:12:26
I have a database of configuration entries with multiple, slightly different copies and was looking for an easy way to strip out the differences between versions to come up with a template for each one.

I searched the internet for a string comparison / difference function, procedure or CLR function and couldn't find one for SQL Server. Instead, I ran across a reference to a paper describing an algorithm and decided, what the heck, let's try it. The algorithm I used is the O(ND) Difference Algorithm over at http://www.xmailserver.org/diff2.pdf.

There are a couple points that I should note:
1.) The larger the strings are AND the more differences that exist between the strings, the slower this will go. For most use cases, this should be OK, but don't bother comparing > 1k strings with little to no similarities.
2.) It only returns common subsequences at this point -- ideally, I would like to have it return the differences as well. The algorithm itself is meant to produce Shortest Edit Script (SES), which is what you would use to visualize differences akin to a source code comparison tool.
3.) Ideally, this would be implemented as a CLR function and that will mitigate much of the issues.
4.) For my purposes, I decided it would work better if I didn't include single character matches in the end result. I intend to update the function to include a parameter to control this, but haven't gotten to it yet.
5.) Since performance dropped off once string size increases into the thousands, I decided to limit the string sizes to 8k.

Please feel free to point me to other implementations of difference algorithms, tell me this isn't the right layer to implement this, or offer constructive feedback to improve performance or features.

First, before we get to the algorithm, you will need a number generator or source. I chose to use a technique I lifted from Jeff Moden's excellent Split function, so I am providing it here:

[CODE]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.udfNumberFast') IS NOT NULL DROP FUNCTION dbo.udfNumberFast;
GO
CREATE FUNCTION [dbo].[udfNumberFast]
(
@TOTAL INT
)
/*
This FUNCTION RETURNS an integer TABLE containing ALL integers
IN the range of @START_NUMBER through @END_NUMBER, inclusive.
The maximum number of rows that this FUNCTION can RETURN
IS 16777216.
*/
RETURNS TABLE
AS
RETURN
(
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT TOP (@TOTAL) 1 FROM E2 a, E2 b, E2 c, E2 d, E2 e), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
SELECT c.N AS Number
FROM cteTally
AS c
)
GO
[/CODE]

Here is the string difference/compare function that finds the common sub-sequences between two strings:

[CODE]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.udfCompareStrings') IS NOT NULL DROP FUNCTION dbo.udfCompareStrings;
GO
CREATE FUNCTION [dbo].[udfCompareStrings]
(
@String1 VARCHAR(8000),
@String2 VARCHAR(8000)
)
RETURNS @Output TABLE
(
d SMALLINT, -- Number of non-diagonal edges to this point
k SMALLINT, -- Diagonal
x SMALLINT, -- Character X coordinate in grid
y SMALLINT, -- Character Y coordinate in grid
LastK SMALLINT, -- Breadcrumb of path
NodeCount SMALLINT, -- Number of characters in the diagonal
String VARCHAR(8000)
)
AS
-- Implementation of the O(ND) Difference algorithm at:
-- http://www.xmailserver.org/diff2.pdf
-- Algorithm ported by Lazerath on 7/3/2014

BEGIN
-------------------------------------------------------------------------------------------------------
-- INITIALIZATION
-------------------------------------------------------------------------------------------------------
--SET NOCOUNT ON;
DECLARE @DT DATETIME;
SET @DT = GETDATE();

DECLARE @String1Len SMALLINT,
@String2Len SMALLINT,
@StringMINLen SMALLINT,
@StringMAXLen SMALLINT,
@k_m SMALLINT,
@k_n SMALLINT,
@IDX_d SMALLINT,
@IDX_k SMALLINT,
@x_k1 SMALLINT,
@x_k2 SMALLINT,
@NodeCount SMALLINT,
@LastK SMALLINT,
@x SMALLINT,
@y SMALLINT;

DECLARE @Number TABLE (Number SMALLINT PRIMARY KEY);
DECLARE @String1Chars TABLE (x SMALLINT PRIMARY KEY, VALUE CHAR(1));
DECLARE @String2Chars TABLE (y SMALLINT PRIMARY KEY, VALUE CHAR(1));

DECLARE @EditGraph
TABLE (
k SMALLINT,
x SMALLINT,
y SMALLINT,
x2 SMALLINT,
isConnectedAfter BIT
PRIMARY KEY(k,x)
);

DECLARE @V
TABLE (
k SMALLINT, -- Starting Node
x SMALLINT, -- Starting Node
y SMALLINT, -- Starting Node
LastK SMALLINT,
NodeCount SMALLINT,
PRIMARY KEY(k)
);

DECLARE @VPath
TABLE (
d SMALLINT,
k SMALLINT, -- Starting Node
x SMALLINT, -- Starting Node
y SMALLINT, -- Starting Node
LastK SMALLINT,
NodeCount SMALLINT,
PRIMARY KEY(d,k)
);

-------------------------------------------------------------------------------------------------------
-- Begin Code
-------------------------------------------------------------------------------------------------------

SET @String1Len = LEN(@String1);
SET @String2Len = LEN(@String2);
SET @StringMINLen = CASE WHEN @String1Len < @String2Len THEN @String1Len ELSE @String2Len END;
SET @StringMAXLen = CASE WHEN @String1Len < @String2Len THEN @String2Len ELSE @String1Len END;

INSERT @Number(Number)
SELECT n.Number
FROM dbo.udfNumberFast(@StringMAXLen) AS n;

INSERT @String1Chars(x,VALUE)
SELECT n.Number AS x,SUBSTRING(@String1,n.Number,1) AS VALUE
FROM @Number AS n
WHERE n.Number <= @String1Len;

INSERT @String2Chars(y,VALUE)
SELECT n.Number AS y,SUBSTRING(@String2,n.Number,1) AS VALUE
FROM @Number AS n
WHERE n.Number <= @String2Len;

DECLARE @msg VARCHAR(100);

SET @IDX_d = 0
WHILE @IDX_d <= @String1Len + @String2Len
BEGIN
SET @k_m = CASE WHEN @StringMINLen < @String2Len-@IDX_d THEN @StringMINLen ELSE @String2Len-@IDX_d END;
SET @k_n = CASE WHEN @StringMINLen < @String1Len-@IDX_d THEN @StringMINLen ELSE @String1Len-@IDX_d END;
SET @k_m = CASE WHEN @k_m < 0 THEN 0 ELSE @k_m END;
SET @k_n = CASE WHEN @k_n < 0 THEN 0 ELSE @k_n END;

;WITH cteDiagonalGridPoint
AS
(
SELECT -@IDX_d AS k,
n.Number AS x,
n.Number+@IDX_d AS y
FROM @Number AS n
WHERE n.Number <= @k_m
UNION
SELECT @IDX_d AS k,
n.Number+@IDX_d AS x,
n.Number AS y
FROM @Number AS n
WHERE n.Number <= @k_n
)
INSERT @EditGraph (k,x,y,x2)
SELECT d.k,d.x,d.y,d.x
FROM cteDiagonalGridPoint AS d
JOIN @String1Chars AS s1
ON d.x = s1.x
JOIN @String2Chars AS s2
ON d.y = s2.y
WHERE s1.Value = s2.Value;

UPDATE e1
SET e1.isConnectedAfter = 0
FROM @EditGraph AS e1
WHERE e1.k IN(-@IDX_d,@IDX_d)
AND NOT EXISTS
(
SELECT *
FROM @EditGraph AS e2
WHERE e1.k = e2.k AND e1.x = e2.x-1
);

UPDATE e1
SET e1.x2
= (
SELECT MIN(e2.x) AS x2
FROM @EditGraph AS e2
WHERE e1.k = e2.k
AND e1.x <= e2.x
AND e2.isConnectedAfter = 0
)
FROM @EditGraph AS e1
WHERE e1.k IN(-@IDX_d,@IDX_d);

-- DELETE nodes at the end of a tunnel that only span a single character
DELETE g
FROM @EditGraph AS g
WHERE g.x = g.x2
AND g.isConnectedAfter=0;

INSERT @v(k,x,y,LastK,NodeCount)
SELECT -@IDX_d,0,0,0,0
UNION
SELECT +@IDX_d,0,0,0,0

SET @IDX_k = -@IDX_d;
WHILE (@IDX_k <= @IDX_d)
BEGIN
SET @x_k1=COALESCE((SELECT v.x FROM @v AS v WHERE v.k=@IDX_k-1),0);
SET @x_k2=COALESCE((SELECT v.x FROM @v AS v WHERE v.k=@IDX_k+1),0);

-- Am I at the beginning of the loop? If so, take the k+1 diagonal since the k-1 doesn't exist
-- If I am not at the end and the k+1 diagonal is a farther path, use that, otherwise, use the k-1
IF (@IDX_k=-@IDX_d OR (@IDX_k < @IDX_d AND @x_k1 < @x_k2))
BEGIN
SET @x = @x_k2; -- Vertical Edge
SET @LastK = @IDX_k+1;
END
ELSE
BEGIN
SET @x = @x_k1+1; -- Horizontal Edge
SET @LastK = @IDX_k-1;
END

SET @y = @x - @IDX_k;
SET @NodeCount = 0;

-- Follow the snake (if one exists)!
SELECT @x = e.x2,
@y = e.x2-@IDX_k,
@NodeCount = e.x2-e.x+1
FROM @EditGraph AS e
WHERE e.k = @IDX_k
AND e.x = @x+1;

-- Update the current status of my path progress on this diagonal (k)
UPDATE v
SET x = @x,
y = @y,
LastK=@LastK,
NodeCount = @NodeCount
FROM @v AS v
WHERE k = @IDX_k;

-- Save my state so I can come back later and get the full path of the SES
INSERT @VPath SELECT @IDX_d,* FROM @v WHERE k=@IDX_k;

-- Check if I am at the end or I have somehow run out of d values
IF @x >= @String1Len AND @y >= @String2Len OR @IDX_d > @String1Len + @String2Len
BEGIN
GOTO END_LOOP;
END

SET @IDX_k = @IDX_k + 2;
END

SET @IDX_d = @IDX_d +1;
END

END_LOOP:

-- Go backwards to eliminate all path entries that didn't contribute to the SES
WHILE (@IDX_d > 0)
BEGIN

-- Delete all path entries that weren't part of the SES
DELETE p
FROM @VPath AS p
WHERE p.d = @IDX_d
AND NOT (p.k = @IDX_k);

-- Determine the last K that helped me get to this D (it identifies the path of the SES)
SELECT @IDX_k = p.LastK
FROM @VPath AS p
WHERE p.d = @IDX_d
AND p.k = @IDX_k;

SET @IDX_d = @IDX_d-1;
END

INSERT @Output
SELECT v.d,
v.k,
v.x,
v.y,
v.LastK,
v.NodeCount,
SUBSTRING(@String1,v.x-v.NodeCount+1,v.NodeCount) AS SubSequence
FROM @VPath AS v
WHERE v.NodeCount > 0
ORDER BY v.d,
v.k,
v.x;

RETURN;
END
GO
[/CODE]

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-07-03 : 13:08:58
I'm probably dull today, but doesn't this give the very same result?
DECLARE	@String1 VARCHAR(8000) = 'Aardvaark',
@String2 VARCHAR(8000) = 'aard';

WITH cteString1(Pos, chr)
AS (
SELECT Number,
SUBSTRING(@String1, Number, 1)
FROM dbo.udfNumberFast(LEN(@String1))
), cteString2(Pos, chr)
AS (
SELECT Number,
SUBSTRING(@String2, Number, 1)
FROM dbo.udfNumberFast(LEN(@String2))
), cteSource(pos1, pos2)
AS (
SELECT s1.pos AS po1,
s2.pos AS pos2
FROM cteString1 AS s1
INNER JOIN cteString2 AS s2 ON s2.chr = s1.chr
GROUP BY s1.pos,
s2.pos
)
SELECT MIN(pos1) AS FromPos,
MAX(pos1) AS ToPos,
COUNT(*) AS Characters,
SUBSTRING(@String1, MIN(pos1), COUNT(*)) AS Letters
FROM cteSource
GROUP BY pos1 - pos2
HAVING COUNT(*) >= 2
ORDER BY COUNT(*) DESC



Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 13:37:36
No, very different. The differences are visible with your example, however become far starker given other examples:



/* -- Result from swepeso's function
FromPos ToPos Characters Letters
1 4 4 Aard
6 8 3 aar
--*/

DECLARE @String1 VARCHAR(8000) = 'Aardvaark',
@String2 VARCHAR(8000) = 'aard';

SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d k x y LastK NodeCount String
0 0 4 4 1 4 Aard
--*/


Try this:


DECLARE @String1 VARCHAR(8000),
@String2 VARCHAR(8000);

SET @String1 = 'Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.';

SET @String2 = 'Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.

You can’t search code for the use of a column
That’s an interesting one. Say I’ve got a USERS table with a Title column.';

/* -- Result from swepeso's function
FromPos ToPos Characters Letters
1 145 145 Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.
3 133 27 lumns are returned in an un
13 136 20 returned in an unkno
10 133 20 re returned in an un
12 126 18 returned in an un
1 114 18 Columns are return
13 144 17 returned in an un
4 91 17 umns are returned
3 122 16 lumns are return
8 127 16 are returned in
6 142 16 ns are returned
8 138 16 are returned in
2 120 16 olumns are retur

... RESULTS REDUCED FOR BREVITY ...

122 132 2 or
9 22 2 ar
12 24 2 r
106 123 2 nd
138 139 2 co
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d k x y LastK NodeCount String
0 0 145 145 1 145 Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.
--*/


The function dbo.udfCompareStrings correctly determines that both strings have the same sequence and only lists that sequence once in the final result. Try something like this to:



DECLARE @String1 VARCHAR(8000),
@String2 VARCHAR(8000);

SET @String1 = REPLICATE('<TESTURL>http://www.hgmailh.com</TESTURL> ',3);
SET @String2 = REPLICATE('<TESTURL>http://www.gmail.com</TESTURL>',3);

/* -- Result from swepeso's function
FromPos ToPos Characters Letters
1 107 24 <TESTURL>http://www.hgma
5 86 22 TURL>http://www.hgmail
11 104 22 ttp://www.hgmailh.com<
5 62 21 TURL>http://www.hgmai
1 44 21 <TESTURL>http://www.h
43 86 21 <TESTURL>http://www.h
37 118 21 TURL> <TESTURL>http:/
47 104 21 TURL>http://www.hgmai
1 20 20 <TESTURL>http://www.
19 76 20 w.hgmailh.com</TESTU
19 76 20 w.hgmailh.com</TESTU
61 118 20 w.hgmailh.com</TESTU
85 104 20 <TESTURL>http://www.
28 125 17 .com</TESTURL> <T
37 83 16 TURL> <TESTURL>h
79 125 16 TURL> <TESTURL>h
103 125 15 w.hgmailh.com</
19 41 15 w.hgmailh.com</
40 115 12 L> <TESTURL>
2 53 12 TESTURL>http
5 86 12 TURL>http://
44 95 12 TESTURL>http
37 118 12 TURL> <TESTU
5 76 11 TURL>http:/

... RESULTS REDUCED FOR BREVITY ...

82 89 2 L>
16 95 2 /w
63 112 2 hg
37 57 2 TU
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d k x y LastK NodeCount String
0 0 20 20 1 20 <TESTURL>http://www.
1 1 26 25 0 5 gmail
2 2 41 39 1 14 .com</TESTURL>
3 3 62 59 2 20 <TESTURL>http://www.
4 4 68 64 3 5 gmail
5 5 83 78 4 14 .com</TESTURL>
6 6 104 98 5 20 <TESTURL>http://www.
7 7 110 103 6 5 gmail
8 8 125 117 7 14 .com</TESTURL>
--*/


The algorithm is able to correctly line up the two very similar strings, with repeated segments, in order to find and eliminate the minor differences properly.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-07-03 : 15:33:23
So the difference is a TOP(1) ?


Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 16:27:16
No, that is not correct, TOP(1) would only give you the the longest common sub-sequence. Take a look at the last example... selecting TOP 1 with your function would not produce anything close to the correct result. In fact, that would yield "<TESTURL>http://www.hgma", which itself is not even common to both strings, thereby even failing the goal of achieving longest common sub-sequence. Even if you modified your method to only select the longest common sub-sequence between two strings, that is not what this function aims to achieve.

Consider the problem I outlined. If you want to take a configuration entry for a webservice from your development environment and compare it against the configuration entry from production in order to determine a general template that you could then use to create configuration entries for QA, UAT, etc...


DECLARE @String1 VARCHAR(8000),@String2 VARCHAR(8000);

SET @String1 = '<connectionString>'+
'<url>https://applications.companyname.com/APP1</url>'+
'<proxy>http://proxy.companyname.com:5555</proxy>' +
'<licenseKey>570038F0-739D-43EA-A105-FCA09A1466E9</licenseKey>' +
'<username>15E10D713A9F4BE782954FEE10D0CD40</username>' +
'<password>A67B5FEFD56E4D2A97A3B4CA28B583C8</password>' +
'</connectionString>';
SET @String2 = '<connectionString>' +
'<url>https://VXABC123SERVER/APP1</url>' +
'<proxy>http://VXABC123SERVER:5555</proxy>' +
'<licenseKey>EC131372-1A3B-C964-80F7-69FE8B6991C1</licenseKey>' +
'<username>08EDFB1479DA42CAA1190B686AFFDAF1</username>' +
'<password>CBB30AF546B34A5499CFAB2C276B8C0A</password>' +
'</connectionString>';


/* -- Result from swepeso's function
FromPos ToPos Characters Letters
36 304 114 ications.companyname.com/APP1</url><proxy>http://proxy.companyname.com:5555</proxy><licenseKey>570038F0-739D-43EA-
39 290 38 tions.companyname.com/APP1</url><proxy
1 235 35 <connectionString><url>https://appl
65 285 27 </url><proxy>http://proxy.c
27 249 24 ps://applications.compan
12 229 21 String><url>https://a
28 297 21 s://applications.comp
39 302 19 tions.companyname.c
39 266 18 tions.companyname.
288 304 17 connectionString>
92 298 17 ompanyname.com:55
14 237 17 ring><url>https:/

... RESULTS REDUCED FOR BREVITY ...

228 287 2 na
31 52 2 /a
283 294 2 rd
45 68 2 co
7 36 2 ct
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
0 0 31 31 1 31 <connectionString><url>https://
42 14 84 70 15 25 /APP1</url><proxy>http://
77 21 130 109 22 25 :5555</proxy><licenseKey>
149 21 189 168 22 23 </licenseKey><username>
213 21 242 221 22 21 </username><password>
277 21 304 283 22 30 </password></connectionString>
--*/


The result from udfCompareStrings can be concatenated together to form a working configuration entry (sans variables). All that would be needed would be to add a placeholder for the variable name that could programatically replaced.

One thing I can see I need to adjust is the threshold for matches. Right now, it will take any 2 or greater combination, however I can easily see the need for dynamic adjustment.
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 17:04:17
I have reviewed the method you posted above and see where your problem lies:


SELECT MIN(pos1) AS FromPos,
MAX(pos1) AS ToPos,
COUNT(*) AS Characters,
SUBSTRING(@String1, MIN(pos1), COUNT(*)) AS Letters
FROM cteSource
GROUP BY pos1 - pos2
HAVING COUNT(*) >= 2
ORDER BY COUNT(*) DESC


You are grouping by the k value (diagonal aka pos1-pos2) and taking the first matching character and extending your string for the total number of matching characters on that diagonal. Unfortunately, this will produce incorrect results on all but the best case scenarios. The reason being is that many strings will have non-contiguous match points. Take for example this string comparison:


DECLARE @String1 VARCHAR(8000),@String2 VARCHAR(8000);

SET @String1 = 'aaaaaBccccccccDeeeeeeee';
SET @String2 = 'aaaaaaFcccccccGeeeeeee';


/* -- Result from swepeso's function
FromPos ToPos Characters Letters
1 22 19 aaaaaBccccccccDeeee
1 21 18 aaaaaBccccccccDeee
2 23 17 aaaaBccccccccDeee
1 20 15 aaaaaBccccccccD
3 23 14 aaaBccccccccDe
1 19 12 aaaaaBcccccc
4 23 11 aaBcccccccc
1 18 9 aaaaaBccc
5 23 8 aBcccccc
1 17 6 aaaaaB
13 23 5 ccDee
7 16 3 ccc
14 23 3 cDe
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d k x y LastK NodeCount String
0 0 5 5 1 5 aaaaa
3 -1 13 14 0 7 ccccccc
6 0 22 22 1 7 eeeeeee
--*/


Only a single result returned from your method is common across both strings, 'ccc', and it is not even the complete sub sequence.
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-07-03 : 19:59:40
Ok, I spent 30 minutes on this version vs 10 minutes on the previous
DECLARE	@String1 VARCHAR(8000) = 'aaaaaBccccccccDeeeeeeee',
@String2 VARCHAR(8000) = 'aaaaaaFcccccccGeeeeeee';

WITH cteSource(FromPosition, ToPosition, Characters)
AS (
SELECT s.Number AS FromPosition,
s.Number + MAX(l.Number) - 1 AS ToPosition,
MAX(l.Number) AS Characters
FROM dbo.udfNumberFast(LEN(@String1)) AS s
CROSS APPLY dbo.udfNumberFast(LEN(@String1) - s.Number + 1) AS l
WHERE @String2 LIKE '%' + SUBSTRING(@String1, s.Number, l.Number) + '%'
GROUP BY s.Number
), cteDisplay(FromPosition, ToPosition, Characters, Expression, rnk)
AS (
SELECT MIN(FromPosition) AS FromPosition,
ToPosition,
MAX(Characters) AS Characters,
SUBSTRING(@String1, MIN(FromPosition), MAX(Characters)) AS Expression,
DENSE_RANK() OVER (PARTITION BY SUBSTRING(@String1, MIN(FromPosition), MAX(Characters)) ORDER BY MIN(FromPosition)) AS rnk
FROM cteSource
GROUP BY ToPosition
)
SELECT FromPosition,
ToPosition,
Characters,
Expression
FROM cteDisplay
WHERE rnk = 1
ORDER BY Characters DESC,
FromPosition;



Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-08-13 : 18:01:49
@SwePeso: Thank you for your input into this. Your solution is very clever and a very neat technique. I apologize for the delay in responding, but my wife and I had a child and I have been very busy.

That said, it is not perfect. Unfortunately, this method does not serve the same purpose.

  • As constructed, it does not work with repeating segments due to the grouping. This is a requirement of such string comparison. Compare the output with the repeated gmail example by comparing REPLICATE('<TESTURL>http://www.hgmailh.com</TESTURL> ',3) against REPLICATE('<TESTURL>http://www.gmail.com</TESTURL>',3)

  • Even if this is restructured to avoid the grouping, this method has problems with "refinding" smaller sequences that have already been found. Try comparing 'ThisIsAString,ThisAbsolutelyIsHis' against 'ThisIsAString'



Again, I really appreciate the attempt.

I have some optimizations and updates to the original PoC function that makes the whole thing much cleaner and faster. I will try to post tomorrow.
Go to Top of Page
   

- Advertisement -