String Compare/Difference - Common Subsequences

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

Script Library

String Compare/Difference - Common Subsequences

Author

Topic

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 11:12:26

I have a database of configuration entries with multiple, slightly different copies and was looking for an easy way to strip out the differences between versions to come up with a template for each one.

I searched the internet for a string comparison / difference function, procedure or CLR function and couldn't find one for SQL Server. Instead, I ran across a reference to a paper describing an algorithm and decided, what the heck, let's try it. The algorithm I used is the O(ND) Difference Algorithm over at http://www.xmailserver.org/diff2.pdf.

There are a couple points that I should note:
1.) The larger the strings are AND the more differences that exist between the strings, the slower this will go. For most use cases, this should be OK, but don't bother comparing > 1k strings with little to no similarities.
2.) It only returns common subsequences at this point -- ideally, I would like to have it return the differences as well. The algorithm itself is meant to produce Shortest Edit Script (SES), which is what you would use to visualize differences akin to a source code comparison tool.
3.) Ideally, this would be implemented as a CLR function and that will mitigate much of the issues.
4.) For my purposes, I decided it would work better if I didn't include single character matches in the end result. I intend to update the function to include a parameter to control this, but haven't gotten to it yet.
5.) Since performance dropped off once string size increases into the thousands, I decided to limit the string sizes to 8k.

Please feel free to point me to other implementations of difference algorithms, tell me this isn't the right layer to implement this, or offer constructive feedback to improve performance or features.

First, before we get to the algorithm, you will need a number generator or source. I chose to use a technique I lifted from Jeff Moden's excellent Split function, so I am providing it here:

[CODE]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.udfNumberFast') IS NOT NULL DROP FUNCTION dbo.udfNumberFast;
GO
CREATE FUNCTION [dbo].[udfNumberFast]
(
@TOTAL INT
)
/*
This FUNCTION RETURNS an integer TABLE containing ALL integers
IN the range of @START_NUMBER through @END_NUMBER, inclusive.
The maximum number of rows that this FUNCTION can RETURN
IS 16777216.
*/
RETURNS TABLE
AS
RETURN
(
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
-- enough to cover NVARCHAR(4000)
WITH E1(N) AS (
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL
SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
), --10E+1 or 10 rows
E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
E4(N) AS (SELECT TOP (@TOTAL) 1 FROM E2 a, E2 b, E2 c, E2 d, E2 e), --10E+4 or 10,000 rows max
cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
-- for both a performance gain and prevention of accidental "overruns"
SELECT ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
)
SELECT c.N AS Number
FROM cteTally
AS c
)
GO
[/CODE]

Here is the string difference/compare function that finds the common sub-sequences between two strings:

[CODE]
GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.udfCompareStrings') IS NOT NULL DROP FUNCTION dbo.udfCompareStrings;
GO
CREATE FUNCTION [dbo].[udfCompareStrings]
(
@String1 VARCHAR(8000),
@String2 VARCHAR(8000)
)
RETURNS @Output TABLE
(
d SMALLINT, -- Number of non-diagonal edges to this point
k SMALLINT, -- Diagonal
x SMALLINT, -- Character X coordinate in grid
y SMALLINT, -- Character Y coordinate in grid
LastK SMALLINT, -- Breadcrumb of path
NodeCount SMALLINT, -- Number of characters in the diagonal
String VARCHAR(8000)
)
AS
-- Implementation of the O(ND) Difference algorithm at:
-- http://www.xmailserver.org/diff2.pdf
-- Algorithm ported by Lazerath on 7/3/2014

BEGIN
-------------------------------------------------------------------------------------------------------
-- INITIALIZATION
-------------------------------------------------------------------------------------------------------
--SET NOCOUNT ON;
DECLARE @DT DATETIME;
SET @DT = GETDATE();

DECLARE @String1Len SMALLINT,
@String2Len SMALLINT,
@StringMINLen SMALLINT,
@StringMAXLen SMALLINT,
@k_m SMALLINT,
@k_n SMALLINT,
@IDX_d SMALLINT,
@IDX_k SMALLINT,
@x_k1 SMALLINT,
@x_k2 SMALLINT,
@NodeCount SMALLINT,
@LastK SMALLINT,
@x SMALLINT,
@y SMALLINT;

DECLARE @Number TABLE (Number SMALLINT PRIMARY KEY);
DECLARE @String1Chars TABLE (x SMALLINT PRIMARY KEY, VALUE CHAR(1));
DECLARE @String2Chars TABLE (y SMALLINT PRIMARY KEY, VALUE CHAR(1));

DECLARE @EditGraph
TABLE (
k SMALLINT,
x SMALLINT,
y SMALLINT,
x2 SMALLINT,
isConnectedAfter BIT
PRIMARY KEY(k,x)
);

DECLARE @V
TABLE (
k SMALLINT, -- Starting Node
x SMALLINT, -- Starting Node
y SMALLINT, -- Starting Node
LastK SMALLINT,
NodeCount SMALLINT,
PRIMARY KEY(k)
);

DECLARE @VPath
TABLE (
d SMALLINT,
k SMALLINT, -- Starting Node
x SMALLINT, -- Starting Node
y SMALLINT, -- Starting Node
LastK SMALLINT,
NodeCount SMALLINT,
PRIMARY KEY(d,k)
);

-------------------------------------------------------------------------------------------------------
-- Begin Code
-------------------------------------------------------------------------------------------------------

SET @String1Len = LEN(@String1);
SET @String2Len = LEN(@String2);
SET @StringMINLen = CASE WHEN @String1Len < @String2Len THEN @String1Len ELSE @String2Len END;
SET @StringMAXLen = CASE WHEN @String1Len < @String2Len THEN @String2Len ELSE @String1Len END;

INSERT @Number(Number)
SELECT n.Number
FROM dbo.udfNumberFast(@StringMAXLen) AS n;

INSERT @String1Chars(x,VALUE)
SELECT n.Number AS x,SUBSTRING(@String1,n.Number,1) AS VALUE
FROM @Number AS n
WHERE n.Number <= @String1Len;

INSERT @String2Chars(y,VALUE)
SELECT n.Number AS y,SUBSTRING(@String2,n.Number,1) AS VALUE
FROM @Number AS n
WHERE n.Number <= @String2Len;

DECLARE @msg VARCHAR(100);

SET @IDX_d = 0
WHILE @IDX_d <= @String1Len + @String2Len
BEGIN
SET @k_m = CASE WHEN @StringMINLen < @String2Len-@IDX_d THEN @StringMINLen ELSE @String2Len-@IDX_d END;
SET @k_n = CASE WHEN @StringMINLen < @String1Len-@IDX_d THEN @StringMINLen ELSE @String1Len-@IDX_d END;
SET @k_m = CASE WHEN @k_m < 0 THEN 0 ELSE @k_m END;
SET @k_n = CASE WHEN @k_n < 0 THEN 0 ELSE @k_n END;

;WITH cteDiagonalGridPoint
AS
(
SELECT -@IDX_d AS k,
n.Number AS x,
n.Number+@IDX_d AS y
FROM @Number AS n
WHERE n.Number <= @k_m
UNION
SELECT @IDX_d AS k,
n.Number+@IDX_d AS x,
n.Number AS y
FROM @Number AS n
WHERE n.Number <= @k_n
)
INSERT @EditGraph (k,x,y,x2)
SELECT d.k,d.x,d.y,d.x
FROM cteDiagonalGridPoint AS d
JOIN @String1Chars AS s1
ON d.x = s1.x
JOIN @String2Chars AS s2
ON d.y = s2.y
WHERE s1.Value = s2.Value;

UPDATE e1
SET e1.isConnectedAfter = 0
FROM @EditGraph AS e1
WHERE e1.k IN(-@IDX_d,@IDX_d)
AND NOT EXISTS
(
SELECT *
FROM @EditGraph AS e2
WHERE e1.k = e2.k AND e1.x = e2.x-1
);

UPDATE e1
SET e1.x2
= (
SELECT MIN(e2.x) AS x2
FROM @EditGraph AS e2
WHERE e1.k = e2.k
AND e1.x <= e2.x
AND e2.isConnectedAfter = 0
)
FROM @EditGraph AS e1
WHERE e1.k IN(-@IDX_d,@IDX_d);

-- DELETE nodes at the end of a tunnel that only span a single character
DELETE g
FROM @EditGraph AS g
WHERE g.x = g.x2
AND g.isConnectedAfter=0;

INSERT @v(k,x,y,LastK,NodeCount)
SELECT -@IDX_d,0,0,0,0
UNION
SELECT +@IDX_d,0,0,0,0

SET @IDX_k = -@IDX_d;
WHILE (@IDX_k <= @IDX_d)
BEGIN
SET @x_k1=COALESCE((SELECT v.x FROM @v AS v WHERE v.k=@IDX_k-1),0);
SET @x_k2=COALESCE((SELECT v.x FROM @v AS v WHERE v.k=@IDX_k+1),0);

-- Am I at the beginning of the loop? If so, take the k+1 diagonal since the k-1 doesn't exist
-- If I am not at the end and the k+1 diagonal is a farther path, use that, otherwise, use the k-1
IF (@IDX_k=-@IDX_d OR (@IDX_k < @IDX_d AND @x_k1 < @x_k2))
BEGIN
SET @x = @x_k2; -- Vertical Edge
SET @LastK = @IDX_k+1;
END
ELSE
BEGIN
SET @x = @x_k1+1; -- Horizontal Edge
SET @LastK = @IDX_k-1;
END

SET @y = @x - @IDX_k;
SET @NodeCount = 0;

-- Follow the snake (if one exists)!
SELECT @x = e.x2,
@y = e.x2-@IDX_k,
@NodeCount = e.x2-e.x+1
FROM @EditGraph AS e
WHERE e.k = @IDX_k
AND e.x = @x+1;

-- Update the current status of my path progress on this diagonal (k)
UPDATE v
SET x = @x,
y = @y,
LastK=@LastK,
NodeCount = @NodeCount
FROM @v AS v
WHERE k = @IDX_k;

-- Save my state so I can come back later and get the full path of the SES
INSERT @VPath SELECT @IDX_d,* FROM @v WHERE k=@IDX_k;

-- Check if I am at the end or I have somehow run out of d values
IF @x >= @String1Len AND @y >= @String2Len OR @IDX_d > @String1Len + @String2Len
BEGIN
GOTO END_LOOP;
END

SET @IDX_k = @IDX_k + 2;
END

SET @IDX_d = @IDX_d +1;
END

END_LOOP:

-- Go backwards to eliminate all path entries that didn't contribute to the SES
WHILE (@IDX_d > 0)
BEGIN

-- Delete all path entries that weren't part of the SES
DELETE p
FROM @VPath AS p
WHERE p.d = @IDX_d
AND NOT (p.k = @IDX_k);

-- Determine the last K that helped me get to this D (it identifies the path of the SES)
SELECT @IDX_k = p.LastK
FROM @VPath AS p
WHERE p.d = @IDX_d
AND p.k = @IDX_k;

SET @IDX_d = @IDX_d-1;
END

INSERT @Output
SELECT v.d,
v.k,
v.x,
v.y,
v.LastK,
v.NodeCount,
SUBSTRING(@String1,v.x-v.NodeCount+1,v.NodeCount) AS SubSequence
FROM @VPath AS v
WHERE v.NodeCount > 0
ORDER BY v.d,
v.k,
v.x;

RETURN;
END
GO
[/CODE]

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-07-03 : 13:08:58

I'm probably dull today, but doesn't this give the very same result?

DECLARE	@String1 VARCHAR(8000) = 'Aardvaark',
	@String2 VARCHAR(8000) = 'aard';

WITH cteString1(Pos, chr)
AS (
	SELECT	Number,
		SUBSTRING(@String1, Number, 1)
	FROM	dbo.udfNumberFast(LEN(@String1))
), cteString2(Pos, chr)
AS (
	SELECT	Number,
		SUBSTRING(@String2, Number, 1)
	FROM	dbo.udfNumberFast(LEN(@String2))
), cteSource(pos1, pos2)
AS (
	SELECT 		s1.pos AS po1,
			s2.pos AS pos2
	FROM		cteString1 AS s1
	INNER JOIN	cteString2 AS s2 ON s2.chr = s1.chr
	GROUP BY	s1.pos,
			s2.pos
)
SELECT		MIN(pos1) AS FromPos,
		MAX(pos1) AS ToPos,
		COUNT(*) AS Characters,
		SUBSTRING(@String1, MIN(pos1), COUNT(*)) AS Letters
FROM		cteSource
GROUP BY	pos1 - pos2
HAVING		COUNT(*) >= 2
ORDER BY	COUNT(*) DESC

Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 13:37:36

No, very different. The differences are visible with your example, however become far starker given other examples:



/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	4	4	Aard
6	8	3	aar
--*/

DECLARE	@String1 VARCHAR(8000) = 'Aardvaark',
	@String2 VARCHAR(8000) = 'aard';

SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	4	4	1	4	Aard
--*/

Try this:



DECLARE	@String1 VARCHAR(8000),
	@String2 VARCHAR(8000);

SET @String1 = 'Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.';

SET @String2 = 'Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect. 

You can’t search code for the use of a column
That’s an interesting one. Say I’ve got a USERS table with a Title column.';

/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	145	145	Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.
3	133	27	lumns are returned in an un
13	136	20	returned in an unkno
10	133	20	re returned in an un
12	126	18	 returned in an un
1	114	18	Columns are return
13	144	17	returned in an un
4	91	17	umns are returned
3	122	16	lumns are return
8	127	16	 are returned in
6	142	16	ns are returned 
8	138	16	 are returned in
2	120	16	olumns are retur

... RESULTS REDUCED FOR BREVITY ...

122	132	2	or
9	22	2	ar
12	24	2	 r
106	123	2	nd
138	139	2	co
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	145	145	1	145	Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.
--*/

The function dbo.udfCompareStrings correctly determines that both strings have the same sequence and only lists that sequence once in the final result. Try something like this to:



DECLARE	@String1 VARCHAR(8000),
	@String2 VARCHAR(8000);

SET @String1 = REPLICATE('<TESTURL>http://www.hgmailh.com</TESTURL> ',3);
SET @String2 = REPLICATE('<TESTURL>http://www.gmail.com</TESTURL>',3);

/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	107	24	<TESTURL>http://www.hgma
5	86	22	TURL>http://www.hgmail
11	104	22	ttp://www.hgmailh.com<
5	62	21	TURL>http://www.hgmai
1	44	21	<TESTURL>http://www.h
43	86	21	<TESTURL>http://www.h
37	118	21	TURL> <TESTURL>http:/
47	104	21	TURL>http://www.hgmai
1	20	20	<TESTURL>http://www.
19	76	20	w.hgmailh.com</TESTU
19	76	20	w.hgmailh.com</TESTU
61	118	20	w.hgmailh.com</TESTU
85	104	20	<TESTURL>http://www.
28	125	17	.com</TESTURL> <T
37	83	16	TURL> <TESTURL>h
79	125	16	TURL> <TESTURL>h
103	125	15	w.hgmailh.com</
19	41	15	w.hgmailh.com</
40	115	12	L> <TESTURL>
2	53	12	TESTURL>http
5	86	12	TURL>http://
44	95	12	TESTURL>http
37	118	12	TURL> <TESTU
5	76	11	TURL>http:/

... RESULTS REDUCED FOR BREVITY ...

82	89	2	L>
16	95	2	/w
63	112	2	hg
37	57	2	TU
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	20	20	1	20	<TESTURL>http://www.
1	1	26	25	0	5	gmail
2	2	41	39	1	14	.com</TESTURL>
3	3	62	59	2	20	<TESTURL>http://www.
4	4	68	64	3	5	gmail
5	5	83	78	4	14	.com</TESTURL>
6	6	104	98	5	20	<TESTURL>http://www.
7	7	110	103	6	5	gmail
8	8	125	117	7	14	.com</TESTURL>
--*/

The algorithm is able to correctly line up the two very similar strings, with repeated segments, in order to find and eliminate the minor differences properly.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-07-03 : 15:33:23

So the difference is a TOP(1) ?

Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 16:27:16

No, that is not correct, TOP(1) would only give you the the longest common sub-sequence. Take a look at the last example... selecting TOP 1 with your function would not produce anything close to the correct result. In fact, that would yield "<TESTURL>http://www.hgma", which itself is not even common to both strings, thereby even failing the goal of achieving longest common sub-sequence. Even if you modified your method to only select the longest common sub-sequence between two strings, that is not what this function aims to achieve.

Consider the problem I outlined. If you want to take a configuration entry for a webservice from your development environment and compare it against the configuration entry from production in order to determine a general template that you could then use to create configuration entries for QA, UAT, etc...


DECLARE	@String1 VARCHAR(8000),@String2 VARCHAR(8000);

SET @String1 = '<connectionString>'+
					'<url>https://applications.companyname.com/APP1</url>'+
					'<proxy>http://proxy.companyname.com:5555</proxy>' +
					'<licenseKey>570038F0-739D-43EA-A105-FCA09A1466E9</licenseKey>' +
					'<username>15E10D713A9F4BE782954FEE10D0CD40</username>' +
					'<password>A67B5FEFD56E4D2A97A3B4CA28B583C8</password>' +
				'</connectionString>';
SET @String2 = '<connectionString>' + 
					'<url>https://VXABC123SERVER/APP1</url>' + 
					'<proxy>http://VXABC123SERVER:5555</proxy>' + 
					'<licenseKey>EC131372-1A3B-C964-80F7-69FE8B6991C1</licenseKey>' + 
					'<username>08EDFB1479DA42CAA1190B686AFFDAF1</username>' + 
					'<password>CBB30AF546B34A5499CFAB2C276B8C0A</password>' + 
				'</connectionString>';


/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
36	304	114	ications.companyname.com/APP1</url><proxy>http://proxy.companyname.com:5555</proxy><licenseKey>570038F0-739D-43EA-
39	290	38	tions.companyname.com/APP1</url><proxy
1	235	35	<connectionString><url>https://appl
65	285	27	</url><proxy>http://proxy.c
27	249	24	ps://applications.compan
12	229	21	String><url>https://a
28	297	21	s://applications.comp
39	302	19	tions.companyname.c
39	266	18	tions.companyname.
288	304	17	connectionString>
92	298	17	ompanyname.com:55
14	237	17	ring><url>https:/

... RESULTS REDUCED FOR BREVITY ...

228	287	2	na
31	52	2	/a
283	294	2	rd
45	68	2	co
7	36	2	ct
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
0	0	31	31	1	31	<connectionString><url>https://
42	14	84	70	15	25	/APP1</url><proxy>http://
77	21	130	109	22	25	:5555</proxy><licenseKey>
149	21	189	168	22	23	</licenseKey><username>
213	21	242	221	22	21	</username><password>
277	21	304	283	22	30	</password></connectionString>
--*/

The result from udfCompareStrings can be concatenated together to form a working configuration entry (sans variables). All that would be needed would be to add a placeholder for the variable name that could programatically replaced.

One thing I can see I need to adjust is the threshold for matches. Right now, it will take any 2 or greater combination, however I can easily see the need for dynamic adjustment.

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-07-03 : 17:04:17

I have reviewed the method you posted above and see where your problem lies:


SELECT		MIN(pos1) AS FromPos,
		MAX(pos1) AS ToPos,
		COUNT(*) AS Characters,
		SUBSTRING(@String1, MIN(pos1), COUNT(*)) AS Letters
FROM		cteSource
GROUP BY	pos1 - pos2
HAVING		COUNT(*) >= 2
ORDER BY	COUNT(*) DESC

You are grouping by the k value (diagonal aka pos1-pos2) and taking the first matching character and extending your string for the total number of matching characters on that diagonal. Unfortunately, this will produce incorrect results on all but the best case scenarios. The reason being is that many strings will have non-contiguous match points. Take for example this string comparison:


DECLARE	@String1 VARCHAR(8000),@String2 VARCHAR(8000);

SET @String1 = 'aaaaaBccccccccDeeeeeeee';
SET @String2 = 'aaaaaaFcccccccGeeeeeee';


/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	22	19	aaaaaBccccccccDeeee
1	21	18	aaaaaBccccccccDeee
2	23	17	aaaaBccccccccDeee
1	20	15	aaaaaBccccccccD
3	23	14	aaaBccccccccDe
1	19	12	aaaaaBcccccc
4	23	11	aaBcccccccc
1	18	9	aaaaaBccc
5	23	8	aBcccccc
1	17	6	aaaaaB
13	23	5	ccDee
7	16	3	ccc
14	23	3	cDe
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	5	5	1	5	aaaaa
3	-1	13	14	0	7	ccccccc
6	0	22	22	1	7	eeeeeee
--*/

Only a single result returned from your method is common across both strings, 'ccc', and it is not even the complete sub sequence.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2014-07-03 : 19:59:40

Ok, I spent 30 minutes on this version vs 10 minutes on the previous

DECLARE	@String1 VARCHAR(8000) = 'aaaaaBccccccccDeeeeeeee',
	@String2 VARCHAR(8000) = 'aaaaaaFcccccccGeeeeeee';

WITH cteSource(FromPosition, ToPosition, Characters)
AS (
	SELECT		s.Number AS FromPosition,
			s.Number + MAX(l.Number) - 1 AS ToPosition,
			MAX(l.Number) AS Characters
	FROM		dbo.udfNumberFast(LEN(@String1)) AS s
	CROSS APPLY	dbo.udfNumberFast(LEN(@String1) - s.Number + 1) AS l
	WHERE		@String2 LIKE '%' + SUBSTRING(@String1, s.Number, l.Number) + '%'
	GROUP BY	s.Number
), cteDisplay(FromPosition, ToPosition, Characters, Expression, rnk)
AS (
	SELECT		MIN(FromPosition) AS FromPosition,
			ToPosition,
			MAX(Characters) AS Characters,
			SUBSTRING(@String1, MIN(FromPosition), MAX(Characters)) AS Expression,
			DENSE_RANK() OVER (PARTITION BY SUBSTRING(@String1, MIN(FromPosition), MAX(Characters)) ORDER BY MIN(FromPosition)) AS rnk
	FROM		cteSource
	GROUP BY	ToPosition
)
SELECT		FromPosition,
		ToPosition,
		Characters,
		Expression
FROM		cteDisplay
WHERE		rnk = 1
ORDER BY	Characters DESC,
		FromPosition;

Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA

lazerath
Constraint Violating Yak Guru

343 Posts

Posted - 2014-08-13 : 18:01:49

@SwePeso: Thank you for your input into this. Your solution is very clever and a very neat technique. I apologize for the delay in responding, but my wife and I had a child and I have been very busy.

That said, it is not perfect. Unfortunately, this method does not serve the same purpose.

As constructed, it does not work with repeating segments due to the grouping. This is a requirement of such string comparison. Compare the output with the repeated gmail example by comparing REPLICATE('<TESTURL>http://www.hgmailh.com</TESTURL> ',3) against REPLICATE('<TESTURL>http://www.gmail.com</TESTURL>',3)

Even if this is restructured to avoid the grouping, this method has problems with "refinding" smaller sequences that have already been found. Try comparing 'ThisIsAString,ThisAbsolutelyIsHis' against 'ThisIsAString'

Again, I really appreciate the attempt.

I have some optimizations and updates to the original PoC function that makes the whole thing much cleaner and faster. I will try to post tomorrow.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources