SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 General SQL Server Forums
 Script Library
 String Compare/Difference - Common Subsequences
 New Topic  Reply to Topic
 Printer Friendly
Author Previous Topic Topic Next Topic  

lazerath
Constraint Violating Yak Guru

USA
342 Posts

Posted - 07/03/2014 :  11:12:26  Show Profile  Reply with Quote
I have a database of configuration entries with multiple, slightly different copies and was looking for an easy way to strip out the differences between versions to come up with a template for each one.

I searched the internet for a string comparison / difference function, procedure or CLR function and couldn't find one for SQL Server. Instead, I ran across a reference to a paper describing an algorithm and decided, what the heck, let's try it. The algorithm I used is the O(ND) Difference Algorithm over at http://www.xmailserver.org/diff2.pdf.

There are a couple points that I should note:
1.) The larger the strings are AND the more differences that exist between the strings, the slower this will go. For most use cases, this should be OK, but don't bother comparing > 1k strings with little to no similarities.
2.) It only returns common subsequences at this point -- ideally, I would like to have it return the differences as well. The algorithm itself is meant to produce Shortest Edit Script (SES), which is what you would use to visualize differences akin to a source code comparison tool.
3.) Ideally, this would be implemented as a CLR function and that will mitigate much of the issues.
4.) For my purposes, I decided it would work better if I didn't include single character matches in the end result. I intend to update the function to include a parameter to control this, but haven't gotten to it yet.
5.) Since performance dropped off once string size increases into the thousands, I decided to limit the string sizes to 8k.

Please feel free to point me to other implementations of difference algorithms, tell me this isn't the right layer to implement this, or offer constructive feedback to improve performance or features.

First, before we get to the algorithm, you will need a number generator or source. I chose to use a technique I lifted from Jeff Moden's excellent Split function, so I am providing it here:


GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.udfNumberFast') IS NOT NULL DROP FUNCTION dbo.udfNumberFast;
GO
CREATE FUNCTION [dbo].[udfNumberFast]
(
	@TOTAL		INT
)
/*
This FUNCTION RETURNS an integer TABLE containing ALL integers
IN the range of @START_NUMBER through @END_NUMBER, inclusive.
The maximum number of rows that this FUNCTION can RETURN
IS 16777216.
*/
RETURNS TABLE 
AS
RETURN
(
--===== "Inline" CTE Driven "Tally Table" produces values from 0 up to 10,000...
     -- enough to cover NVARCHAR(4000)
  WITH E1(N) AS (
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL 
                 SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1 UNION ALL SELECT 1
                ),                          --10E+1 or 10 rows
       E2(N) AS (SELECT 1 FROM E1 a, E1 b), --10E+2 or 100 rows
       E4(N) AS (SELECT TOP (@TOTAL) 1 FROM E2 a, E2 b, E2 c, E2 d, E2 e), --10E+4 or 10,000 rows max
 cteTally(N) AS (--==== This provides the "base" CTE and limits the number of rows right up front
                     -- for both a performance gain and prevention of accidental "overruns"
                 SELECT  ROW_NUMBER() OVER (ORDER BY (SELECT NULL)) FROM E4
                )
SELECT		c.N AS Number
FROM		cteTally
AS			c
)
GO


Here is the string difference/compare function that finds the common sub-sequences between two strings:


GO
SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
IF OBJECT_ID('dbo.udfCompareStrings') IS NOT NULL DROP FUNCTION dbo.udfCompareStrings;
GO
CREATE FUNCTION [dbo].[udfCompareStrings]
(
	@String1 VARCHAR(8000), 
	@String2 VARCHAR(8000)
)
RETURNS @Output TABLE
        (
		d			SMALLINT, -- Number of non-diagonal edges to this point
		k			SMALLINT, -- Diagonal
		x			SMALLINT, -- Character X coordinate in grid
		y			SMALLINT, -- Character Y coordinate in grid
		LastK		SMALLINT, -- Breadcrumb of path
		NodeCount	SMALLINT, -- Number of characters in the diagonal
		String		VARCHAR(8000) 
		) 
AS
--	Implementation of the O(ND) Difference algorithm at:
--	http://www.xmailserver.org/diff2.pdf
--  Algorithm ported by Lazerath on 7/3/2014

BEGIN
	-------------------------------------------------------------------------------------------------------
	-- INITIALIZATION
	-------------------------------------------------------------------------------------------------------
	--SET NOCOUNT ON;
	DECLARE @DT DATETIME;
	SET @DT = GETDATE();

	DECLARE			@String1Len SMALLINT,
					@String2Len SMALLINT,
					@StringMINLen SMALLINT,
					@StringMAXLen SMALLINT,
					@k_m SMALLINT,
					@k_n SMALLINT,
					@IDX_d SMALLINT,
					@IDX_k SMALLINT,
					@x_k1 SMALLINT,
					@x_k2 SMALLINT,
					@NodeCount SMALLINT,
					@LastK SMALLINT,
					@x SMALLINT,
					@y SMALLINT;

	DECLARE			@Number			TABLE (Number SMALLINT PRIMARY KEY);
	DECLARE			@String1Chars	TABLE (x SMALLINT PRIMARY KEY, VALUE CHAR(1));
	DECLARE			@String2Chars	TABLE (y SMALLINT PRIMARY KEY, VALUE CHAR(1));

	DECLARE			@EditGraph
	TABLE			(
					k SMALLINT,
					x SMALLINT, 
					y SMALLINT,
					x2 SMALLINT,
					isConnectedAfter BIT
					PRIMARY KEY(k,x)
					);

	DECLARE			@V
	TABLE			(
					k SMALLINT,				-- Starting Node 
					x SMALLINT, 				-- Starting Node 
					y SMALLINT, 				-- Starting Node
					LastK SMALLINT, 
					NodeCount SMALLINT,
					PRIMARY KEY(k)
					);

	DECLARE			@VPath
	TABLE			(
					d SMALLINT,
					k SMALLINT,				-- Starting Node 
					x SMALLINT, 				-- Starting Node 
					y SMALLINT, 				-- Starting Node 
					LastK SMALLINT, 
					NodeCount SMALLINT,
					PRIMARY KEY(d,k)
					);

	-------------------------------------------------------------------------------------------------------
	-- Begin Code
	-------------------------------------------------------------------------------------------------------

	SET @String1Len = LEN(@String1);
	SET @String2Len = LEN(@String2);
	SET @StringMINLen = CASE WHEN @String1Len < @String2Len THEN @String1Len ELSE @String2Len END;
	SET @StringMAXLen = CASE WHEN @String1Len < @String2Len THEN @String2Len ELSE @String1Len END;

	INSERT		@Number(Number)
	SELECT		n.Number
	FROM		dbo.udfNumberFast(@StringMAXLen) AS n;

	INSERT		@String1Chars(x,VALUE)
	SELECT		n.Number AS x,SUBSTRING(@String1,n.Number,1) AS VALUE
	FROM		@Number AS n
	WHERE		n.Number <= @String1Len;

	INSERT		@String2Chars(y,VALUE)
	SELECT		n.Number AS y,SUBSTRING(@String2,n.Number,1) AS VALUE
	FROM		@Number AS n
	WHERE		n.Number <= @String2Len;

	DECLARE @msg VARCHAR(100);

	SET @IDX_d = 0
	WHILE @IDX_d <= @String1Len + @String2Len
	BEGIN
		SET @k_m = CASE WHEN @StringMINLen < @String2Len-@IDX_d THEN @StringMINLen ELSE @String2Len-@IDX_d END;
		SET @k_n = CASE WHEN @StringMINLen < @String1Len-@IDX_d THEN @StringMINLen ELSE @String1Len-@IDX_d END;
		SET @k_m = CASE WHEN @k_m < 0 THEN 0 ELSE @k_m END;
		SET @k_n = CASE WHEN @k_n < 0 THEN 0 ELSE @k_n END;
	
		;WITH		cteDiagonalGridPoint
		AS
		(
		SELECT		-@IDX_d AS k, 
					n.Number AS x, 
					n.Number+@IDX_d AS y
		FROM		@Number AS n
		WHERE		n.Number <= @k_m
		UNION
		SELECT		@IDX_d AS k, 
					n.Number+@IDX_d AS x, 
					n.Number AS y
		FROM		@Number AS n
		WHERE		n.Number <= @k_n
		)
		INSERT		@EditGraph (k,x,y,x2)
		SELECT		d.k,d.x,d.y,d.x
		FROM		cteDiagonalGridPoint AS d
		JOIN		@String1Chars AS s1
		ON			d.x = s1.x
		JOIN		@String2Chars AS s2
		ON			d.y = s2.y
		WHERE		s1.Value = s2.Value;
	
		UPDATE		e1
		SET			e1.isConnectedAfter = 0
		FROM		@EditGraph AS e1
		WHERE		e1.k IN(-@IDX_d,@IDX_d)
		AND NOT EXISTS
					(
					SELECT * 
					FROM @EditGraph AS e2
					WHERE e1.k = e2.k AND e1.x = e2.x-1
					);

		UPDATE		e1
		SET			e1.x2
		=			(
		SELECT		MIN(e2.x) AS x2
		FROM		@EditGraph AS e2
		WHERE		e1.k = e2.k
		AND			e1.x <= e2.x
		AND			e2.isConnectedAfter = 0
					)
		FROM		@EditGraph AS e1
		WHERE		e1.k IN(-@IDX_d,@IDX_d);

		-- DELETE nodes at the end of a tunnel that only span a single character
		DELETE		g
		FROM		@EditGraph AS g
		WHERE		g.x = g.x2
		AND			g.isConnectedAfter=0;

		INSERT		@v(k,x,y,LastK,NodeCount)
		SELECT		-@IDX_d,0,0,0,0
		UNION
		SELECT		+@IDX_d,0,0,0,0

		SET @IDX_k = -@IDX_d;
		WHILE (@IDX_k <= @IDX_d)
		BEGIN
			SET @x_k1=COALESCE((SELECT v.x FROM @v AS v WHERE v.k=@IDX_k-1),0);
			SET @x_k2=COALESCE((SELECT v.x FROM @v AS v WHERE v.k=@IDX_k+1),0);

			-- Am I at the beginning of the loop? If so, take the k+1 diagonal since the k-1 doesn't exist
			-- If I am not at the end and the k+1 diagonal is a farther path, use that, otherwise, use the k-1
			IF (@IDX_k=-@IDX_d OR (@IDX_k < @IDX_d AND @x_k1 < @x_k2))
			BEGIN
				SET @x =  @x_k2; -- Vertical Edge
				SET @LastK = @IDX_k+1;
			END
			ELSE
			BEGIN
				SET @x =  @x_k1+1; -- Horizontal Edge
				SET @LastK = @IDX_k-1;
			END
		
			SET @y = @x - @IDX_k;
			SET @NodeCount = 0;

			-- Follow the snake (if one exists)!
			SELECT		@x = e.x2,
						@y = e.x2-@IDX_k,
						@NodeCount = e.x2-e.x+1
			FROM		@EditGraph AS e
			WHERE		e.k = @IDX_k 
			AND			e.x = @x+1;

			-- Update the current status of my path progress on this diagonal (k)
			UPDATE		v
			SET			x = @x,
						y = @y,
						LastK=@LastK,
						NodeCount = @NodeCount
			FROM		@v AS v
			WHERE		k = @IDX_k;

			-- Save my state so I can come back later and get the full path of the SES
			INSERT @VPath SELECT @IDX_d,* FROM @v WHERE k=@IDX_k;

			-- Check if I am at the end or I have somehow run out of d values
			IF @x >= @String1Len AND @y >= @String2Len OR @IDX_d > @String1Len + @String2Len
			BEGIN
				GOTO END_LOOP;
			END

			SET @IDX_k = @IDX_k + 2;
		END

		SET @IDX_d = @IDX_d +1;
	END

	END_LOOP:

	-- Go backwards to eliminate all path entries that didn't contribute to the SES
	WHILE (@IDX_d > 0)
	BEGIN

		-- Delete all path entries that weren't part of the SES
		DELETE		p
		FROM		@VPath AS p
		WHERE		p.d = @IDX_d
		AND	 NOT	(p.k = @IDX_k);

		-- Determine the last K that helped me get to this D (it identifies the path of the SES)
		SELECT		@IDX_k = p.LastK
		FROM		@VPath AS p
		WHERE		p.d = @IDX_d
		AND			p.k = @IDX_k;

		SET @IDX_d = @IDX_d-1;
	END

	INSERT		@Output
	SELECT		v.d,
				v.k,
				v.x,
				v.y,
				v.LastK,
				v.NodeCount,
				SUBSTRING(@String1,v.x-v.NodeCount+1,v.NodeCount) AS SubSequence
	FROM		@VPath AS v
	WHERE		v.NodeCount > 0
	ORDER BY	v.d,
				v.k,
				v.x;

	RETURN;
END
GO


Edited by - lazerath on 07/03/2014 13:51:38

SwePeso
Patron Saint of Lost Yaks

Sweden
30281 Posts

Posted - 07/03/2014 :  13:08:58  Show Profile  Visit SwePeso's Homepage  Reply with Quote
I'm probably dull today, but doesn't this give the very same result?
DECLARE	@String1 VARCHAR(8000) = 'Aardvaark',
	@String2 VARCHAR(8000) = 'aard';

WITH cteString1(Pos, chr)
AS (
	SELECT	Number,
		SUBSTRING(@String1, Number, 1)
	FROM	dbo.udfNumberFast(LEN(@String1))
), cteString2(Pos, chr)
AS (
	SELECT	Number,
		SUBSTRING(@String2, Number, 1)
	FROM	dbo.udfNumberFast(LEN(@String2))
), cteSource(pos1, pos2)
AS (
	SELECT 		s1.pos AS po1,
			s2.pos AS pos2
	FROM		cteString1 AS s1
	INNER JOIN	cteString2 AS s2 ON s2.chr = s1.chr
	GROUP BY	s1.pos,
			s2.pos
)
SELECT		MIN(pos1) AS FromPos,
		MAX(pos1) AS ToPos,
		COUNT(*) AS Characters,
		SUBSTRING(@String1, MIN(pos1), COUNT(*)) AS Letters
FROM		cteSource
GROUP BY	pos1 - pos2
HAVING		COUNT(*) >= 2
ORDER BY	COUNT(*) DESC



Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

lazerath
Constraint Violating Yak Guru

USA
342 Posts

Posted - 07/03/2014 :  13:37:36  Show Profile  Reply with Quote
No, very different. The differences are visible with your example, however become far starker given other examples:



/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	4	4	Aard
6	8	3	aar
--*/

DECLARE	@String1 VARCHAR(8000) = 'Aardvaark',
	@String2 VARCHAR(8000) = 'aard';

SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	4	4	1	4	Aard
--*/


Try this:


DECLARE	@String1 VARCHAR(8000),
	@String2 VARCHAR(8000);

SET @String1 = 'Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.';

SET @String2 = 'Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect. 

You can’t search code for the use of a column
That’s an interesting one. Say I’ve got a USERS table with a Title column.';

/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	145	145	Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.
3	133	27	lumns are returned in an un
13	136	20	returned in an unkno
10	133	20	re returned in an un
12	126	18	 returned in an un
1	114	18	Columns are return
13	144	17	returned in an un
4	91	17	umns are returned
3	122	16	lumns are return
8	127	16	 are returned in
6	142	16	ns are returned 
8	138	16	 are returned in
2	120	16	olumns are retur

... RESULTS REDUCED FOR BREVITY ...

122	132	2	or
9	22	2	ar
12	24	2	 r
106	123	2	nd
138	139	2	co
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	145	145	1	145	Columns are returned in an unknown order
The fear is that an application will contain a bug when it depends on a column order that is incorrect.
--*/


The function dbo.udfCompareStrings correctly determines that both strings have the same sequence and only lists that sequence once in the final result. Try something like this to:



DECLARE	@String1 VARCHAR(8000),
	@String2 VARCHAR(8000);

SET @String1 = REPLICATE('<TESTURL>http://www.hgmailh.com</TESTURL> ',3);
SET @String2 = REPLICATE('<TESTURL>http://www.gmail.com</TESTURL>',3);

/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	107	24	<TESTURL>http://www.hgma
5	86	22	TURL>http://www.hgmail
11	104	22	ttp://www.hgmailh.com<
5	62	21	TURL>http://www.hgmai
1	44	21	<TESTURL>http://www.h
43	86	21	<TESTURL>http://www.h
37	118	21	TURL> <TESTURL>http:/
47	104	21	TURL>http://www.hgmai
1	20	20	<TESTURL>http://www.
19	76	20	w.hgmailh.com</TESTU
19	76	20	w.hgmailh.com</TESTU
61	118	20	w.hgmailh.com</TESTU
85	104	20	<TESTURL>http://www.
28	125	17	.com</TESTURL> <T
37	83	16	TURL> <TESTURL>h
79	125	16	TURL> <TESTURL>h
103	125	15	w.hgmailh.com</
19	41	15	w.hgmailh.com</
40	115	12	L> <TESTURL>
2	53	12	TESTURL>http
5	86	12	TURL>http://
44	95	12	TESTURL>http
37	118	12	TURL> <TESTU
5	76	11	TURL>http:/

... RESULTS REDUCED FOR BREVITY ...

82	89	2	L>
16	95	2	/w
63	112	2	hg
37	57	2	TU
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	20	20	1	20	<TESTURL>http://www.
1	1	26	25	0	5	gmail
2	2	41	39	1	14	.com</TESTURL>
3	3	62	59	2	20	<TESTURL>http://www.
4	4	68	64	3	5	gmail
5	5	83	78	4	14	.com</TESTURL>
6	6	104	98	5	20	<TESTURL>http://www.
7	7	110	103	6	5	gmail
8	8	125	117	7	14	.com</TESTURL>
--*/


The algorithm is able to correctly line up the two very similar strings, with repeated segments, in order to find and eliminate the minor differences properly.

Edited by - lazerath on 07/03/2014 17:12:32
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

Sweden
30281 Posts

Posted - 07/03/2014 :  15:33:23  Show Profile  Visit SwePeso's Homepage  Reply with Quote
So the difference is a TOP(1) ?


Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

lazerath
Constraint Violating Yak Guru

USA
342 Posts

Posted - 07/03/2014 :  16:27:16  Show Profile  Reply with Quote
No, that is not correct, TOP(1) would only give you the the longest common sub-sequence. Take a look at the last example... selecting TOP 1 with your function would not produce anything close to the correct result. In fact, that would yield "<TESTURL>http://www.hgma", which itself is not even common to both strings, thereby even failing the goal of achieving longest common sub-sequence. Even if you modified your method to only select the longest common sub-sequence between two strings, that is not what this function aims to achieve.

Consider the problem I outlined. If you want to take a configuration entry for a webservice from your development environment and compare it against the configuration entry from production in order to determine a general template that you could then use to create configuration entries for QA, UAT, etc...


DECLARE	@String1 VARCHAR(8000),@String2 VARCHAR(8000);

SET @String1 = '<connectionString>'+
					'<url>https://applications.companyname.com/APP1</url>'+
					'<proxy>http://proxy.companyname.com:5555</proxy>' +
					'<licenseKey>570038F0-739D-43EA-A105-FCA09A1466E9</licenseKey>' +
					'<username>15E10D713A9F4BE782954FEE10D0CD40</username>' +
					'<password>A67B5FEFD56E4D2A97A3B4CA28B583C8</password>' +
				'</connectionString>';
SET @String2 = '<connectionString>' + 
					'<url>https://VXABC123SERVER/APP1</url>' + 
					'<proxy>http://VXABC123SERVER:5555</proxy>' + 
					'<licenseKey>EC131372-1A3B-C964-80F7-69FE8B6991C1</licenseKey>' + 
					'<username>08EDFB1479DA42CAA1190B686AFFDAF1</username>' + 
					'<password>CBB30AF546B34A5499CFAB2C276B8C0A</password>' + 
				'</connectionString>';


/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
36	304	114	ications.companyname.com/APP1</url><proxy>http://proxy.companyname.com:5555</proxy><licenseKey>570038F0-739D-43EA-
39	290	38	tions.companyname.com/APP1</url><proxy
1	235	35	<connectionString><url>https://appl
65	285	27	</url><proxy>http://proxy.c
27	249	24	ps://applications.compan
12	229	21	String><url>https://a
28	297	21	s://applications.comp
39	302	19	tions.companyname.c
39	266	18	tions.companyname.
288	304	17	connectionString>
92	298	17	ompanyname.com:55
14	237	17	ring><url>https:/

... RESULTS REDUCED FOR BREVITY ...

228	287	2	na
31	52	2	/a
283	294	2	rd
45	68	2	co
7	36	2	ct
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
0	0	31	31	1	31	<connectionString><url>https://
42	14	84	70	15	25	/APP1</url><proxy>http://
77	21	130	109	22	25	:5555</proxy><licenseKey>
149	21	189	168	22	23	</licenseKey><username>
213	21	242	221	22	21	</username><password>
277	21	304	283	22	30	</password></connectionString>
--*/


The result from udfCompareStrings can be concatenated together to form a working configuration entry (sans variables). All that would be needed would be to add a placeholder for the variable name that could programatically replaced.

One thing I can see I need to adjust is the threshold for matches. Right now, it will take any 2 or greater combination, however I can easily see the need for dynamic adjustment.

Edited by - lazerath on 07/03/2014 17:11:53
Go to Top of Page

lazerath
Constraint Violating Yak Guru

USA
342 Posts

Posted - 07/03/2014 :  17:04:17  Show Profile  Reply with Quote
I have reviewed the method you posted above and see where your problem lies:


SELECT		MIN(pos1) AS FromPos,
		MAX(pos1) AS ToPos,
		COUNT(*) AS Characters,
		SUBSTRING(@String1, MIN(pos1), COUNT(*)) AS Letters
FROM		cteSource
GROUP BY	pos1 - pos2
HAVING		COUNT(*) >= 2
ORDER BY	COUNT(*) DESC


You are grouping by the k value (diagonal aka pos1-pos2) and taking the first matching character and extending your string for the total number of matching characters on that diagonal. Unfortunately, this will produce incorrect results on all but the best case scenarios. The reason being is that many strings will have non-contiguous match points. Take for example this string comparison:


DECLARE	@String1 VARCHAR(8000),@String2 VARCHAR(8000);

SET @String1 = 'aaaaaBccccccccDeeeeeeee';
SET @String2 = 'aaaaaaFcccccccGeeeeeee';


/* -- Result from swepeso's function
FromPos	ToPos	Characters	Letters
1	22	19	aaaaaBccccccccDeeee
1	21	18	aaaaaBccccccccDeee
2	23	17	aaaaBccccccccDeee
1	20	15	aaaaaBccccccccD
3	23	14	aaaBccccccccDe
1	19	12	aaaaaBcccccc
4	23	11	aaBcccccccc
1	18	9	aaaaaBccc
5	23	8	aBcccccc
1	17	6	aaaaaB
13	23	5	ccDee
7	16	3	ccc
14	23	3	cDe
--*/


SELECT *
FROM dbo.udfCompareStrings
(
@String1,
@String2
)

/* -- Result from udfCompareStrings
d	k	x	y	LastK	NodeCount	String
0	0	5	5	1	5	aaaaa
3	-1	13	14	0	7	ccccccc
6	0	22	22	1	7	eeeeeee
--*/


Only a single result returned from your method is common across both strings, 'ccc', and it is not even the complete sub sequence.

Edited by - lazerath on 07/03/2014 17:08:48
Go to Top of Page

SwePeso
Patron Saint of Lost Yaks

Sweden
30281 Posts

Posted - 07/03/2014 :  19:59:40  Show Profile  Visit SwePeso's Homepage  Reply with Quote
Ok, I spent 30 minutes on this version vs 10 minutes on the previous
DECLARE	@String1 VARCHAR(8000) = 'aaaaaBccccccccDeeeeeeee',
	@String2 VARCHAR(8000) = 'aaaaaaFcccccccGeeeeeee';

WITH cteSource(FromPosition, ToPosition, Characters)
AS (
	SELECT		s.Number AS FromPosition,
			s.Number + MAX(l.Number) - 1 AS ToPosition,
			MAX(l.Number) AS Characters
	FROM		dbo.udfNumberFast(LEN(@String1)) AS s
	CROSS APPLY	dbo.udfNumberFast(LEN(@String1) - s.Number + 1) AS l
	WHERE		@String2 LIKE '%' + SUBSTRING(@String1, s.Number, l.Number) + '%'
	GROUP BY	s.Number
), cteDisplay(FromPosition, ToPosition, Characters, Expression, rnk)
AS (
	SELECT		MIN(FromPosition) AS FromPosition,
			ToPosition,
			MAX(Characters) AS Characters,
			SUBSTRING(@String1, MIN(FromPosition), MAX(Characters)) AS Expression,
			DENSE_RANK() OVER (PARTITION BY SUBSTRING(@String1, MIN(FromPosition), MAX(Characters)) ORDER BY MIN(FromPosition)) AS rnk
	FROM		cteSource
	GROUP BY	ToPosition
)
SELECT		FromPosition,
		ToPosition,
		Characters,
		Expression
FROM		cteDisplay
WHERE		rnk = 1
ORDER BY	Characters DESC,
		FromPosition;



Microsoft SQL Server MVP, MCT, MCSE, MCSA, MCP, MCITP, MCTS, MCDBA
Go to Top of Page

lazerath
Constraint Violating Yak Guru

USA
342 Posts

Posted - 08/13/2014 :  18:01:49  Show Profile  Reply with Quote
@SwePeso: Thank you for your input into this. Your solution is very clever and a very neat technique. I apologize for the delay in responding, but my wife and I had a child and I have been very busy.

That said, it is not perfect. Unfortunately, this method does not serve the same purpose.

  • As constructed, it does not work with repeating segments due to the grouping. This is a requirement of such string comparison. Compare the output with the repeated gmail example by comparing REPLICATE('<TESTURL>http://www.hgmailh.com</TESTURL> ',3) against REPLICATE('<TESTURL>http://www.gmail.com</TESTURL>',3)

  • Even if this is restructured to avoid the grouping, this method has problems with "refinding" smaller sequences that have already been found. Try comparing 'ThisIsAString,ThisAbsolutelyIsHis' against 'ThisIsAString'



Again, I really appreciate the attempt.

I have some optimizations and updates to the original PoC function that makes the whole thing much cleaner and faster. I will try to post tomorrow.
Go to Top of Page
  Previous Topic Topic Next Topic  
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.16 seconds. Powered By: Snitz Forums 2000