Please start any new threads on our new
site at http://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.

Our new SQL Server Forums are live!
Come on over! We've restricted the ability to create new threads on these forums.

quote:quote: -------------------------------------------------------------------------------- Anyway the GP's are the 'good' guys. --------------------------------------------------------------------------------

Oh, sorry, it wasn't the GPs themselves I was complaining about,

Arnold, I knew you weren't having a go at them. I just wondered how you dealt with the consultants were the data is even worse, to my understanding it's only updated every couple of years at best and much of it is wrong.

I knew about GP's in different surgeries but I hadn't realised that a new code was created for each. As I'm about to try to tidy up the data on my system for these that is a really important snippet. Maybe I'll just wait for our Brave New World!!

Many thanks

steve

(maybe my sig should say NACS instead of Brain)

Alright Brain, you don't like me, and I don't like you. But lets just do this, and I can get back to killing you with beer.

quote:Arnold, I knew you weren't having a go at them. I just wondered how you dealt with the consultants were the data is even worse, to my understanding it's only updated every couple of years at best and much of it is wrong.

To be honest, national coding of consultants isn't something we've had to worry about yet. But I'm sure in this new age of NPfIT or Connecting for Health or whatever the hell it's called this week...

quote:maybe my sig should say NACS instead of Brain

Here's a version that I created to meet my specific needs:

CREATE FUNCTION [dbo].[zfsMin3] (@a INT, @b INT, @c INT ) RETURNS INT AS

BEGIN DECLARE @Retval INT IF @a <= @b AND @a <= @c SET @Retval = @a

IF @b < @a AND @b <= @c SET @Retval = @b

IF @c < @a AND @c < @b SET @Retval = @c

RETURN @Retval

END

CREATE FUNCTION [dbo].[zfsLevenshtein] (@Source VARCHAR(2000), @Target VARCHAR(2000) ) --Retranslated by Jerry Johnson, 8/2/2006 RETURNS INT AS

BEGIN

DECLARE @Distance INT, @Cost INT DECLARE @SourceLength INT, @TargetLength INT DECLARE @SourceCounter INT, @TargetCounter INT DECLARE @Matrix TABLE (X INT, Y INT, Value INT) DECLARE @Above INT, @Left INT, @Diagonal INT

--Preprocess to remove any characters that already match DECLARE @MinLength INTEGER DECLARE @NewSource VARCHAR(50), @NewTarget VARCHAR(50)

SET @MinLength = LEN(@Source) IF LEN(@Target) < @MinLength SET @MinLength = LEN(@Target)

SET @SourceCounter = 1 SET @NewSource = '' SET @NewTarget = ''

WHILE @SourceCounter <= @MinLength BEGIN IF SUBSTRING(@Source, @SourceCounter, 1) <> SUBSTRING(@Target, @SourceCounter, 1) BEGIN SET @NewSource = @NewSource + SUBSTRING(@Source, @SourceCounter, 1) SET @NewTarget = @NewTarget + SUBSTRING(@Target, @SourceCounter, 1) END SET @SourceCounter = @SourceCounter + 1 END

SET @NewSource = @NewSource + SUBSTRING(@Source, @MinLength + 1, LEN(@Source)) SET @NewTarget = @NewTarget + SUBSTRING(@Target, @MinLength + 1, LEN(@Target)) SET @Source = @NewSource SET @Target = @NewTarget

--Step 1 SET @SourceLength = LEN(@Source) SET @TargetLength = LEN(@Target)

IF @SourceLength = 0 BEGIN SET @Distance = @TargetLength GOTO Done END

IF @TargetLength = 0 BEGIN SET @Distance = @SourceLength GOTO Done END

--Step 2 SET @SourceCounter = 1 SET @TargetCounter = 1

INSERT @Matrix VALUES(0, 0, 0)

WHILE @SourceCounter <= @SourceLength BEGIN INSERT @Matrix VALUES(@SourceCounter, 0, @SourceCounter) SET @SourceCounter = @SourceCounter + 1 END

WHILE @TargetCounter <= @TargetLength BEGIN INSERT @Matrix VALUES(0, @TargetCounter, @TargetCounter) SET @TargetCounter = @TargetCounter + 1 END

--Steps 3 and 4 SET @SourceCounter = 1 SET @TargetCounter = 1

WHILE @SourceCounter <= @SourceLength BEGIN WHILE @TargetCounter <= @TargetLength BEGIN

--Step 5 IF @SourceCounter > @TargetLength OR @TargetCounter > @SourceLength SET @Cost = 1 ELSE IF SUBSTRING(@Source, @SourceCounter, 1) = SUBSTRING(@Target, @TargetCounter, 1) SET @Cost = 0 ELSE SET @Cost = 1

--Step 6 SET @Above = (SELECT Value FROM @Matrix WHERE X = @SourceCounter AND Y = @TargetCounter - 1) SET @Left = (SELECT Value FROM @Matrix WHERE X = @SourceCounter - 1 AND Y = @TargetCounter) SET @Diagonal = (SELECT Value FROM @Matrix WHERE X = @SourceCounter - 1 AND Y = @TargetCounter - 1)

I stumbled across an inconsistency with this function. It appears that is @s1_len = 0 (i.e. if the first argument is an empty string), the function returns 0, as if it is perfect match to 2nd string.

I suggest adding this line after @s1_len is calculated:

I didn't say it was fast for big strings, just that it worked! Actually, the reason I posted it this week was that the old version used a temporary table (i,j,cost) and that was ludicrously slow even on short strings, so I rewrote it.

Any ideas on how to make it faster, though, would be handy. Though I don't think any simple changes are going to make it asymptotically better than O(m*n)

Maybe putting the algorithm in a UDA will yield better performance?

Here is a version of the Levenshtein algorithm that has a 3999-character limit and works with a one-dimensional array (without it simulating a two-dimensional array). My thanks to Arnold for showing me some tricks to use.

Some notes:

It first checks the length of each string and returns the length of the other if one is empty.

It then checks the strings for equivalence and returns zero if they are the same.

It then checks to see if one string is a sub-string of the other and returns the difference in length if such is the case.

Only after these checks have failed will it run the algorithm against the two strings.

Notice that the array length only needs to be as long as the shorter string, but both are confined to 3,999 characters to ensure that the array will be large enough. I expect you can use a nVarChar(Max) as long as neither string exceeds a length of 32,766 characters.

CREATE FUNCTION [dbo].[levenshtein](@s1 nVarChar(3999), @s2 nVarChar(3999))
RETURNS Int
AS BEGIN
DECLARE @sLeft nVarChar(3999), @sRight nVarChar(3999), @nLeftLength Int, @nRightLength Int,
@nsDistance nVarChar(4000), @nLeftPos Int, @nRightPos Int, @nMatrixDiag Int, @nMatrixSide Int, @nMatrixVert Int,
@cLeft nVarChar(1), @cRight nVarChar(1), @nCost Int, @nMatrixTemp Int;
IF (Len(@s1) = 0) RETURN Len(@s2)
ELSE IF (Len(@s2) = 0) RETURN Len(@s1)
ELSE IF (@s1 = @s2) RETURN 0
ELSE IF ((Len(@s1) < Len(@s2)) AND (CharIndex(@s1, @s2) <> 0)) RETURN Len(@s2) - Len(@s1)
ELSE IF ((Len(@s2) < Len(@s1)) AND (CharIndex(@s2, @s1) <> 0)) RETURN Len(@s1) - Len(@s2)
ELSE BEGIN
IF (Len(@s1) > Len(@s2)) BEGIN
SET @sLeft = @s1;
SET @sRight = @s2;
END ELSE BEGIN
SET @sLeft = @s2;
SET @sRight = @s1;
END;
SET @nLeftLength = Len(@sLeft);
SET @nRightLength = Len(@sRight);
SET @nsDistance = NChar(1);
SET @nRightPos = 1;
WHILE (@nRightPos <= @nRightLength) BEGIN
SET @nsDistance = @nsDistance + NChar(@nRightPos + 1);
SET @nRightPos = @nRightPos + 1;
END;
SET @nLeftPos = 1;
WHILE (@nLeftPos <= @nLeftLength) BEGIN
SET @cLeft = SubString(@sLeft, @nLeftPos, 1);
SET @nMatrixDiag = Unicode(SubString(@nsDistance, 1, 1));
SET @nsDistance = NChar(@nLeftPos + 1) + SubString(@nsDistance, 2, @nRightLength);
SET @nRightPos = 1;
WHILE (@nRightPos <= @nRightLength) BEGIN
SET @cRight = SubString(@sRight, @nRightPos, 1);
SET @nCost = CASE WHEN (@cRight = @cLeft) THEN 0 ELSE 1 END;
SET @nMatrixSide = Unicode(SubString(@nsDistance, @nRightPos, 1));
SET @nMatrixVert = Unicode(SubString(@nsDistance, @nRightPos + 1, 1));
SET @nMatrixTemp = CASE
WHEN ((@nMatrixDiag + @nCost <= @nMatrixSide + 1) AND (@nMatrixDiag + @nCost <= @nMatrixVert + 1))
THEN @nMatrixDiag + @nCost
WHEN ((@nMatrixSide + 1 <= @nMatrixDiag + @nCost) AND (@nMatrixSide + 1 <= @nMatrixVert + 1))
THEN @nMatrixSide + 1
ELSE @nMatrixVert + 1
END;
IF (@nRightPos < @nRightLength)
SET @nsDistance = SubString(@nsDistance, 1, @nRightPos) + NChar(@nMatrixTemp)
+ SubString(@nsDistance, @nRightPos + 2, @nRightLength - @nRightPos)
ELSE
SET @nsDistance = SubString(@nsDistance, 1, @nRightPos) + NChar(@nMatrixTemp);
SET @nMatrixDiag = @nMatrixVert;
SET @nRightPos = @nRightPos + 1;
END;
SET @nLeftPos = @nLeftPos + 1;
END;
END;
RETURN Unicode(SubString(@nsDistance, @nRightLength + 1, 1)) - 1;
END;

I use SQL Function and the SQL query at link below. Works great for me. These 2 togethger give me the best search results like "google's did you mean this"