| Author |
Topic  |
|
Seventhnight
Flowing Fount of Yak Knowledge
USA
2878 Posts |
Posted - 02/01/2006 : 09:26:28
|
I think the issue with that is the need for a loop. You don't have to loop to deal with any number of duplicate spaces. What is important is that you choose a character that will not be found in the string.
Consider how much effort you are doing. What if you don't want to create a table??
This solution is inline and flexible. No tables or loops needed! 
Declare @ph varchar(10)
Set @ph = char(10) + char(13)
SELECT Replace(Replace(Replace(myCol,' ',' '+@ph),@ph+' ',''),@ph,'')
FROM (SELECT 'This is a test
of the emergency broadcast system!'
AS myCol) AS A
Corey
 Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."  |
 |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
USA
6997 Posts |
Posted - 02/15/2006 : 19:14:08
|
I think this code is a good solution using a nested series of replaces to reduce a string of up to 8000 of the same character to single character with one statement.
The code below shows that any length of X's from 1 to 8000 X's can be reduced to a single X with just 6 nested replaces.
It is based on using the square root of each length as the number of characters to replace. For example, the square root of 8000 is about 89.44, so I replaced 89 X's with 1. This gave a maximum string length of 177, with a square root of 13.30, so I replaced 13 X's with 1 in the next cycle. And so on.
select
MAX_LENGTH = max(len(y))
from
(
select
y =
-- This series of replaces will convert any length
-- of repeating characters from 1 to 8000 to
-- a single character.
replace(replace(replace(replace(replace(
replace(x,replicate(CHR,89),CHR)
,replicate(CHR,13),CHR)
,replicate(CHR,5),CHR)
,replicate(CHR,3),CHR)
,replicate(CHR,2),CHR)
,replicate(CHR,2),CHR)
,x
from
(
select
x =replicate(CHR,number),
CHR
from
dbo.F_TABLE_NUMBER_RANGE(1,8000)
cross join
(select CHR = 'x') bb
)a )aa
Results:
MAX_LENGTH
-----------
1
(1 row(s) affected)
CODO ERGO SUM |
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
|
|
Kristen
Test
United Kingdom
22191 Posts |
Posted - 02/16/2006 : 05:52:56
|
" replace(replace(replace(replace(replace( replace(x,replicate(CHR,89),CHR) ,replicate(CHR,13),CHR) ,replicate(CHR,5),CHR) ,replicate(CHR,3),CHR) ,replicate(CHR,2),CHR) ,replicate(CHR,2),CHR) "
Do you reckon this is more efficient than Corey's:
1) replace CHR-CHR with CHR-SPECIAL ** 2) replace SPECIAL-CHR with nothing 3) replace SPECIAL with nothing
??
** Actually Corey replaces replace (single) CHR with CHR-SPECIAL, but I think that is unnecessary as we only want to replace at double-CHR at a minimum, plus replacing CHR-CHR with CHR-SPECIAL does not extend the string (and possibly cause an overflow)
I am working on that approach over at http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61794 but no-one has replied yet 
Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
Posted - 02/16/2006 : 06:11:01
|
>Do you reckon this is more efficient than Corey's
Nah, I'd go the 3-replace as you say: it's bound to be quicker. Unless you're in the highly improbable situation where there are no characters guaranteed to be absent from the original text. Just don't use thorn!  |
Edited by - Arnold Fribble on 02/16/2006 06:11:30 |
 |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
USA
6997 Posts |
Posted - 02/16/2006 : 06:55:01
|
A limitation of Corey's approach is that it increases the length of the string in an intermediate step, so you can overflow a varchar 8000 string. Since the first step doubles the characters you are replacing, it looks like 4000 or less would be the longest string of a single characters it could work with in a varchar 8000 column. Also, you do have to deal with finding an unused character. Seems like 2 hard to overcome limitations if you are looking for a general purpose method.
I didn't try anything over 8000 characters, because I was working with SQL Server 2000.
It might be interesting to actually test some of these against each other to see if there is a measurable difference in runtime, but it's hard to say which is more efficient unless you hold them to the same standard about what string lengths and types they can work with.
If the string length is much shorter, the solution I posted may require fewer replaces. If the string length is 8000, Corey's method would require at least one additional outer replace to reduce the length to a size it would work with.
CODO ERGO SUM |
 |
|
|
Kristen
Test
United Kingdom
22191 Posts |
Posted - 02/16/2006 : 07:07:18
|
"it increases the length of the string in an intermediate step"
I have changed Corey's approach from
Replace CHR with CHR-SPECIAL to Replace CHR-CHR with CHR-SPECIAL
This prevents increasing the length, and does away with replaces for single CHR instances, which are redundant anyway.
"finding an unused character"
I am planning to use a Tally table to find a suitable, unused, character, but this obviously adds a bit to the job in hand.
Kristen |
 |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
USA
6997 Posts |
Posted - 02/16/2006 : 12:44:32
|
I ran some tests using my method, and Kristin's method with the code below.
I loaded a temp table with 4 different characters with every length from 1 to 8000 characters, for 32,000 rows ordered randomly. I ran the table through both convert methods. Both methods produced correct results.
I repeated this test a number of times, and got very consistant results. Kristin's method seems to run about 8 to 9 times as long.
drop table #temp
go
create table #temp
(
PK uniqueidentifier not null
primary key clustered,
CHR char(1) not null,
X varchar(8000) not null,
)
go
declare @start_time datetime
declare @end_time datetime
set @start_time = getdate()
-- Load table in random order
insert into #temp
select top 100 percent
PK,
CHR,
X
from
(
select
PK = newid(),
CHR,
X =replicate(CHR,number)
from
dbo.F_TABLE_NUMBER_RANGE(1,8000)
cross join
(
select CHR = 'x' union all
select CHR = 'Z' union all
select CHR = 'a' union all
select CHR = 'b'
) bb
) a
order by
PK
set @end_time = getdate()
select
[Load #temp millisec] =
datediff(ms,@start_time,@end_time),
[Start] = convert(varchar(23),@start_time,121),
[End] = convert(varchar(23),@end_time,121)
go
declare @start_time datetime
declare @end_time datetime
set @start_time = getdate()
select
CHR,
MIN_LEN_ORIGINAL = min(datalength(X)),
MAX_LEN_ORIGINAL = max(datalength(X)),
MIN_LEN_RESULT = min(datalength(Y)),
MAX_LEN_RESULT = max(datalength(Y)),
ROW_COUNT = count(*)
from
(
select
y =
replace(replace(replace(replace(replace(
replace(x,replicate(CHR,89),CHR)
,replicate(CHR,13),CHR)
,replicate(CHR,5),CHR)
,replicate(CHR,3),CHR)
,replicate(CHR,2),CHR)
,replicate(CHR,2),CHR),
X,
CHR
from
#temp
) a
group by
CHR
order by
CHR
set @end_time = getdate()
select
[Method 1 millisec] =
datediff(ms,@start_time,@end_time),
[Start] = convert(varchar(23),@start_time,121),
[End] = convert(varchar(23),@end_time,121)
go
declare @start_time datetime
declare @end_time datetime
set @start_time = getdate()
select
CHR,
MIN_LEN_ORIGINAL = min(datalength(X)),
MAX_LEN_ORIGINAL = max(datalength(X)),
MIN_LEN_RESULT = min(datalength(Y)),
MAX_LEN_RESULT = max(datalength(Y)),
ROW_COUNT = count(*)
from
(
select
y =
replace(replace(replace(
X,CHR+CHR,CHR+'þ'),'þ'+CHR,''),'þ',''),
X,
CHR
from
#temp
) a
group by
CHR
order by
CHR
set @end_time = getdate()
select
[Method 2 millisec] =
datediff(ms,@start_time,@end_time),
[Start] = convert(varchar(23),@start_time,121),
[End] = convert(varchar(23),@end_time,121)
Results:
(32000 row(s) affected)
Load #temp millisec Start End
------------------- ----------------------- -----------------------
18983 2006-02-16 12:29:52.513 2006-02-16 12:30:11.497
(1 row(s) affected)
CHR MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT
---- ---------------- ---------------- -------------- -------------- -----------
a 1 8000 1 1 8000
b 1 8000 1 1 8000
x 1 8000 1 1 8000
Z 1 8000 1 1 8000
(4 row(s) affected)
Method 1 millisec Start End
----------------- ----------------------- -----------------------
5496 2006-02-16 12:30:11.560 2006-02-16 12:30:17.057
(1 row(s) affected)
CHR MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT
---- ---------------- ---------------- -------------- -------------- -----------
a 1 8000 1 1 8000
b 1 8000 1 1 8000
x 1 8000 1 1 8000
Z 1 8000 1 1 8000
(4 row(s) affected)
Method 2 millisec Start End
----------------- ----------------------- -----------------------
47610 2006-02-16 12:30:17.073 2006-02-16 12:31:04.683
(1 row(s) affected)
CODO ERGO SUM |
 |
|
|
Kristen
Test
United Kingdom
22191 Posts |
Posted - 02/16/2006 : 13:06:11
|
"Kristin's method seems to run about 8 to 9 times as long"
In that case its Corey's method!!
I'll take a look, looks very interesting, thanks.
Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
Posted - 02/16/2006 : 13:57:27
|
That test data does not strike me as representative of most use cases.
|
 |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
USA
6997 Posts |
Posted - 02/16/2006 : 14:45:00
|
quote: Originally posted by Arnold Fribble
That test data does not strike me as representative of most use cases.
I agree; I never claimed it was representative of most use cases.
It is, however, a quantifiable, repeatable result with a good amount of data, and it does show that Kristin's method is not "bound to be quicker".
Feel free to post some test results from representative cases. 
CODO ERGO SUM |
 |
|
|
Seventhnight
Flowing Fount of Yak Knowledge
USA
2878 Posts |
Posted - 02/16/2006 : 17:01:55
|
Here is a different repeatable test that shows the opposite:
drop table #temp
go
create table #temp
(
PK uniqueidentifier not null primary key clustered,
CHR char(1) not null,
X varchar(8000) not null,
)
go
declare @start_time datetime
declare @end_time datetime
set @start_time = getdate()
-- Load table in random order
insert into #temp
select top 100 percent
PK,
CHR,
X
from
(
select
PK = newid(),
CHR,
X=replicate(CHR,number)
from
dbo.getSequence(1,500,1)
cross join
(
select CHR = 'x' union all
select CHR = 'Z' union all
select CHR = 'a' union all
select CHR = 'b'
) bb
) a
order by
PK
declare @loop int,
@pos1 int,
@pos2 int
Set @loop = 0
While @loop < 10--100
Begin
Update #temp
Set
@pos1 = pos1%len(a.X),
@pos2 = pos2%len(a.X),
X = case
when @pos1=0 or @pos2=0 or @pos1=@pos2 then
a.X
when @pos1 < @pos2 then
stuff(a.x,@pos1,@pos2-@pos1,replicate(newChar,@pos2-@pos1))
when @pos2 < @pos1 then
stuff(a.x,@pos2,@pos1-@pos2,replicate(newChar,@pos1-@pos2))
else a.X end
From #temp a
Cross Join
(
Select
newChar = dbo.getRandomChar(),
Pos1 = (dbo.getRandomDigit()*1000 + dbo.getRandomDigit()*100 + dbo.getRandomDigit()*10 + dbo.getRandomDigit()*1),
Pos2 = (dbo.getRandomDigit()*1000 + dbo.getRandomDigit()*100 + dbo.getRandomDigit()*10 + dbo.getRandomDigit()*1)
) b
Set @Loop = @Loop + 1
ENd
Select * From #temp
set @end_time = getdate()
select
[Load #temp millisec] =
datediff(ms,@start_time,@end_time),
[Start] = convert(varchar(23),@start_time,121),
[End] = convert(varchar(23),@end_time,121)
go
declare @start_time datetime
declare @end_time datetime
set @start_time = getdate()
select
CHR,
MIN_LEN_ORIGINAL = min(datalength(X)),
MAX_LEN_ORIGINAL = max(datalength(X)),
MIN_LEN_RESULT = min(datalength(Y)),
MAX_LEN_RESULT = max(datalength(Y)),
ROW_COUNT = count(*)
from
(
select
y =
replace(replace(replace(replace(replace(
replace(x,replicate(CHR,89),CHR)
,replicate(CHR,13),CHR)
,replicate(CHR,5),CHR)
,replicate(CHR,3),CHR)
,replicate(CHR,2),CHR)
,replicate(CHR,2),CHR),
X,
CHR
from #temp
) a
group by CHR
order by CHR
set @end_time = getdate()
select
[Method 1 millisec] =
datediff(ms,@start_time,@end_time),
[Start] = convert(varchar(23),@start_time,121),
[End] = convert(varchar(23),@end_time,121)
go
declare @start_time datetime
declare @end_time datetime
declare @UNIQ varchar(2)
Set @UNIQ = 'þ'
set @start_time = getdate()
select
CHR,
MIN_LEN_ORIGINAL = min(datalength(X)),
MAX_LEN_ORIGINAL = max(datalength(X)),
MIN_LEN_RESULT = min(datalength(Y)),
MAX_LEN_RESULT = max(datalength(Y)),
ROW_COUNT = count(*)
from
(
select
y =
replace(replace(replace(
X,replicate(CHR,2),CHR+@UNIQ),@UNIQ+CHR,''),@UNIQ,''),
X,
CHR
from #temp
) a
group by CHR
order by CHR
set @end_time = getdate()
select
[Method 2 millisec] =
datediff(ms,@start_time,@end_time),
[Start] = convert(varchar(23),@start_time,121),
[End] = convert(varchar(23),@end_time,121)
with results:
(2000 row(s) affected)
Load #temp millisec Start End
------------------- ----------------------- -----------------------
40676 2006-02-16 16:59:00.920 2006-02-16 16:59:41.597
(1 row(s) affected)
CHR MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT
---- ---------------- ---------------- -------------- -------------- -----------
a 1 500 1 495 500
b 1 500 1 497 500
x 1 500 1 496 500
Z 1 500 1 497 500
(4 row(s) affected)
Method 1 millisec Start End
----------------- ----------------------- -----------------------
303 2006-02-16 16:59:41.687 2006-02-16 16:59:41.990
(1 row(s) affected)
CHR MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT
---- ---------------- ---------------- -------------- -------------- -----------
a 1 500 1 495 500
b 1 500 0 497 500
x 1 500 1 496 500
Z 1 500 1 494 500
(4 row(s) affected)
Method 2 millisec Start End
----------------- ----------------------- -----------------------
150 2006-02-16 16:59:42.000 2006-02-16 16:59:42.150
(1 row(s) affected)
random character and digit functions provided on request.
Corey
 Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."  |
Edited by - Seventhnight on 02/16/2006 17:04:52 |
 |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
USA
6997 Posts |
Posted - 02/16/2006 : 18:11:13
|
It looks like the test that Corey ran had an error for CHR = b for the second method. I don't think there should be a case where the string would be reduced to zero length.
CHR MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT
---- ---------------- ---------------- -------------- -------------- -----------
a 1 500 1 495 500
b 1 500 0 497 500
x 1 500 1 496 500
Z 1 500 1 494 500
CODO ERGO SUM |
 |
|
|
Seventhnight
Flowing Fount of Yak Knowledge
USA
2878 Posts |
Posted - 02/16/2006 : 19:28:52
|
don't know if i can duplicate, but I'll try tomorrow 
Corey
 Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."  |
 |
|
|
Kristen
Test
United Kingdom
22191 Posts |
Posted - 02/17/2006 : 01:38:21
|
In my case I usually have the odd double, triple, and occasionally worse, substring to be reduced. I don't have a massive substring to be reduced, but I'd obviously like my function to handle that should it arise.
I extracted 4,000,000 rows of data which contains '00' and multiple '0...0' into a one column table, and ran some tests on it.
Max length of column = 1000, Average = 94
Count of rows containing: 00 4,321,659 000 1,768,588 0000 10,916 00000 1,499 000000 705
Corey's method 33 seconds MVJ method 56 seconds
which is about the ratio that Corey's tests gave, so on my "average data" I reckon that is the method I need.
Kristen |
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
Posted - 02/17/2006 : 03:54:46
|
Oh, hark, I hear Annie Lennox singing... :þ
|
Edited by - Arnold Fribble on 02/17/2006 03:56:41 |
 |
|
|
Seventhnight
Flowing Fount of Yak Knowledge
USA
2878 Posts |
Posted - 02/17/2006 : 09:11:26
|
quote: Originally posted by Michael Valentine Jones
It looks like the test that Corey ran had an error for CHR = b for the second method. I don't think there should be a case where the string would be reduced to zero length.
CHR MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT
---- ---------------- ---------------- -------------- -------------- -----------
a 1 500 1 495 500
b 1 500 0 497 500
x 1 500 1 496 500
Z 1 500 1 494 500
CODO ERGO SUM
By the way... this was becuase I used a 'g' as the @uniq... i changed it to þ and no problems 
Corey
 Co-worker on children "...when I have children, I'm going to beat them. Not because their bad, but becuase I think it would be fun ..."  |
 |
|
|
Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)
USA
6997 Posts |
Posted - 02/17/2006 : 10:06:41
|
quote: Originally posted by Arnold Fribble
Oh, hark, I hear Annie Lennox singing... :þ
Annie Lennox??? I don't get the reference
On a different note, it looks like the method that is fastest is dependant on the content of the data and the length of strings to be replaced.
Does anyone have a good solution to the problem of picking an unused token character in Corey's method, or do you just have to take your chances? My guess is that searching the data for an unused character would really slow it down.
CODO ERGO SUM |
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
Posted - 02/17/2006 : 10:41:29
|
quote: Originally posted by Michael Valentine Jones Annie Lennox??? I don't get the reference
"Thorn in my side". I'd imagined that the thorn character had gotten its "Revenge" after my earlier warnings.
|
 |
|
|
Arnold Fribble
Yak-finder General
United Kingdom
1961 Posts |
Posted - 02/17/2006 : 11:21:45
|
quote:
Does anyone have a good solution to the problem of picking an unused token character in Corey's method, or do you just have to take your chances? My guess is that searching the data for an unused character would really slow it down.
I was initially going to suggest to Kristen in that other thread ( http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61794 ) that CHAR(0) would be a safe choice, but it has certain issues in some collations.
If the column is nvarchar, then any of the 32 codepoints U+FDD0..U+FDEF should be safe. These are defined as noncharacters in Unicode so should never be present in stored text. Though SQL Server doesn't know that.
|
Edited by - Arnold Fribble on 02/17/2006 11:28:37 |
 |
|
Topic  |
|
|
|