SQL Server Forums: SQL Server Development (2000)

Too Many Spaces Between Words
Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 02/01/2006 :  09:26:28
I think the issue with that approach is the need for a loop. You don't have to loop to handle any number of duplicate spaces; what matters is choosing a placeholder character that will not be found in the string.

Consider how much work that takes. What if you don't want to create a table?

This solution is inline and flexible: no tables or loops needed!

Declare @ph varchar(10)
Set @ph = char(10) + char(13)	-- placeholder: assumed never to occur in the data

SELECT Replace(Replace(Replace(myCol,
	' ', ' ' + @ph),	-- 1) tag every space with the placeholder
	@ph + ' ', ''),		-- 2) delete placeholder+space pairs, so only the first space of each run survives
	@ph, '')		-- 3) strip the placeholder left after that surviving space
FROM (SELECT 'This    is                   a                          test 
of      the      emergency    broadcast     system!' 
	AS myCol) AS A


Corey

Co-worker on children: "...when I have children, I'm going to beat them. Not because they're bad, but because I think it would be fun ..."

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 02/15/2006 :  19:14:08
I think this code is a good solution: a nested series of replaces that reduces a run of up to 8000 of the same character to a single character in one statement.

The code below shows that any run of X's from 1 to 8000 long can be reduced to a single X with just 6 nested replaces.

It is based on using the square root of each length as the number of characters to replace. For example, the square root of 8000 is about 89.44, so I replaced 89 X's with 1. That leaves a maximum string length of 177, whose square root is about 13.30, so I replaced 13 X's with 1 in the next cycle, and so on.
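
To make the arithmetic concrete: REPLACE makes a single left-to-right pass, so one pass replacing k-in-a-row with a single character turns a run of n identical characters into (n / k) + (n % k) characters. A minimal sketch tracing n = 8000 through the six sizes above (plain T-SQL, nothing assumed beyond those sizes):

declare @n int
set @n = 8000
set @n = (@n / 89) + (@n % 89)	-- 8000 -> 168
set @n = (@n / 13) + (@n % 13)	-- 168  -> 24
set @n = (@n /  5) + (@n %  5)	-- 24   -> 8
set @n = (@n /  3) + (@n %  3)	-- 8    -> 4
set @n = (@n /  2) + (@n %  2)	-- 4    -> 2
set @n = (@n /  2) + (@n %  2)	-- 2    -> 1
print @n			-- 1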


select
	MAX_LENGTH = max(len(y))
from
(
select
	y =
	-- This series of replaces will convert any length
	-- of repeating characters from 1 to 8000 to
	-- a single character.
	replace(replace(replace(replace(replace(
	replace(x,replicate(CHR,89),CHR)
	,replicate(CHR,13),CHR)
	,replicate(CHR,5),CHR)
	,replicate(CHR,3),CHR)
	,replicate(CHR,2),CHR)
	,replicate(CHR,2),CHR)
	,x
from
(
	select
		x =replicate(CHR,number),
		CHR
	from
		dbo.F_TABLE_NUMBER_RANGE(1,8000)
		cross join
		(select CHR = 'x') bb
)a )aa

Results:

MAX_LENGTH  
----------- 
1

(1 row(s) affected)


CODO ERGO SUM

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/16/2006 :  03:33:17
Actually, Michael, a sequence of 6 replaces that can reduce the longest run of spaces to a single space is 231, 21, 6, 3, 2, 2, which will reduce runs of anything up to 53591 spaces.

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=21511

Proving this is left as an exercise for the reader.
(Hint: http://www.research.att.com/~njas/sequences/A007501
or in Haskell iterate (\n->n*(n+1)`div`2) 2)
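
For anyone without Haskell to hand, a throwaway T-SQL equivalent (a sketch; the printed values are the terms of A007501, and reading them in reverse gives the replace sizes):

declare @n int
set @n = 2
while @n < 60000
begin
	print @n		-- 2, 3, 6, 21, 231, 26796
	set @n = @n * (@n + 1) / 2
end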

Edited by - Arnold Fribble on 02/16/2006 04:46:09

Kristen
Test

United Kingdom
22403 Posts

Posted - 02/16/2006 :  05:52:56
" replace(replace(replace(replace(replace(
replace(x,replicate(CHR,89),CHR)
,replicate(CHR,13),CHR)
,replicate(CHR,5),CHR)
,replicate(CHR,3),CHR)
,replicate(CHR,2),CHR)
,replicate(CHR,2),CHR)
"

Do you reckon this is more efficient than Corey's:

1) replace CHR-CHR with CHR-SPECIAL **
2) replace SPECIAL-CHR with nothing
3) replace SPECIAL with nothing

??

** Actually Corey replaces (single) CHR with CHR-SPECIAL, but I think that is unnecessary, as we only need to act on double-CHR at a minimum; plus, replacing CHR-CHR with CHR-SPECIAL does not extend the string (and so cannot cause an overflow). A sketch of the modified version is below.
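
A minimal sketch of the modified three-step replace (the þ placeholder is only an assumption; any character known to be absent from the data will do):

SELECT Replace(Replace(Replace(
	'This    is    a    test',
	'  ', ' þ'),	-- 1) CHR-CHR -> CHR-SPECIAL (no growth)
	'þ ', ''),	-- 2) SPECIAL-CHR -> nothing
	'þ', '')	-- 3) any leftover SPECIAL -> nothing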

I am working on that approach over at
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61794
but no-one has replied yet

Kristen

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/16/2006 :  06:11:01
>Do you reckon this is more efficient than Corey's

Nah, I'd go the 3-replace as you say: it's bound to be quicker. Unless you're in the highly improbable situation where there are no characters guaranteed to be absent from the original text. Just don't use thorn!

Edited by - Arnold Fribble on 02/16/2006 06:11:30

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 02/16/2006 :  06:55:01
A limitation of Corey's approach is that it increases the length of the string in an intermediate step, so you can overflow a varchar(8000) string. Since the first step doubles the characters you are replacing, it looks like 4000 or less is the longest run of a single character it could work with in a varchar(8000) column. Also, you do have to deal with finding an unused character. Those seem like two hard-to-overcome limitations if you are looking for a general-purpose method.
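
A quick sketch of the overflow (char(7) stands in for the placeholder; on SQL Server 2000 the intermediate varchar expression is silently capped at 8000 bytes):

declare @s varchar(8000)
set @s = replicate(' ', 5000)
-- tagging every space doubles the run: 10000 characters are needed,
-- but the varchar expression is silently truncated to 8000
select datalength(replace(@s, ' ', ' ' + char(7)))	-- returns 8000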

I didn't try anything over 8000 characters, because I was working with SQL Server 2000.

It might be interesting to actually test some of these against each other to see if there is a measurable difference in runtime, but it's hard to say which is more efficient unless you hold them to the same standard about what string lengths and types they can work with.

If the string length is much shorter, the solution I posted may require fewer replaces. If the string length is 8000, Corey's method would require at least one additional outer replace to reduce the length to a size it would work with.




CODO ERGO SUM

Kristen
Test

United Kingdom
22403 Posts

Posted - 02/16/2006 :  07:07:18
"it increases the length of the string in an intermediate step"

I have changed Corey's approach from

Replace CHR with CHR-SPECIAL
to
Replace CHR-CHR with CHR-SPECIAL

This prevents increasing the length, and does away with replaces for single CHR instances, which are redundant anyway.

"finding an unused character"

I am planning to use a Tally table to find a suitable, unused character, but this obviously adds a bit to the job in hand.
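
Something along these lines, assuming a Tally table dbo.Tally(N) covering 1 to 255 (the names here are illustrative):

declare @s varchar(8000), @token char(1)
set @s = 'the   string   to   clean'
-- lowest character code that never appears in the string
select @token = char(min(N))
from dbo.Tally
where N between 1 and 255
	and charindex(char(N), @s) = 0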

Kristen

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 02/16/2006 :  12:44:32
I ran some tests using my method and Kristen's method with the code below.

I loaded a temp table with 4 different characters at every length from 1 to 8000 characters, 32,000 rows in all, ordered randomly. I ran the table through both convert methods. Both methods produced correct results.

I repeated this test a number of times and got very consistent results. Kristen's method seems to run about 8 to 9 times as long.


drop table #temp
go
create table #temp
(
PK	uniqueidentifier	not null
primary key clustered,
CHR	char(1)			not null,
X	varchar(8000)		not null
)
go
declare @start_time	datetime
declare @end_time	datetime

set @start_time = getdate()

-- Load table in random order
insert into #temp
select	top 100 percent
	PK,
	CHR,
	X
from
(
select
	PK = newid(),
	CHR,
	X =replicate(CHR,number)
from
	dbo.F_TABLE_NUMBER_RANGE(1,8000)
	cross join
	(
	select CHR = 'x'  union all
	select CHR = 'Z'  union all
	select CHR = 'a'  union all
	select CHR = 'b'
	) bb
) a
order by
	PK

set @end_time = getdate()

select
	[Load #temp millisec] =
		datediff(ms,@start_time,@end_time),
	[Start] = convert(varchar(23),@start_time,121),
	[End] = convert(varchar(23),@end_time,121)

go

declare @start_time	datetime
declare @end_time	datetime

set @start_time = getdate()

select
	CHR,
	MIN_LEN_ORIGINAL = min(datalength(X)),
	MAX_LEN_ORIGINAL = max(datalength(X)),
	MIN_LEN_RESULT = min(datalength(Y)),
	MAX_LEN_RESULT = max(datalength(Y)),
	ROW_COUNT = count(*)
from
(
select
	y =
	replace(replace(replace(replace(replace(
	replace(x,replicate(CHR,89),CHR)
	,replicate(CHR,13),CHR)
	,replicate(CHR,5),CHR)
	,replicate(CHR,3),CHR)
	,replicate(CHR,2),CHR)
	,replicate(CHR,2),CHR),
	X,
	CHR
from
	#temp
) a
group by
	CHR
order by
	CHR

set @end_time = getdate()

select 
	[Method 1 millisec] =
		datediff(ms,@start_time,@end_time),
	[Start] = convert(varchar(23),@start_time,121),
	[End] = convert(varchar(23),@end_time,121)

go


declare @start_time	datetime
declare @end_time	datetime

set @start_time = getdate()

select
	CHR,
	MIN_LEN_ORIGINAL = min(datalength(X)),
	MAX_LEN_ORIGINAL = max(datalength(X)),
	MIN_LEN_RESULT = min(datalength(Y)),
	MAX_LEN_RESULT = max(datalength(Y)),
	ROW_COUNT = count(*)
from
(
select
	y =
	replace(replace(replace(
	X,CHR+CHR,CHR+'þ'),'þ'+CHR,''),'þ',''),
	X,
	CHR
from
	#temp
) a
group by
	CHR
order by
	CHR

set @end_time = getdate()

select
	[Method 2 millisec] =
		datediff(ms,@start_time,@end_time),
	[Start] = convert(varchar(23),@start_time,121),
	[End] = convert(varchar(23),@end_time,121)


Results:


(32000 row(s) affected)

Load #temp millisec Start                   End                     
------------------- ----------------------- ----------------------- 
18983               2006-02-16 12:29:52.513 2006-02-16 12:30:11.497

(1 row(s) affected)

CHR  MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT   
---- ---------------- ---------------- -------------- -------------- ----------- 
a    1                8000             1              1              8000
b    1                8000             1              1              8000
x    1                8000             1              1              8000
Z    1                8000             1              1              8000

(4 row(s) affected)

Method 1 millisec Start                   End                     
----------------- ----------------------- ----------------------- 
5496              2006-02-16 12:30:11.560 2006-02-16 12:30:17.057

(1 row(s) affected)

CHR  MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT   
---- ---------------- ---------------- -------------- -------------- ----------- 
a    1                8000             1              1              8000
b    1                8000             1              1              8000
x    1                8000             1              1              8000
Z    1                8000             1              1              8000

(4 row(s) affected)

Method 2 millisec Start                   End                     
----------------- ----------------------- ----------------------- 
47610             2006-02-16 12:30:17.073 2006-02-16 12:31:04.683

(1 row(s) affected)




CODO ERGO SUM

Kristen
Test

United Kingdom
22403 Posts

Posted - 02/16/2006 :  13:06:11
"Kristin's method seems to run about 8 to 9 times as long"

In that case its Corey's method!!

I'll take a look, looks very interesting, thanks.

Kristen

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/16/2006 :  13:57:27
That test data does not strike me as representative of most use cases.

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 02/16/2006 :  14:45:00
quote:
Originally posted by Arnold Fribble

That test data does not strike me as representative of most use cases.



I agree; I never claimed it was representative of most use cases.

It is, however, a quantifiable, repeatable result with a good amount of data, and it does show that Kristen's method is not "bound to be quicker".

Feel free to post some test results from representative cases.






CODO ERGO SUM

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 02/16/2006 :  17:01:55
Here is a different repeatable test that shows the opposite:


drop table #temp
go
create table #temp
(
PK	uniqueidentifier	not null primary key clustered,
CHR	char(1)			not null,
X	varchar(8000)		not null
)
go
declare @start_time	datetime
declare @end_time	datetime

set @start_time = getdate()

-- Load table in random order
insert into #temp
select top 100 percent
	PK,
	CHR,
	X
from
	(
	select
		PK = newid(),
		CHR,
		X=replicate(CHR,number)
	from
		dbo.getSequence(1,500,1)
		cross join
		(
		select CHR = 'x'  union all
		select CHR = 'Z'  union all
		select CHR = 'a'  union all
		select CHR = 'b'
		) bb
	) a
order by
	PK

declare @loop int,
	@pos1 int,
	@pos2 int
Set @loop = 0
While @loop < 10 --100
Begin

	Update #temp
	Set
		@pos1 = pos1%len(a.X),
		@pos2 = pos2%len(a.X),
		X = case 
			when @pos1=0 or @pos2=0 or @pos1=@pos2 then
				a.X
			when @pos1 < @pos2 then
				stuff(a.x,@pos1,@pos2-@pos1,replicate(newChar,@pos2-@pos1))
			when @pos2 < @pos1 then
				stuff(a.x,@pos2,@pos1-@pos2,replicate(newChar,@pos1-@pos2))
			else a.X end
	From #temp a
	Cross Join
		(
		Select 
			newChar = dbo.getRandomChar(),
			Pos1 = (dbo.getRandomDigit()*1000 + dbo.getRandomDigit()*100 + dbo.getRandomDigit()*10 + dbo.getRandomDigit()*1),
			Pos2 = (dbo.getRandomDigit()*1000 + dbo.getRandomDigit()*100 + dbo.getRandomDigit()*10 + dbo.getRandomDigit()*1)
		) b

	Set @Loop = @Loop + 1
End

Select * From #temp

set @end_time = getdate()

select
	[Load #temp millisec] =
		datediff(ms,@start_time,@end_time),
	[Start] = convert(varchar(23),@start_time,121),
	[End] = convert(varchar(23),@end_time,121)

go

declare @start_time	datetime
declare @end_time	datetime

set @start_time = getdate()

select
	CHR,
	MIN_LEN_ORIGINAL = min(datalength(X)),
	MAX_LEN_ORIGINAL = max(datalength(X)),
	MIN_LEN_RESULT = min(datalength(Y)),
	MAX_LEN_RESULT = max(datalength(Y)),
	ROW_COUNT = count(*)
from
	(
	select
		y =
		replace(replace(replace(replace(replace(
		replace(x,replicate(CHR,89),CHR)
		,replicate(CHR,13),CHR)
		,replicate(CHR,5),CHR)
		,replicate(CHR,3),CHR)
		,replicate(CHR,2),CHR)
		,replicate(CHR,2),CHR),
		X,
		CHR
	from #temp
	) a
group by CHR
order by CHR

set @end_time = getdate()

select 
	[Method 1 millisec] =
		datediff(ms,@start_time,@end_time),
	[Start] = convert(varchar(23),@start_time,121),
	[End] = convert(varchar(23),@end_time,121)

go


declare @start_time	datetime
declare @end_time	datetime
declare @UNIQ		varchar(2)
Set @UNIQ = 'þ'

set @start_time = getdate()

select
	CHR,
	MIN_LEN_ORIGINAL = min(datalength(X)),
	MAX_LEN_ORIGINAL = max(datalength(X)),
	MIN_LEN_RESULT = min(datalength(Y)),
	MAX_LEN_RESULT = max(datalength(Y)),
	ROW_COUNT = count(*)
from
	(
	select
		y =
		replace(replace(replace(
		X,replicate(CHR,2),CHR+@UNIQ),@UNIQ+CHR,''),@UNIQ,''),
		X,
		CHR
	from #temp
	) a
group by CHR
order by CHR

set @end_time = getdate()

select
	[Method 2 millisec] =
		datediff(ms,@start_time,@end_time),
	[Start] = convert(varchar(23),@start_time,121),
	[End] = convert(varchar(23),@end_time,121)


with results:

(2000 row(s) affected)

Load #temp millisec Start                   End                     
------------------- ----------------------- ----------------------- 
40676               2006-02-16 16:59:00.920 2006-02-16 16:59:41.597

(1 row(s) affected)

CHR  MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT   
---- ---------------- ---------------- -------------- -------------- ----------- 
a    1                500              1              495            500
b    1                500              1              497            500
x    1                500              1              496            500
Z    1                500              1              497            500

(4 row(s) affected)

Method 1 millisec Start                   End                     
----------------- ----------------------- ----------------------- 
303               2006-02-16 16:59:41.687 2006-02-16 16:59:41.990

(1 row(s) affected)

CHR  MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT   
---- ---------------- ---------------- -------------- -------------- ----------- 
a    1                500              1              495            500
b    1                500              0              497            500
x    1                500              1              496            500
Z    1                500              1              494            500

(4 row(s) affected)

Method 2 millisec Start                   End                     
----------------- ----------------------- ----------------------- 
150               2006-02-16 16:59:42.000 2006-02-16 16:59:42.150

(1 row(s) affected)


Random character and digit functions available on request.

Corey

Co-worker on children: "...when I have children, I'm going to beat them. Not because they're bad, but because I think it would be fun ..."

Edited by - Seventhnight on 02/16/2006 17:04:52

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 02/16/2006 :  18:11:13
It looks like the test that Corey ran had an error for CHR = b with the second method. There should never be a case where the string is reduced to zero length.


CHR  MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT   
---- ---------------- ---------------- -------------- -------------- ----------- 
a    1                500              1              495            500
b    1                500              0              497            500
x    1                500              1              496            500
Z    1                500              1              494            500




CODO ERGO SUM

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 02/16/2006 :  19:28:52
Don't know if I can duplicate it, but I'll try tomorrow.

Corey

Co-worker on children: "...when I have children, I'm going to beat them. Not because they're bad, but because I think it would be fun ..."

Kristen
Test

United Kingdom
22403 Posts

Posted - 02/17/2006 :  01:38:21
In my case I usually have the odd double, triple, and occasionally longer, repeated substring to reduce. I don't have massive runs, but I'd obviously like my function to handle them should they arise.

I extracted 4,000,000 rows of data containing '00' and longer '0...0' runs into a one-column table, and ran some tests on it.

Max length of column = 1000, Average = 94

Count of rows containing:
00 4,321,659
000 1,768,588
0000 10,916
00000 1,499
000000 705

Corey's method 33 seconds
MVJ method 56 seconds

which is about the ratio that Corey's tests gave, so on my "average data" I reckon that is the method I need.

Kristen

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/17/2006 :  03:54:46
Oh, hark, I hear Annie Lennox singing... :þ

Edited by - Arnold Fribble on 02/17/2006 03:56:41

Seventhnight
Flowing Fount of Yak Knowledge

USA
2878 Posts

Posted - 02/17/2006 :  09:11:26
quote:
Originally posted by Michael Valentine Jones

It looks like the test that Corey ran had an error for CHR = b for the second method. I don't think there should be a case where the string would be reduced to zero length.


CHR  MIN_LEN_ORIGINAL MAX_LEN_ORIGINAL MIN_LEN_RESULT MAX_LEN_RESULT ROW_COUNT   
---- ---------------- ---------------- -------------- -------------- ----------- 
a    1                500              1              495            500
b    1                500              0              497            500
x    1                500              1              496            500
Z    1                500              1              494            500




CODO ERGO SUM



By the way... this was because I used 'g' as the @UNIQ placeholder... I changed it to þ and there were no problems. A quick illustration is below.
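
A sketch of the failure mode (with 'g' as @UNIQ and CHR = 'b', a scrambled string like 'gb' is wiped out entirely, which is where the MIN_LEN_RESULT of 0 came from):

select datalength(replace(replace(replace(
	'gb',		-- scrambled data that genuinely contains a 'g'
	'bb', 'bg'),	-- 1) no doubled 'b', so no change
	'gb', ''),	-- 2) deletes the real 'g' together with the 'b' after it
	'g', ''))	-- 3) nothing left: returns 0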

Corey

Co-worker on children: "...when I have children, I'm going to beat them. Not because they're bad, but because I think it would be fun ..."

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

USA
7020 Posts

Posted - 02/17/2006 :  10:06:41
quote:
Originally posted by Arnold Fribble

Oh, hark, I hear Annie Lennox singing... :þ




Annie Lennox??? I don't get the reference


On a different note, it looks like which method is fastest depends on the content of the data and the length of the runs to be replaced.

Does anyone have a good solution to the problem of picking an unused token character for Corey's method, or do you just have to take your chances? My guess is that searching the data for an unused character would really slow it down.






CODO ERGO SUM

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/17/2006 :  10:41:29
quote:
Originally posted by Michael Valentine Jones
Annie Lennox??? I don't get the reference



"Thorn in my side". I'd imagined that the thorn character had gotten its "Revenge" after my earlier warnings.

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/17/2006 :  11:21:45
quote:

Does anyone have a good solution to the problem of picking an unused token character in Corey's method, or do you just have to take your chances? My guess is that searching the data for an unused character would really slow it down.



I was initially going to suggest to Kristen in that other thread ( http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=61794 ) that CHAR(0) would be a safe choice, but it has certain issues in some collations.

If the column is nvarchar, then any of the 32 codepoints U+FDD0..U+FDEF should be safe: these are defined as noncharacters in Unicode, so they should never be present in stored text (though SQL Server doesn't know that).
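
For example (a sketch; nchar(64976) is U+FDD0):

declare @tok nchar(1)
set @tok = nchar(64976)	-- U+FDD0, a Unicode noncharacter
select replace(replace(replace(
	N'a  b     c',
	N'  ', N' ' + @tok),	-- tag runs of spaces
	@tok + N' ', N''),	-- collapse them
	@tok, N'')		-- tidy up: returns N'a b c'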

Edited by - Arnold Fribble on 02/17/2006 11:28:37