Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 Transact-SQL (2000)
 Eliminating Control Characters

Author  Topic 

per
Starting Member

1 Post

Posted - 2002-04-23 : 12:21:31
I'm using SQL Server 7.0 with SP2 and the XML plug in.

I'm using a ntext column to store some XHTML data and am running into the following.

The following are invalid unicode XML characters:

- ASCII range 0x00-0x19 except 9, 10, and 13
- Unicode 0xFFFD-0xFFFF

Needless to say, when I first inserted text into my table (in a ntext column), I didn't check it for invalid characters. Sigh. Now they're mucking up things when using the XML add on because they won't parse in XML consumers.

So, I'm attempting to fix the data in the DB. The first task is to match the invalid characters. With exceptions, PATINDEX and LIKE will both match control characters if the expression is correct.

To create the proper string:

declare @pos int, @expr nvarchar(100)
set @expr = N''
set @pos = 0
while @pos < 0x20
begin
if (@pos <> 9 and @pos <> 13 and @pos <> 10)
set @expr = @expr + nchar(@pos)
set @pos = @pos + 1
end

set @expr = @expr + nchar(0xFFFD)
set @expr = @expr + nchar(0xFFFE)
-- DOES NOT WORK: set @expr = @expr + nchar(0xFFFF)
set @expr = N'%[' + @expr + N']%'



Then with the given expression, select matches from the column 'body'. Both of the following selects work:


-- Return column id, the index of first control char, the char itself and entire string
select [id], patindex(@expr, [body]) as [index], unicode(substring([body], patindex(@expr, [body]), 1)) as [uni], [body]
from [posttbl]
where 0 < patindex(@expr, [body])

select [id], patindex(@expr, [body]) as [index], unicode(substring([body], patindex(@expr, [body]), 1)) as [uni], [body]
from [posttbl]
where body like @expr



Well, I said they work but there are the following limitations:

- if @expr includes 0xFFFF (as it does not above), then nothing is ever matched. At all.

- on the other hand, any rows that contain only 0x00 are never returned, even though 0x00 is in @expr. In fact, if 'body' ever contains 0x00, then datalength(body) and displaying its value always returns the truncated length.


How can I overcome this problem? I would really love to be able to correct the thousands of records I already have in the database.

Thanks in advance,
Per
   

- Advertisement -