Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 General SQL Server Forums
 Script Library
 Unmangling UTF-8

Author  Topic 

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2006-02-27 : 09:15:49
You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.

Here are two functions that might help you unmangle UTF-8 text that's been loaded into a varchar column. The second one has better error checking, but it's probably rather slow if the number of non-ASCII characters in your column is small.

SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets.
Note that while the input is passed in as a varchar, we are actually
treating each character as if it were a UTF-8 octet. To do this,
the number returned by ASCII() must be the value of the octet.
This, along with the PATINDEX() means that this function will
probably only work if the database collation uses CP1252 for varchar
and the collation on the value being passed in as @s uses CP1252 too.

This function has no error checking and does not detect bogus UTF-8
sequences.
The function fails if the returned UTF-16 would exceed 4000 code units.
*/
CREATE FUNCTION dbo.utf8_to_utf16 (@s varchar(8000))
RETURNS nvarchar(4000)
BEGIN
IF @s IS NULL RETURN NULL
DECLARE @n int, @r nvarchar(4000), @cn int, @octets int, @ch nvarchar(2)
SET @r = N''
WHILE 1 = 1 BEGIN
-- dubious: unexpected octets (0x80-0xBF, 0xF8-0xFF) are treated like 0x00-0x7F
SET @n = PATINDEX('%[À-÷]%', @s COLLATE Latin1_General_bin)
IF @n = 0 BEGIN
SET @r = @r + @s
BREAK
END ELSE BEGIN
SET @r = @r + SUBSTRING(@s, 1, @n-1)
SET @cn = ASCII(SUBSTRING(@s, @n, 1))
IF @cn <= 0xDF BEGIN
SET @octets = 2
SET @ch = NCHAR((@cn & 0x1F) * 0x40 +
(ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F))
END ELSE IF @cn <= 0xEF BEGIN
SET @octets = 3
SET @ch = NCHAR((@cn & 0x0F) * 0x1000 +
(ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x40 +
(ASCII(SUBSTRING(@s, @n+2, 1)) & 0x3F))
END ELSE BEGIN
-- code point in a supplementary plane: output UTF-16 surrogate pair
SET @octets = 4
SET @ch = NCHAR((@cn & 0x07) * 0x100 +
(ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x04 +
(ASCII(SUBSTRING(@s, @n+2, 1)) & 0x30) / 0x10 + 0xD7C0) +
NCHAR((ASCII(SUBSTRING(@s, @n+2, 1)) & 0x0F) * 0x40 +
(ASCII(SUBSTRING(@s, @n+3, 1)) & 0x3F) + 0xDC00)
END
SET @r = @r + @ch
SET @s = SUBSTRING(@s, @n+@octets, 8000)
END
END
RETURN @r
END
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO



SET QUOTED_IDENTIFIER ON
GO
SET ANSI_NULLS ON
GO
/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets.
Note that while the input is passed in as a varchar, we are actually
treating each character as if it were a UTF-8 octet. To do this,
the number returned by ASCII() must be the value of the octet.
This, along with the PATINDEX() means that this function will
probably only work if the database collation uses CP1252 for varchar
and the collation on the value being passed in as @s uses CP1252 too.

This function is based on utf8to32b, pp 543-545 Unicode Demystified
by Richard Gillam (Addison-Wesley Professional, 2002).
Most erroneous UTF-8 octet sequences convert to U+FFFD, but non-shortest
sequences are allowed through.
The function fails if the returned UTF-16 would exceed 4000 code units.
*/
CREATE FUNCTION dbo.utf8_to_utf16a (@s varchar(8000))
RETURNS nvarchar(4000)
BEGIN
DECLARE @states binary(128)
SET @states =
-- 0 1 2 3 4 5 6 7 8 9 A B C D E F101112131415161718191A1B1C1D1E1F
0x00000000000000000000000000000000FFFFFFFFFFFFFFFF01010101020203FF +
0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0000000000000000FEFEFEFEFEFEFEFE +
0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0101010101010101FEFEFEFEFEFEFEFE +
0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0202020202020202FEFEFEFEFEFEFEFE

DECLARE @size int, @out nvarchar(4000), @outp int, @q int, @state int, @mask int
SELECT @size = DATALENGTH(@s), @out = 0x, @outp = 0, @q = 1, @state = 0, @mask = 0

WHILE @q <= @size BEGIN
DECLARE @c int
SELECT @c = ASCII(SUBSTRING(@s, @q, 1)), @q = @q + 1,
@state = CAST(SUBSTRING(@states, 1 + @state*32 + @c/8, 1) AS int)
IF @state = 0 BEGIN
SET @outp = @outp + (@c & 0x7F)
IF @outp <= 0xFFFF
SET @out = @out + NCHAR(@outp)
ELSE
SET @out = @out + NCHAR(@outp / 0x400 + 0xD7C0) + NCHAR((@outp & 0x3FF) + 0xDC00)
SELECT @outp = 0, @mask = 0
END ELSE IF @state <= 3 BEGIN
IF @mask = 0 SET @mask = CAST(SUBSTRING(0x1F0F07, @state, 1) AS int)
SELECT @outp = (@outp + (@c & @mask)) * 0x40, @mask = 0x3F
END ELSE BEGIN
IF @state = 0xFE SET @q = @q - 1
SELECT @out = @out + NCHAR(0xFFFD), @outp = 0, @state = 0, @mask = 0
END
END
-- end of input is incomplete sequence:
IF @mask <> 0 SELECT @out = @out + NCHAR(0xFFFD)
RETURN @out
END
GO
SET QUOTED_IDENTIFIER OFF
GO
SET ANSI_NULLS ON
GO

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2006-03-07 : 15:20:51
quote:
You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.


The forgotten proc "utf8_to_utf16" to the rescue
Amazing stuff! not surprised it came from you Arnold.

____________
Back to work

rockmoose
Go to Top of Page

mateuszh
Starting Member

1 Post

Posted - 2013-01-29 : 02:47:33
hello,

I looking for sql function convert utf16 to utf8 . mybe you have it or algorithm convert? ;)
Go to Top of Page
   

- Advertisement -