SQL Server Forums
Profile | Register | Active Topics | Members | Search | Forum FAQ
 
Register Now and get your question answered!
Username:
Password:
Save Password
Forgot your Password?

 All Forums
 General SQL Server Forums
 Script Library
 Unmangling UTF-8
 New Topic  Reply to Topic
 Printer Friendly
Author Previous Topic Topic Next Topic  

Arnold Fribble
Yak-finder General

United Kingdom
1961 Posts

Posted - 02/27/2006 :  09:15:49  Show Profile  Reply with Quote
You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.

Here are two functions that might help you unmangle UTF-8 text that's been loaded into a varchar column. The second one has better error checking, but it's probably rather slow if the number of non-ASCII characters in your column is small.

SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO
/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets.
   Note that while the input is passed in as a varchar, we are actually
   treating each character as if it were a UTF-8 octet. To do this,
   the number returned by ASCII() must be the value of the octet.
   This, along with the PATINDEX() means that this function will
   probably only work if the database collation uses CP1252 for varchar
   and the collation on the value being passed in as @s uses CP1252 too.

   This function has no error checking and does not detect bogus UTF-8
   sequences.
   The function fails if the returned UTF-16 would exceed 4000 code units.
*/
CREATE FUNCTION dbo.utf8_to_utf16 (@s varchar(8000))
RETURNS nvarchar(4000)
BEGIN
  IF @s IS NULL RETURN NULL
  DECLARE @n int, @r nvarchar(4000), @cn int, @octets int, @ch nvarchar(2)
  SET @r = N''
  WHILE 1 = 1 BEGIN
    -- dubious: unexpected octets (0x80-0xBF, 0xF8-0xFF) are treated like 0x00-0x7F
    SET @n = PATINDEX('%[À-÷]%', @s COLLATE Latin1_General_bin)
    IF @n = 0 BEGIN
      SET @r = @r + @s
      BREAK
    END ELSE BEGIN
      SET @r = @r + SUBSTRING(@s, 1, @n-1)
      SET @cn = ASCII(SUBSTRING(@s, @n, 1))
      IF @cn <= 0xDF BEGIN
        SET @octets = 2
        SET @ch = NCHAR((@cn & 0x1F) * 0x40 +
          (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F))
      END ELSE IF @cn <= 0xEF BEGIN
        SET @octets = 3
        SET @ch = NCHAR((@cn & 0x0F) * 0x1000 +
          (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x40 +
          (ASCII(SUBSTRING(@s, @n+2, 1)) & 0x3F))
      END ELSE BEGIN
        -- code point in a supplementary plane: output UTF-16 surrogate pair
        SET @octets = 4
        SET @ch = NCHAR((@cn & 0x07) * 0x100 + 
            (ASCII(SUBSTRING(@s, @n+1, 1)) & 0x3F) * 0x04 +
            (ASCII(SUBSTRING(@s, @n+2, 1)) & 0x30) / 0x10 + 0xD7C0) +
          NCHAR((ASCII(SUBSTRING(@s, @n+2, 1)) & 0x0F) * 0x40 +
            (ASCII(SUBSTRING(@s, @n+3, 1)) & 0x3F) + 0xDC00)
      END
      SET @r = @r + @ch
      SET @s = SUBSTRING(@s, @n+@octets, 8000)
    END
  END
  RETURN @r
END
GO
SET QUOTED_IDENTIFIER OFF 
GO
SET ANSI_NULLS ON 
GO



SET QUOTED_IDENTIFIER ON 
GO
SET ANSI_NULLS ON 
GO
/* Returns UTF-16 coded nvarchar string from a string of UTF-8 octets.
   Note that while the input is passed in as a varchar, we are actually
   treating each character as if it were a UTF-8 octet. To do this,
   the number returned by ASCII() must be the value of the octet.
   This, along with the PATINDEX() means that this function will
   probably only work if the database collation uses CP1252 for varchar
   and the collation on the value being passed in as @s uses CP1252 too.

   This function is based on utf8to32b, pp 543-545 Unicode Demystified
   by Richard Gillam (Addison-Wesley Professional, 2002).
   Most erroneous UTF-8 octet sequences convert to U+FFFD, but non-shortest
   sequences are allowed through.
   The function fails if the returned UTF-16 would exceed 4000 code units.
*/
CREATE FUNCTION dbo.utf8_to_utf16a (@s varchar(8000))
RETURNS nvarchar(4000)
BEGIN
  DECLARE @states binary(128)
  SET @states =
    -- 0 1 2 3 4 5 6 7 8 9 A B C D E F101112131415161718191A1B1C1D1E1F
    0x00000000000000000000000000000000FFFFFFFFFFFFFFFF01010101020203FF +
    0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0000000000000000FEFEFEFEFEFEFEFE +
    0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0101010101010101FEFEFEFEFEFEFEFE +
    0xFEFEFEFEFEFEFEFEFEFEFEFEFEFEFEFE0202020202020202FEFEFEFEFEFEFEFE

  DECLARE @size int, @out nvarchar(4000), @outp int, @q int, @state int, @mask int
  SELECT @size = DATALENGTH(@s), @out = 0x, @outp = 0, @q = 1, @state = 0, @mask = 0

  WHILE @q <= @size BEGIN
    DECLARE @c int
    SELECT @c = ASCII(SUBSTRING(@s, @q, 1)), @q = @q + 1,
      @state = CAST(SUBSTRING(@states, 1 + @state*32 + @c/8, 1) AS int)
    IF @state = 0 BEGIN
      SET @outp = @outp + (@c & 0x7F)
      IF @outp <= 0xFFFF
        SET @out = @out + NCHAR(@outp)
      ELSE
        SET @out = @out + NCHAR(@outp / 0x400 + 0xD7C0) + NCHAR((@outp & 0x3FF) + 0xDC00)
      SELECT @outp = 0, @mask = 0
    END ELSE IF @state <= 3 BEGIN
      IF @mask = 0 SET @mask = CAST(SUBSTRING(0x1F0F07, @state, 1) AS int)
      SELECT @outp = (@outp + (@c & @mask)) * 0x40, @mask = 0x3F
    END ELSE BEGIN
      IF @state = 0xFE SET @q = @q - 1
      SELECT @out = @out + NCHAR(0xFFFD), @outp = 0, @state = 0, @mask = 0
    END
  END
  -- end of input is incomplete sequence:
  IF @mask <> 0 SELECT @out = @out + NCHAR(0xFFFD)
  RETURN @out
END
GO
SET QUOTED_IDENTIFIER OFF 
GO
SET ANSI_NULLS ON 
GO

Edited by - Arnold Fribble on 02/27/2006 09:17:32

rockmoose
SQL Natt Alfen

Sweden
3279 Posts

Posted - 03/07/2006 :  15:20:51  Show Profile  Reply with Quote
quote:
You know how it is... you thought the text in the file you bulk inserted was CP-1252. You start using it, and after a while wonder why it seems to have a few bits of garbage in it, and eventually realize that the file you loaded was actually UTF-8.


The forgotten proc "utf8_to_utf16" to the rescue
Amazing stuff! not surprised it came from you Arnold.

____________
Back to work

rockmoose
Go to Top of Page

mateuszh
Starting Member

1 Posts

Posted - 01/29/2013 :  02:47:33  Show Profile  Reply with Quote
hello,

I looking for sql function convert utf16 to utf8 . mybe you have it or algorithm convert? ;)
Go to Top of Page
  Previous Topic Topic Next Topic  
 New Topic  Reply to Topic
 Printer Friendly
Jump To:
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC Go To Top Of Page
This page was generated in 0.5 seconds. Powered By: Snitz Forums 2000