Unicode String Comparison - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Unicode String Comparison

Author

Topic

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2007-12-12 : 13:30:55

Hello,

The issue I'm dealing with is the default way SQL handles string comparsion. So, I'm getting a lot of strings that are "equal" to SQL but really are different string. The only way I know how to make comparison "correct" is to CAST the string to VARBINARY. So, I was wondering if anyone might have any other ways to compare unicode strings?

Here is a sample of two string that I'm dealing with and some comparisons:

DECLARE @T TABLE (ID INT IDENTITY(1,1), Val NVARCHAR(50))

INSERT @T (Val) VALUES(NCHAR(108) + NCHAR(32) + NCHAR(1282) + NCHAR(782))
INSERT @T (Val) VALUES(NCHAR(108) + NCHAR(32) + NCHAR(1282) + NCHAR(782) + NCHAR(1303))

IF (SELECT Val FROM @T WHERE ID = 1) = (SELECT Val FROM @T WHERE ID = 2)
	PRINT 'EQUAL'
ELSE
	PRINT 'NOT EQUAL'

IF (SELECT Val FROM @T WHERE ID = 1) LIKE (SELECT Val FROM @T WHERE ID = 2)
	PRINT 'LIKE'
ELSE
	PRINT 'NOT LIKE'

IF (SELECT DATALENGTH(Val) FROM @T WHERE ID = 1) = (SELECT DATALENGTH(Val) FROM @T WHERE ID = 2)
	PRINT 'LENGTH EQUAL'
ELSE
	PRINT 'LENGTH NOT EQUAL'

IF (SELECT CAST(Val AS VARBINARY) FROM @T WHERE ID = 1) = (SELECT CAST(Val AS VARBINARY) FROM @T WHERE ID = 2)
	PRINT 'EQUAL'
ELSE
	PRINT 'NOT EQUAL'

EDIT: Forgot to mention that I have tried several different COLLATIONs and that didn't seem to work either.

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2007-12-12 : 13:49:59

try with Latin1_General_BIN collation

_______________________________________________
Causing trouble since 1980
blog: http://weblogs.sqlteam.com/mladenp
SSMS Add-in that does a few things: www.ssmstoolspack.com <- new version out

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2007-12-12 : 14:06:17

Thanks Spirit, That does work. Unfortunatly, it has the same affect as casting to VARBINARY, cannot use the index.

The table is about 270 million rows. Does anyone have any suggestions for how to get Latin1_General_BIN collation in there? Some possbilities off the top of my head are: add a new column with Latin1_General_BIN collation, add a derived (persisted?) column with Latin1_General_BIN collation or change over the offending column to be Latin1_General_BIN collation.

Cheers!

spirit1
Cybernetic Yak Master

11752 Posts

Posted - 2007-12-12 : 14:21:07

why can't you change the column datatype to different collation?

_______________________________________________
Causing trouble since 1980
blog: http://weblogs.sqlteam.com/mladenp
SSMS Add-in that does a few things: www.ssmstoolspack.com <- new version out

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2007-12-12 : 14:48:53

I might be able to do that. I'll have to some impact analysis.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2007-12-12 : 18:02:33

That's a pretty bizarre example: codepoint U+0517 (= 1303 decimal) isn't even assigned to a character until Unicode 5.1.0, and that's still in beta.

0517;CYRILLIC SMALL LETTER RHA;Ll;0;L;;;;;N;;;0516;;0516

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2007-12-12 : 18:40:43

Unfortunately, I'm not incredibly versed on Unicode. We seem to have a lot of high numbered characters that come in from other languages (Japanese and Chinese being the most prevalent). In order to parse out the individual characters I looped through the string and cast the character to UNICODE: UNICODE(SUBSTRING(@String, @Pos, 1))). That is where I got the 1303 from. There are plenty of others that are 12500+ codes.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources