Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Performance on string query

Author  Topic 

iulianionescu
Starting Member

14 Posts

Posted - 2010-01-24 : 14:25:30
Hi, I have a question regarding a query on strings. I have 2 tables, one with states and one with cities. They are linked via a third table called Zips. I need to retrieve a list of all the cities in a state that start with a certain letter. I was wondering if there is any best practice in regards to the query. Initially this was the query:

SELECT DISTINCT Zip_Cities.CityID, Zip_Cities.CityName, Zip_Counties.CountyID, Zip_Counties.CountyName
FROM Zip_Zips INNER JOIN Zip_Cities ON Zip_Zips.CityID = Zip_Cities.CityID INNER JOIN Zip_Counties ON Zip_Zips.CountyID = Zip_Counties.CountyID
WHERE (Zip_Zips.StateID = @StateID) AND (SUBSTRING(Zip_Cities.CityName, 1, 1) = @Letter)
ORDER BY Zip_Cities.CityName

The part that I am interested in is the bold. I had the query like this as well:

[...]
(Zip_Cities.CityName LIKE @Letter + '%')
[...]

I have a non-clustered index on the CityName field. Is there a better approach, like using CHARINDEX, maybe? But mostly, between the two alternatives which one is the best? Using SQL Profiler I noticed that the duration seems to be slightly longer on the second one, but the CPU seems to be slightly higher on the first one...

Any hints would be appreciated,

Iulian




Regards,

Iulian

Kristen
Test

22859 Posts

Posted - 2010-01-24 : 15:33:59
I don't think

(SUBSTRING(Zip_Cities.CityName, 1, 1) = @Letter)

will use an index in CityName (you'd have to check the Query plan to be sure, but in general if you wrap a column in a Function then indexes are not used)

Whereas

(Zip_Cities.CityName LIKE @Letter + '%')

should use the index - provided that the RegEx pattern does NOT start with a wildcard.

There is a slight risk that parameter sniffing means that the actual value of @Letter cannot be determined when the Query Plan is made, and thus the index is not used. Again, you'd need to check the query plan to be certain.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-01-24 : 15:35:51
Comment in the SHOWPLAN to see the Query Plan, OR the STATISTICS to see the logical stats

-- SET STATISTICS PROFILE ON; SET SHOWPLAN_TEXT ON -- If Temp Tables involved
-- SET SHOWPLAN_TEXT ON
GO
-- SET STATISTICS IO ON; SET STATISTICS TIME ON

... put your query here ...

SET STATISTICS IO OFF; SET STATISTICS TIME OFF
GO
SET SHOWPLAN_TEXT OFF
GO

The Logical Stats will probably tell you whether an Index is being used, or not.
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-01-24 : 15:46:54
P.S. Just looked at it again:

WHERE (Zip_Zips.StateID = @StateID) AND (SUBSTRING(Zip_Cities.CityName, 1, 1) = @Letter)

can benefit from Index on StateID and CityName. If you have two indexes, one for each, that's probably fine.

Use of an index will depend on the selectivity of the values. If there are very few distinct values for StateID, and lots of entries for each value, then it makes a poor index choice.

Consider an index on a Yes/No column. If 99% of the rows are "Yes" then the index if you query is trying to find the "No"s. If the Yes/No ration is between 75/25 and 50/50 the index won't be used at all. Perhaps it won't even be used at 80/20 - I don't know the exact way it works.

You can also make an index "cover" the query; a single index on StateID, CityName can be used to "cover" the WHERE clause. Then you have to choose which order you put the columns (in the index). StateID, CityName will also be used on queries just for StateID; CityName, StateID will also be used for queries just on CityName. So if you have lots of either of those that may make the decision for you. Also, if one of those two columns is more "selective" than the other, then putting that first helps encourage SQL to choose that index .

You can stick additional columns in the index that have lousy selectivty, but cover the query. For example, we have lots of tables with an "IsActive" column. Its just a Yes/No BIT. Most of our queries have "WHERE xxx = yyy AND IsActive = 1" - so sticking IsActive as the key in the index covers the query

You can also put columns in the index to cover the SELECT as well. But obviously putting ALL the columns in the table IN the index is counter productive.

But if you have a very frequent query, particularly on a large table, it can help:

SELECT CityID
FROM ...
WHERE StateID = @StateID AND CityName = @CityName

if this was frequent then an index with keys StateID, CityName, CityID (CityID needs to be LAST!) would mean that the Index had "covered" the whole query
Go to Top of Page

iulianionescu
Starting Member

14 Posts

Posted - 2010-01-25 : 10:55:48
Thanks for all the help!!

I set the showplan on and this is what I got:

FOR: (SUBSTRING(Zip_Cities.CityName, 1, 1) = @Letter)

|--Sort(DISTINCT ORDER BY:([RentalsDB].[dbo].[Zip_Cities].[CityName] ASC, [RentalsDB].[dbo].[Zip_Cities].[CityID] ASC, [RentalsDB].[dbo].[Zip_Counties].[CountyID] ASC))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Zips].[CountyID])=([RentalsDB].[dbo].[Zip_Counties].[CountyID]))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Cities].[CityID])=([RentalsDB].[dbo].[Zip_Zips].[CityID]))
| |--Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), WHERE:(substring([RentalsDB].[dbo].[Zip_Cities].[CityName],(1),(1))=[@Letter]))
| |--Clustered Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Zips].[PK_Zip_Zips]), WHERE:([RentalsDB].[dbo].[Zip_Zips].[StateID]=[@StateID]))
|--Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Counties].[IX_Zip_Counties]))

FOR: (Zip_Cities.CityName LIKE @Letter + '%')

|--Sort(DISTINCT ORDER BY:([RentalsDB].[dbo].[Zip_Cities].[CityName] ASC, [RentalsDB].[dbo].[Zip_Cities].[CityID] ASC, [RentalsDB].[dbo].[Zip_Counties].[CountyID] ASC))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Zips].[CountyID])=([RentalsDB].[dbo].[Zip_Counties].[CountyID]))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Cities].[CityID])=([RentalsDB].[dbo].[Zip_Zips].[CityID]))
| |--Nested Loops(Inner Join, OUTER REFERENCES:([Expr1013], [Expr1014], [Expr1015]))
| | |--Compute Scalar(DEFINE:([Expr1013]=LikeRangeStart([@Letter]+'%'), [Expr1014]=LikeRangeEnd([@Letter]+'%'), [Expr1015]=LikeRangeInfo([@Letter]+'%')))
| | | |--Constant Scan
| | |--Index Seek(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), SEEK:([RentalsDB].[dbo].[Zip_Cities].[CityName] > [Expr1013] AND [RentalsDB].[dbo].[Zip_Cities].[CityName] < [Expr1014]), WHERE:([RentalsDB].[dbo].[Zip_Cities].[CityName] like [@Letter]+'%') ORDERED FORWARD)
| |--Clustered Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Zips].[PK_Zip_Zips]), WHERE:([RentalsDB].[dbo].[Zip_Zips].[StateID]=[@StateID]))
|--Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Counties].[IX_Zip_Counties]))

AND FOR: (Zip_Cities.CityName BETWEEN @Letter AND @Letter + 'ZZZZZ')

|--Sort(DISTINCT ORDER BY:([RentalsDB].[dbo].[Zip_Cities].[CityName] ASC, [RentalsDB].[dbo].[Zip_Cities].[CityID] ASC, [RentalsDB].[dbo].[Zip_Counties].[CountyID] ASC))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Zips].[CountyID])=([RentalsDB].[dbo].[Zip_Counties].[CountyID]))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Cities].[CityID])=([RentalsDB].[dbo].[Zip_Zips].[CityID]))
| |--Index Seek(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), SEEK:([RentalsDB].[dbo].[Zip_Cities].[CityName] >= [@Letter] AND [RentalsDB].[dbo].[Zip_Cities].[CityName] <= [@Letter]+'ZZZZZ') ORDERED FORWARD)
| |--Clustered Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Zips].[PK_Zip_Zips]), WHERE:([RentalsDB].[dbo].[Zip_Zips].[StateID]=[@StateID]))
|--Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Counties].[IX_Zip_Counties]))


Based on what I see the LIKE @Letter + '%' does exactly what I did in the last one by creating some expressions and then returning the city name that fits. So, I am thinking that the last might be the best. It looks like the seek is made on the index. I am not exactly sure about the difference between INDEX SEEK and INDEX SCAN though...

All 4 tables (cities, states, counties and zips) are quite static, very rarely will I see something changed (unless we conquer a country or something :-))

Iulian


Regards,

Iulian
Go to Top of Page

Transact Charlie
Master Smack Fu Yak Hacker

3451 Posts

Posted - 2010-01-25 : 11:16:06
index seek is good and index scan is bad (generally)

Consider an index seek as looking up a dictionary knowing the word you were thinking of starts with an S, then a T and then an I. straight away you've got to the right page (probably). Seeking!

An index scan is like knowing the word has an S and and T and an I in it. looks like you are going to have to check all the pages of the dictionary to find all the definitions that it could be. Scanning!

In a dictionary the words in bold are like the index and the definitions are like data.


Charlie
===============================================================
Msg 3903, Level 16, State 1, Line 1736
The ROLLBACK TRANSACTION request has no corresponding BEGIN TRANSACTION
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-01-25 : 14:25:10
As T.C. says:

Table Scan and Index Scan "thumb through all the pages" looking for the entry.

Index Seek goes straight to the entry(ies) - and thus is better.

FOR: (SUBSTRING(Zip_Cities.CityName, 1, 1) = @Letter)

| |--Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), WHERE:(substring([RentalsDB].[dbo].[Zip_Cities].[CityName],(1),(1))=[@Letter]))

Scan the index IX_Zip_Cities using the WHERE clause expression. Because the function "substring()" is involved this is a Scan [all entries in the index are read, and tested] rather than a Seek.

Then merge that with the results of:

| |--Clustered Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Zips].[PK_Zip_Zips]), WHERE:([RentalsDB].[dbo].[Zip_Zips].[StateID]=[@StateID]))

This Scans the index (the clustered index) PK_Zip_Zips to find WHERE [StateID]=[@StateID]. Presumably PK_Zip_Zips index has nothing to do with StateID, and is being used to traverse the whole table (i.e. because it is a clustered index / PK) in order to find matching StateID. So it looks like a specific index on StateID would help - and SQL should use it instead of the PK Scan (I presume you have not got a StateID index currently ...)

FOR: (Zip_Cities.CityName LIKE @Letter + '%')

| | |--Index Seek(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), SEEK:([RentalsDB].[dbo].[Zip_Cities].[CityName] > [Expr1013] AND [RentalsDB].[dbo].[Zip_Cities].[CityName] < [Expr1014]), WHERE:([RentalsDB].[dbo].[Zip_Cities].[CityName] like [@Letter]+'%') ORDERED FORWARD)

Using the index IX_Zip_Cities find entries WHERE CityName > [Expr1013]=LikeRangeStart([@Letter]+'%') - LikeRangeStart() sorts out the lowest possible value for [@Letter]+'%' - AND CityName [< Expr1014]=LikeRangeEnd([@Letter]+'%') - ditto LikeRangeEnd() evaluates the highest possible value for [@Letter]+'%'

A SEEK is then done BETWEEN these limits WHERE [CityName] like [@Letter]+'%'

Maximum efficiency

| |--Clustered Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Zips].[PK_Zip_Zips]), WHERE:([RentalsDB].[dbo].[Zip_Zips].[StateID]=[@StateID]))

Still doing a Scan of the whole table using PK_Zip_Zips to try to find WHERE StateID = @StateID

AND FOR: (Zip_Cities.CityName BETWEEN @Letter AND @Letter + 'ZZZZZ')

Pretty much exactly the same thing:

| |--Index Seek(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), SEEK:([RentalsDB].[dbo].[Zip_Cities].[CityName] >= [@Letter] AND [RentalsDB].[dbo].[Zip_Cities].[CityName] <= [@Letter]+'ZZZZZ') ORDERED FORWARD)

Index Seek on IX_Zip_Cities WHERE CityName BETWEEN @Letter AND @Letter+'ZZZZZ'
Go to Top of Page

iulianionescu
Starting Member

14 Posts

Posted - 2010-01-25 : 15:14:17
Looks like I am almost at the bottom of it, thanks a lot! I added a few indexes that made it better. Just one more curiousity, the way the tables are designed is:

zips: zip, stateID, countyID, cityID
states: stateID, stateName
counties: countyID, countyName
cities: cityID, cityName

on Zip I have the proper relationship to each table plus these indexes:
(zip, stateID)
(zip, countyID)
(zip, cityID)
(zip, stateID, countyID, cityID)

However, in the execution plan I notice that the joining with the counties table is done with the Scan while the others is done with a Seek:

|--Sort(DISTINCT ORDER BY:([RentalsDB].[dbo].[Zip_Cities].[CityName] ASC, [RentalsDB].[dbo].[Zip_Cities].[CityID] ASC, [RentalsDB].[dbo].[Zip_Counties].[CountyID] ASC))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Zips].[CountyID])=([RentalsDB].[dbo].[Zip_Counties].[CountyID]))
|--Hash Match(Inner Join, HASH:([RentalsDB].[dbo].[Zip_Zips].[CityID])=([RentalsDB].[dbo].[Zip_Cities].[CityID]))
| |--Index Seek(OBJECT:([RentalsDB].[dbo].[Zip_Zips].[IX_Zips_ZipStateCountyCity]), SEEK:([RentalsDB].[dbo].[Zip_Zips].[StateID]=[@StateID]) ORDERED FORWARD)
| |--Index Seek(OBJECT:([RentalsDB].[dbo].[Zip_Cities].[IX_Zip_Cities]), SEEK:([RentalsDB].[dbo].[Zip_Cities].[CityName] >= [@Letter] AND [RentalsDB].[dbo].[Zip_Cities].[CityName] <= [@Letter]+'ZZZZZ') ORDERED FORWARD)
|--Index Scan(OBJECT:([RentalsDB].[dbo].[Zip_Counties].[IX_Zip_Counties]))

It has a 7% cost. In any case, it works much better now :-)

Regards,

Iulian


Regards,

Iulian
Go to Top of Page

Kristen
Test

22859 Posts

Posted - 2010-01-26 : 02:47:25
Looks like SQL has choosen IX_Zips_ZipStateCountyCity for the query (dunno why, if ZIP is the really first key field??) to get the StateID=@StateID - that would only make sense, to me, if StateID was the first key field. If that was the case choosing that index would have the additional benefit that it "covers" the cityID and countyID for the JOINs to Zip_Cities and Zip_Counties.

If ZIP is the Clustered PK on [zips] table I don't think there is any benefit in putting that in the secondary indexes, because SQL will add it anyway (its got to find the ZIP record once its looked up the CityID (or whatever the secondary index is for), so it added the PK columns on the end (maybe SQL is smart enough to not do that if the secondary index already contains that column).

Might have something to do with the ORDER BY too. Does your query have a DISTINCT in it? Maybe SQL is short-circuiting deletion of duplicates by using that index.
Go to Top of Page
   

- Advertisement -