Help! Joining Very Large against Small Table

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2008 Forums

Transact-SQL (2008)

Help! Joining Very Large against Small Table

Author

Topic

muratos
Starting Member

22 Posts

Posted - 2011-07-06 : 09:12:09

Hi,

We try to join two tables and in no way make it very fast. table_a has around 35 million rows. table_b has just 25. The basic structure as follows and the query is very simple.

select ID,hotel from table_a inner join table_b on table_a.accID=table_b.accID where
price<1000 and price>100
flightdate<'20110930' and flightdate>'20110901'
order by price.

table_a
ID int
hotel int
accID int
price decimal(9,2)
flightdate date

table_b
accID int

table_a has composite clustered index on price, accID fields.
table_b has on accID.

table_a has nonclustered index on price, flightdate and other fields are in INCLUDE.

PROBLEM is:

It is rather slow with order by price on accID join.
If we change the clustered index to accID, price ; this time the join with other table becomes faster but order by price becomes real slow.

The execution plans are always done by Index Scans or Seeks according to index force. But, still very slow.

Is there something we can change or better make?

p.s : It is stated on forums that we should always include order by in the queries and not trust clustered indexes eventhough it is indexed according to this field. If not applied, the engine can bring different rows each time according to execution plan choice.

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 09:19:44

How selective is accID in table_a
How big is the resultset?
You need an index on table_a.accID
If it is selective then it can just be that column as it will use the clustered index to get the other columns which should be quick anyway.

I would consider clustering on table_a.ID for this table as it will make other indexes more efficient.

AccID include hotel price flightdate

It also depends on how many rows are returned from the filter on table_a
==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 09:37:59

table_a - it gives the table structures.
I'm guessing that table_b is just to give the accID to return.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

muratos
Starting Member

22 Posts

Posted - 2011-07-06 : 09:40:24

Thanks for the help :)

accID has distinct 1400 values. I think it is not selective when we consider 35m rows, right?

We need only 100 rows but we have to order all according to price to get the cheapest ones at first. (actually, we will use row_number over(order by price) as well for web sorting, but didn't come to that level yet

)

price and flightdate field filters can even cover all table rows according to parameter search usage. So, no input range estimates here.

as far as I understand, your offer is to use
table_a.ID as clustered index and define table_a.accID as nonclustered one with INCLUDE hotel price flightdate.

Will it not make order by price troublesome?

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 09:50:31

The order by should not be a problem as you only have to sort your 100 row resultset.

select ID,hotel
from table_a
join table_b
on table_a.accID=table_b.accID
where price<1000
and price>100
and flightdate<'20110930'
and flightdate>'20110901'

ignore the order by for the moment
what do these return
select count(*)
from table_a
join table_b
on table_a.accID=table_b.accID

select count(*)
from table_a
where price<1000
and price>100
and flightdate<'20110930'
and flightdate>'20110901'

select count(*)
from table_a
where price<1000
and price>100

select count(*)
from table_a
where flightdate<'20110930'
and flightdate>'20110901'

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

Sachin.Nand

2937 Posts

Posted - 2011-07-06 : 09:50:52

What I would suggest is

1) Create a Clustered Index on accID for table_a

2)Have non clustered index on price and flightdate which covers ID,hotel

3)Create a non clustered Index on accID for table_b

Looks like the order by for the column "price" might not be getting enough memory and spilling over to tempdb.Use the profiler and just select the Warning options so that it only shows the sort spills onto the tempdb while the query is running.

Also can you post the query plan ?

PBUH

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 10:00:01

>> 1) Create a Clustered Index on accID for table_a
????

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

Sachin.Nand

2937 Posts

Posted - 2011-07-06 : 10:08:03

Because it will force the optimiser to go for a merge join on the inner table which will sort the data for the inner table.
Does not make sense for keeping ID as a clustered index as it is no where being referenced in the above query.

PBUH

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 10:12:57

accID only has 1400 unique values so it's not clear whether the join to table_a is very selective or useful.
Clustering on this would mean that a rowid needs to be includied in all other indexes making thyem less efficient and wasting space.

Could be that the filters on table_a reduce the resultset and this is what needs to be optimised - in any case Iit's not cloear whether the join is actually an issue.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

muratos
Starting Member

22 Posts

Posted - 2011-07-06 : 10:14:23

@nigelrivett

Actually, I should sort first and get the cheapest 100 rows from 35m rows. Getting any 100 rows and sorting them is not what I want.

I will send the counts now.

@Sachin.Nand

I have simplified the tables here. But, I will get query plan from the real tables and send anyway.

@nigelrivett and @Sachin.Nand

I will apply your index suggestions today and send the results.

p.s: Index changes take too much time

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 10:20:28

Putting the clustered index on ID will make it faster to create non-clustered indexes as well as making them more efficient.

>> Actually, I should sort first and get the cheapest 100 rows from 35m rows. Getting any 100 rows and sorting them is not what I want.
That's not what you posted
Do you really have

select top 100 ID,hotel
from table_a
join table_b
on table_a.accID=table_b.accID
where price<1000
and price>100
and flightdate<'20110930'
and flightdate>'20110901'
order by price

If so then how many rows are you sorting to get the result - i.e. how many rows without the top 100?
Might be better to think of it (and maybe code it) as

select top 100 ID,hotel
from table_a
where price<1000 and price>100
and flightdate<'20110930' and flightdate>'20110901'
and accID in (select accID from table_b)
order by price

Note your query will need to get ID, price order them get the top 100 then get the data for that 100 ID's (that's if you cluster on ID).
You can then build non-clustered indexes to suit the filter and include price.
To design the non-clustered indexes we need to know the number of rows returned by the filters.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

muratos
Starting Member

22 Posts

Posted - 2011-07-06 : 10:34:23

The inner join returned 800K rows count. So, I should order them by price and get the top 100.

Unfortunately, we can not guess the input for price and flightdate.
As I stated, a user can select a month or a day for flightdate range or any price. p.s : This is a monthly table.

So, this query would return all rows from the table

select count(*)
from table_a
where flightdate<'20110930'
and flightdate>'20110901'

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 10:52:42

In that case I would consider clustered index on ID, non_clustered on ACCID including flightdate, price

then try
select top 100 id, price
from table_a
where accID in (select accID from table_b)
order by price

Check the query plan to make sure it is using the index and doing a lookup on ACCID followed by a filter
and see how long that takes - that should be close to the maximum time taken as the filters will be on that resultset using data included in the index.
If that's ok it's just a matter of coding so that the index is used.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

muratos
Starting Member

22 Posts

Posted - 2011-07-06 : 11:32:33

@nigelrivett

You are right. Now it is much faster under 1 sec, with order by price. (without price, filters yet)

table_a (35m rows) uses nonclustered index seek. 8%
table_b (25 rows) uses clustered index scan. 0%
then
nested loops (inner join) 4%
sort (top n sort) 88%

I have created ID as clustered and accID as nonclustered with all other columns in INCLUDE.

I will check how where filters affect the results

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 11:37:30

If they cause an issue you can separate the query into parts force the plan as you woulk like.
Is this part of a stored procedure? It looks like something that feeds a report - an SP makes things a lot easier to handle.

The saving is probably because it was previously trying to sort a wide resultset - now it just has price and id.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

muratos
Starting Member

22 Posts

Posted - 2011-07-06 : 11:56:44

Yes, it is in an sp. I think we will dynamically force plans according to the filter input range. It looks like it makes changes in execution times as well.

nigelrivett
Master Smack Fu Yak Hacker

3385 Posts

Posted - 2011-07-06 : 12:41:48

I would log the start and end of the sp and parameters so you can see if there are any problems and if they correspond to different parameters or other things going on at the time.

==========================================
Cursors are useful if you don't know sql.
SSIS can be used in a similar way.
Beer is not cold and it isn't fizzy.

ankie
Starting Member

2 Posts

Posted - 2011-07-06 : 22:58:07

Hi,

I suspect the cause of the slowness is due to the order of your sql criteria. try this instead.

select ID,hotel from table_a inner join table_b on table_a.accID=table_b.accID
where flightdate>'20110901' and flightdate<'20110930'
and price>100 and price<1000
order by price

or this:
select ID,hotel from table_a inner join table_b on table_a.accID=table_b.accID
where price>100 and price<1000
and flightdate>'20110901' and flightdate<'20110930'
order by price

muratos
Starting Member

22 Posts

Posted - 2011-07-07 : 02:51:37

@ankie

Sql engine decides which order to follow according to statistics no matter how we write them. Although it is possible to force order by a hint, not recommended.

OPTION (FORCE ORDER)

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources