Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2008 Forums
 Transact-SQL (2008)
 Help Speeding up a count(*) query

Author  Topic 

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-13 : 20:45:36
Hey guys, I'm trying to scale my DB server outward a little,
however I'm still not getting the performance increase I was
hoping for.

basically I have one giant table (approx 200 columns wide, about
220 million rows; the database is only updated quarterly) that
I have split into 10 smaller tables linked on a UID column.

I'm basically writing a web interface where users can go to a site,
punch in criteria, and the server generates a dynamic query and
submits it to the SQL DB for results. problem is, on large
universes the server is taking too long. for example, this is a
query that counts over about 67 million rows; the server takes 6
minutes to finish, and I need it in 30 sec or less... anything I can
do to get the counting to run faster?


SELECT state, count(*)
FROM (
    SELECT m.state,
           row_number() OVER (PARTITION BY phone ORDER BY m.uid DESC) AS recid
    FROM TEST.dbo.MAIN m
    JOIN TEST.dbo.CONTACT c ON m.uid = c.uid
    WHERE 1 = 1
      AND c.Name_Type = 'P'
) AS f
WHERE recid = 1
GROUP BY state
ORDER BY state

also worth mentioning, our database has more than 1 record per
household, hence the partitioning by phone (thanks to PESO for that)

Thanks in advance for all your help, you guys always seem to help me out of a bind!

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-09-13 : 23:25:37
What indexes do you have on the tables?

Are you saying it outputs 67 million rows?

Tara Kizer
Microsoft MVP for Windows Server System - SQL Server
http://weblogs.sqlteam.com/tarad/

Subscribe to my blog

"Let's begin with the premise that everything you've done up until this point is wrong."

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-13 : 23:48:51
I have this thing indexed every which way. it's on a 3TB array, so I'm not really worried about wasting space on indexes, and because there are so many different combos our reps can query the database with, I basically capture everything that runs through the site and run it through DTA on a weekly basis... (honestly, I'm not quite that good at writing my own indexes yet; I don't quite understand all the fine-tuning ins and outs of setting up indexing)

and no, it outputs 52 records, with a count for each state, totaling up to 67 million. the SORT cost on this query, for example, is something in the area of 64%. I forget how to export the execution plan to text or I would post that for you as well :(


albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-13 : 23:51:44
btw, here's the index for this particular query.

USE [NatimarkOnline]
GO

/****** Object: Index [_dta_index_Consumer_8_1819257636_52311312_K21_K62_K5_K20_K19_K24_K6_K2_163] Script Date: 09/13/2009 20:54:13 ******/
CREATE NONCLUSTERED INDEX [_dta_index_Consumer_8_1819257636_52311312_K21_K62_K5_K20_K19_K24_K6_K2_163] ON [dbo].[consumer]
(
[NAME_TYPE] ASC,
[HOME_BUSINESS] ASC,
[STATE] ASC,
[DO_NOT_CALL] ASC,
[PHONE] ASC,
[LAST_NAME] ASC,
[ZIP] ASC,
[STREET] ASC
)
INCLUDE ( [uid]) WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF, IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO



tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-09-14 : 01:11:47
You've shown us an index on the consumer table, but that table isn't involved in your query. Please show us all indexes on the MAIN and CONTACT tables.

Posting the execution plan would be very helpful so that we can see the weak spots.
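
To capture it as text, SHOWPLAN_TEXT should do the trick; a minimal sketch (each SET statement has to run in its own batch, and the SELECT here is just a placeholder for your query):

```sql
-- Show the estimated plan as text instead of executing the query.
SET SHOWPLAN_TEXT ON;
GO

-- ...run the query you want the plan for here (placeholder example)...
SELECT COUNT(*) FROM TEST.dbo.MAIN;
GO

-- Turn it back off so later queries actually execute.
SET SHOWPLAN_TEXT OFF;
GO
```

The output is the operator tree as rows of text, which pastes cleanly into a forum post.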

Tara Kizer

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-14 : 01:15:02
how do I get the execution plan in a format I can post here? isn't there a way to output it to text?

as for the index, these really are the same tables: the consumer table is the big massive one, and I have the same issue on either the test.dbo. tables or the consumer table.

I can post both though, just need to know how to export the execution plan to text.

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-14 : 17:06:19
hope this helps. looks like it's running a table scan on MAIN, but the sort is still 42% of the workload; cutting that back will help A LOT... also, any recommendations on indexing MAIN? again, I'm a total newbie when it comes to indexing.



|--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
  |--Stream Aggregate(GROUP BY:([m].[STATE]) DEFINE:([globalagg1011]=SUM([partialagg1010])))
    |--Sort(ORDER BY:([m].[STATE] ASC))
      |--Parallelism(Gather Streams)
        |--Hash Match(Partial Aggregate, HASH:([m].[STATE]), RESIDUAL:([Test].[dbo].[MAIN].[STATE] as [m].[STATE] = [Test].[dbo].[MAIN].[STATE] as [m].[STATE]) DEFINE:([partialagg1010]=COUNT(*)))
          |--Filter(WHERE:([Expr1006]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
              |--Sequence Project(DEFINE:([Expr1006]=row_number))
                |--Segment
                  |--Parallelism(Gather Streams, ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC))
                    |--Sort(ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC))
                      |--Hash Match(Inner Join, HASH:([c].[uid])=([m].[uid]), RESIDUAL:([Test].[dbo].[CONTACT].[uid] as [c].[uid]=[Test].[dbo].[MAIN].[uid] as [m].[uid]))
                        |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([c].[uid]))
                        |  |--Index Seek(OBJECT:([Test].[dbo].[CONTACT].[_dta_index_CONTACT_6_421576540_5370704_K1_K17] AS [c]), SEEK:([c].[NAME_TYPE]='P') ORDERED FORWARD)
                        |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[uid]))
                          |--Table Scan(OBJECT:([Test].[dbo].[MAIN] AS [m]))

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-14 : 19:26:27
Okay, I added an index to fix the last table scan: |--Table Scan(OBJECT:([Test].[dbo].[MAIN] AS [m]))

the Sort is still costing 53% of the query; that's what I'm trying to see if I can clean up, unless you guys see something else.

now this is the new execution:

|--Compute Scalar(DEFINE:([Expr1007]=CONVERT_IMPLICIT(int,[globalagg1011],0)))
  |--Stream Aggregate(GROUP BY:([m].[STATE]) DEFINE:([globalagg1011]=SUM([partialagg1010])))
    |--Sort(ORDER BY:([m].[STATE] ASC))
      |--Parallelism(Gather Streams)
        |--Hash Match(Partial Aggregate, HASH:([m].[STATE]), RESIDUAL:([Test].[dbo].[MAIN].[STATE] as [m].[STATE] = [Test].[dbo].[MAIN].[STATE] as [m].[STATE]) DEFINE:([partialagg1010]=COUNT(*)))
          |--Filter(WHERE:([Expr1006]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
              |--Sequence Project(DEFINE:([Expr1006]=row_number))
                |--Segment
                  |--Parallelism(Gather Streams, ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC))
                    |--Sort(ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC))
                      |--Hash Match(Inner Join, HASH:([c].[uid])=([m].[uid]), RESIDUAL:([Test].[dbo].[CONTACT].[uid] as [c].[uid]=[Test].[dbo].[MAIN].[uid] as [m].[uid]))
                        |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([c].[uid]))
                        |  |--Index Seek(OBJECT:([Test].[dbo].[CONTACT].[_dta_index_CONTACT_6_421576540_5370704_K1_K17] AS [c]), SEEK:([c].[NAME_TYPE]='P') ORDERED FORWARD)
                        |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([m].[uid]))
                          |--Index Scan(OBJECT:([Test].[dbo].[MAIN].[_Index_main_state_count] AS [m]))


here are the indexes:

CREATE NONCLUSTERED INDEX [_Index_main_state_count] ON [dbo].[MAIN]
(
[STATE] ASC,
[PHONE] ASC,
[uid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]

CREATE NONCLUSTERED INDEX [_dta_index_CONTACT_6_421576540_5370704_K1_K17] ON [dbo].[CONTACT]
(
[NAME_TYPE] ASC,
[uid] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
GO

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-09-15 : 08:51:30
I think you need an index on MAIN with uid placed first....since it's part of the join.

CREATE NONCLUSTERED INDEX [_Index_main_state_count] ON [dbo].[MAIN]
(
[uid] ASC,
[STATE] ASC,
[PHONE] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, SORT_IN_TEMPDB = OFF,
IGNORE_DUP_KEY = OFF, DROP_EXISTING = OFF, ONLINE = OFF, ALLOW_ROW_LOCKS = ON,
ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]


I'd also query why you have "order by m.uid desc" for a summary-by-state report; it appears to add little value. Can you experiment and take it out?

Depending on your statistics settings, you may need to re-compute stats for the tables and indexes after any major data changes. Sorting in tempdb may also help. Again... experiment: change one factor, measure, repeat.
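
For example, the commands would look roughly like this (a sketch reusing the index definition posted above; try it on the test box first):

```sql
-- Recompute statistics after a bulk load (FULLSCAN is slow but thorough).
UPDATE STATISTICS dbo.MAIN WITH FULLSCAN;
UPDATE STATISTICS dbo.CONTACT WITH FULLSCAN;

-- Rebuild the existing index with the sort done in tempdb
-- instead of the data filegroup.
CREATE NONCLUSTERED INDEX [_Index_main_state_count] ON dbo.MAIN
(
    [STATE] ASC,
    [PHONE] ASC,
    [uid] ASC
)
WITH (DROP_EXISTING = ON, SORT_IN_TEMPDB = ON);
```

SORT_IN_TEMPDB only pays off when tempdb sits on different spindles than the data files.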

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-15 : 13:13:52
I'll check all those and get back to you, thanks A LOT. if you and tara could subscribe to
this thread, I'd REALLY appreciate it. I'll post back the results soon as I can (probably
tonight or tomorrow)

this is a test database, so I can change a lot w/o having to worry about users getting pissy,
it just takes a while with the tables being so large. as for re-computing stats, the database
is STATIC; it gets dropped and recreated every 3 months, so I wouldn't need to worry about
re-computing, correct?

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-09-16 : 12:04:16
re stats... I don't know... are the stats created/good before or after the data is loaded?

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-16 : 12:35:17
stats and indexes are dropped, data is imported, and then they're
recreated after we update it every quarter... I'm still shuffling
data around (having to make some free space) so I haven't had a
chance to test stuff yet. I can tell you though that with the
order by m.uid, the row_number() function requires the ORDER BY
clause; I believe that's why it's there. but I'm still gonna test
it a little later, just waiting for the server to finish its
workload so I can play :D

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-19 : 13:55:29
Okay, so I've tried the indexing w/ the UID first, w/ the sort in tempdb on and off, and it seems like the original index is the fastest...

still running @ like 15 minutes though... so here's my next question. let's simplify it, this query:

SELECT state, count(*)
FROM (
    SELECT m.state,
           row_number() OVER (PARTITION BY phone ORDER BY m.uid DESC) AS recid
    FROM TEST.dbo.mem_MAIN m
) AS f
WHERE recid = 1
GROUP BY state
ORDER BY state

it's still taking 10 minutes



here's the execution:

|--Compute Scalar(DEFINE:([Expr1004]=CONVERT_IMPLICIT(int,[globalagg1007],0)))
  |--Stream Aggregate(GROUP BY:([m].[STATE]) DEFINE:([globalagg1007]=SUM([partialagg1006])))
    |--Sort(ORDER BY:([m].[STATE] ASC))
      |--Parallelism(Gather Streams)
        |--Hash Match(Partial Aggregate, HASH:([m].[STATE]), RESIDUAL:([Test].[dbo].[mem_MAIN].[STATE] as [m].[STATE] = [Test].[dbo].[mem_MAIN].[STATE] as [m].[STATE]) DEFINE:([partialagg1006]=COUNT(*)))
          |--Filter(WHERE:([Expr1003]=(1)))
            |--Parallelism(Distribute Streams, RoundRobin Partitioning)
              |--Sequence Project(DEFINE:([Expr1003]=row_number))
                |--Segment
                  |--Parallelism(Gather Streams, ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC))
                    |--Sort(ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC))
                      |--Index Scan(OBJECT:([Test].[dbo].[mem_MAIN].[_Index_mem_main_state_count] AS [m]))


Again, the |--Sort(ORDER BY:([m].[PHONE] ASC, [m].[uid] DESC)) line is taking 72% of the batch....

any ideas on how i can get a query like this back in 30 seconds or less?

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-09-22 : 04:24:58
Out of curiosity, have you baselined how a simple SELECT COUNT(*) from the main table performs? Is your expectation of the DB's ability to process that amount of data too high for the hardware/volume of data involved?

Is there any way the table could be re-designed to remove the duplicate information?

Also... do you need 100% accurate results for this query, or is it MIS/approximate reporting? Could you live with the duplicates being counted? Are they a large portion of the database? Could you live with a one-off snapshot of the number/spread of the duplicates, applying those percentages to any "duplicate inclusive" numbers to produce a "very, very close to accurate" non-duplicate report... albeit in a far faster time than with the duplicates resolved?

You may be between a rock and a hard-place. Consider some alternatives, however outlandish they may seem.
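
For the baseline, one near-instant alternative to a full COUNT(*) is the row count SQL Server already keeps in its partition metadata; a sketch (the numbers can drift slightly while data is changing, which may be fine for an approximate report):

```sql
-- Approximate row count straight from partition metadata (SQL 2005+).
-- index_id 0 = heap, 1 = clustered index; either holds the base row count.
SELECT SUM(ps.row_count) AS approx_rows
FROM sys.dm_db_partition_stats AS ps
WHERE ps.object_id = OBJECT_ID('dbo.MAIN')
  AND ps.index_id IN (0, 1);
```

On a static database like this one, the metadata count should match the real count exactly.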

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2009-09-23 : 08:51:54
Can you time this query and also verify whether it gives the same result? ->

SELECT state, count(*)
FROM (
    SELECT m.state, phone, max(m.uid) AS recid
    FROM TEST.dbo.MAIN m
    JOIN TEST.dbo.CONTACT c
      ON m.uid = c.uid AND c.Name_Type = 'P'
    GROUP BY m.state, phone
) AS f
GROUP BY state
ORDER BY state


- Lumbago

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-23 : 17:28:22
Andrew:
the initial count takes about 1 minute; after that, every select count(*) takes about 2 seconds.
as for the db server, it's got 32gb of memory, dual x64 quad-core xeons, and about 6 hard drives (1 array for os, 1 for logs/sql install, 1 for data store), so it's fairly beast I think. maybe I'm wrong. as for how accurate it would have to be, I would think no more than .5% off would be acceptable, however the bosses may cry a little... I would really just have to see what happens when we start filling our requests and whether customers complain.


Lumbago:
your query took about 1/2 the time mine did, and that's a 3-query average. HOWEVER, it also produced 30,000 more records than mine. the universe is like 225M though, so it's about .45% off (that's less than 1/2 a percent)

so what's the logic behind yours vs mine? and any way we can speed it up a little more?

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2009-09-24 : 03:32:10
Ok, the increased performance is mostly due to not using the row_number() function, I would guess, since you actually only need the highest "uid" for each phone number... right? I'd think that just doing MAX() is cheaper than first creating an ordered list and then choosing the first in the order. I also moved the "c.Name_Type = 'P'" from the where clause to the join condition; you can try putting it back in the where clause to confirm, though for an inner join the two forms should return the same rows. The extra 30k rows more likely come from the GROUP BY: grouping by (state, phone) counts a phone once for every state it appears under, whereas row_number() keeps exactly one row per phone.
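
A quick sanity check (a sketch using the column names from earlier posts): phones listed under more than one state would inflate the GROUP BY state, phone version, since each such phone gets counted once per state. This counts them:

```sql
-- Phones that appear under more than one state explain a count difference
-- between the GROUP BY state, phone version and the ROW_NUMBER version.
SELECT COUNT(*) AS multi_state_phones
FROM (
    SELECT phone
    FROM TEST.dbo.MAIN
    GROUP BY phone
    HAVING COUNT(DISTINCT [STATE]) > 1
) AS d;
```

If this comes back near 30,000, that accounts for the discrepancy.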

I'm a bit rusty on the indexing part but you should definitely have an index on both MAIN and CONTACT with the uid as the first column in the index. Can you also post the execution plan for my query with the indexes I proposed...?



- Lumbago

Lumbago
Norsk Yak Master

3271 Posts

Posted - 2009-09-24 : 03:34:14
Peso should really be a part of this discussion..Peso; where are you man??

- Lumbago

AndrewMurphy
Master Smack Fu Yak Hacker

2916 Posts

Posted - 2009-09-24 : 10:39:59
"the initial count takes about 1 minutes, however after that, every query for select count (*) takes about 2 seconds."

This would seem to put a floor of 1 min on how good your performance can get, and thus may imply that your initial requirement of 30 secs is unrealistic. The 2-second repeat indicates the benefit of the cache. For true repeatable timings, run the following between repetitions of the code... unless you plan to run an automated job at the start of each day to get the data into cache.

DBCC DROPCLEANBUFFERS
DBCC FREEPROCCACHE


One other option to look at may be "partitioning", where you can split the table/database over multiple spindles.

And if you can answer Lumbago's last request, it may help us more.

albertkohl
Aged Yak Warrior

740 Posts

Posted - 2009-09-24 : 14:28:19
okay, so a couple things: 1) I'm setting up the UID column to be the
PK. while doing this, I have to re-index everything anyway because I
need to drop and re-create the tables, so I'm setting the indexes
all up to start w/ UID...

I'll probably have that all done in a day or two, soon as I do I'll
post some more stuff.

Lumb: I'm still off qty-wise w/ your query vs mine even when I
moved the name_type back to the where clause. I'm kinda curious why
though.

Andrew: the two DBCC commands, is that something that's gonna take a
while to run every day? I would have no problem (I don't think)
running that every day at like 12:00am; are there any drawbacks to
doing so?

as for partitioning, I kinda tried that w/ distributed/federated
views; worked like crap. how about maybe doing it locally only?
like have, for example:

mem_contact1 with a name_type='P' constraint, and then a
mem_contact2 with a name_type<>'P' constraint. the vast majority
of our queries have the name_type='P' select, so in that case I can
ignore the other crapola and weed down the universe of results
because the table will ONLY have P's in it. is that right?

what else? come on, hit me! I'm fired up now :D
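
(An alternative to physically splitting the table: SQL 2008 supports filtered indexes, which would give the common name_type='P' case a much smaller structure to seek, without maintaining a second table. A sketch — the index name is made up:)

```sql
-- Filtered index (SQL 2008+): only NAME_TYPE = 'P' rows are stored,
-- so seeks for the common case touch a much smaller index.
CREATE NONCLUSTERED INDEX IX_CONTACT_Ptype_uid
ON dbo.CONTACT ([uid])
WHERE [NAME_TYPE] = 'P';
```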

tkizer
Almighty SQL Goddess

38200 Posts

Posted - 2009-09-24 : 16:46:35
Don't use partitioned views, look into partitioned tables. They are completely different technologies.
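
A minimal sketch of what table partitioning looks like in 2008 (the names and boundary values here are invented; in practice you'd partition on whatever column your queries filter or join by):

```sql
-- 1) Partition function: maps STATE values into ranges.
CREATE PARTITION FUNCTION pf_state (char(2))
AS RANGE RIGHT FOR VALUES ('G', 'N', 'T');

-- 2) Partition scheme: maps each range to a filegroup (all to PRIMARY here;
--    separate filegroups on separate spindles is where the I/O win comes from).
CREATE PARTITION SCHEME ps_state
AS PARTITION pf_state ALL TO ([PRIMARY]);

-- 3) Table created on the scheme, partitioned by STATE.
CREATE TABLE dbo.MAIN_partitioned
(
    uid   int         NOT NULL,
    STATE char(2)     NOT NULL,
    PHONE varchar(10) NULL
) ON ps_state (STATE);
```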

Tara Kizer