Help with tabledesign for optimal performance

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

Help with tabledesign for optimal performance

Author

Topic

memnon
Starting Member

1 Post

Posted - 2010-08-03 : 06:30:23

Hi,

I am not an sql-expert so i need you guys to help me with a tabledesign. The table is going to contain statistics over visitors in a shop. And the information i want to get out of this is the number of visitors between two dates.

example table:

(tiny-int, smalldatetime)

------------------------------
site_id site_datetime
------------------------------
4 2010-07-13 10:45:21
6 2010-07-14 11:09:36
8 2010-07-14 09:28:54
7 2010-07-16 08:14:11
4 2010-07-17 16:32:07

So the only column in the table that i need is id and datetime. For each visitor that enters the shop a new row is inserted into the table, so the size of the table is growing very fast. Millions of rows eventually i guess. The sql query i use to get the statistics i want looks something like this:

example get the visitour count for a specific store between two dates:

select count(id) AS visitors from tbl_statistics where site_datetime >= '2009-01-01' AND site_datetime < '2010-01-01' AND site_id = 8;

Now here comes my questions:

How do i design this table for optimal performance considering the amount of rows it's going to contain and the select statements that are to be used?
Must i have an extra column with a primary key?
Which type of indexes shall i have and on which colums (clustered or non clustered)?
Stored procedures??

I really need your help on this one cause as i said im a beginner in this field.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 08:36:50

Which version of SQL Server are you using?
If you are using SQL Server 2008, you can create a filtered index per site_id which will make your queries very fast.
Another suggestion is to make the clustered index over {site_id, site_datetime}.

N 56°04'39.26"
E 12°55'05.63"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 08:38:43

If you don't need the time part in any of your calculations, use DATE datatype instead.
That will give you a total of 4 bytes per row instead of 9. This will make your queries almost twice as fast.

N 56°04'39.26"
E 12°55'05.63"

Kristen
Test

22859 Posts

Posted - 2010-08-03 : 08:45:12

"Another suggestion is to make the clustered index over {site_id, site_datetime}."

Would you prefer that to {site_datetime, site_id}?

Will SQL use an index that has a low-selective first column? (well, I suppose it will have to if its the Clustered Index

- and maybe that's your point?)

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 08:52:53

The only reason right now, is that I know nothing about the data and certainly nothing about the queries.
I just assume the site_id's are evenly distributed and the site_datetime is ever increasing.
The caveat is that we don't know about the ranges used for the queries.

Are they always small ranges, such as a day? Two days?
Are the ranges more often larger, such as a month? A quarter? Or as in the example above, a year?

I think I will have to do a test with a million record table.

N 56°04'39.26"
E 12°55'05.63"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 09:25:10

Here are the results

CREATE TABLE	#Sites
		(
			SiteID TINYINT IDENTITY(1, 1) PRIMARY KEY CLUSTERED
		)
GO

INSERT	#Sites
DEFAULT	VALUES
GO 100

CREATE TABLE	#Dates
		(
			RowID INT IDENTITY(1, 1) NOT NULL,
			dt AS (DATEADD(DAY, RowID, 36523)) PERSISTED
		)
GO

INSERT	#Dates
DEFAULT	VALUES
GO 10000

CREATE TABLE	#Sample
		(
			SiteID TINYINT NOT NULL,
			dt DATETIME NOT NULL
		)
GO

CREATE CLUSTERED INDEX CX_Sample ON #Sample (SiteID, dt)
GO

INSERT		#Sample
		(
			SiteID,
			dt
		)
SELECT		SiteID,
		dt
FROM		#Sites
CROSS JOIN	#Dates
GO

SELECT	COUNT(*) AS Visitors
FROM	#Sample
WHERE	dt >= '20090101'
	AND dt < '20100101'
	AND SiteID = 8
	
/* Clustered index over SiteID, dt
Table '#Sample'. Scan count 1, logical reads 4.
*/
GO

DROP INDEX #Sample.CX_Sample
GO

CREATE CLUSTERED INDEX CX_Sample ON #Sample (dt, SiteID)
GO

SELECT	COUNT(*) AS Visitors
FROM	#Sample
WHERE	dt >= '20090101'
	AND dt < '20100101'
	AND SiteID = 8
/* Clustered index over dt, SiteID
Table '#Sample'. Scan count 1, logical reads 85.
*/
GO

DROP TABLE	#Dates,
		#Sites,
		#Sample
GO

N 56°04'39.26"
E 12°55'05.63"

Kristen
Test

22859 Posts

Posted - 2010-08-03 : 09:29:34

You got some timings? (to save me having to rerun your test

). Thanks.

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 09:51:21

First query, 0 ms.
Second query, 6 ms.

N 56°04'39.26"
E 12°55'05.63"

X002548
Not Just a Number

15586 Posts

Posted - 2010-08-03 : 09:53:36

I've seen this request before...I believe it is project work for a college class

Brett

8-)

Hint: Want your questions answered fast? Follow the direction in this link
http://weblogs.sqlteam.com/brettk/archive/2005/05/25/5276.aspx

Add yourself!
http://www.frappr.com/sqlteam

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 10:20:58

I increased the sitecount to 1000, and the dates to 100,000.

First query - 0 ms, 1 scan and 5 logical reads.
Second query - CPU 47, Duration 35, scan 1 and logical reads 864.

N 56°04'39.26"
E 12°55'05.63"

Kristen
Test

22859 Posts

Posted - 2010-08-03 : 10:38:18

Interesting, I would have expected it to go the other way.

But if its doing a SCAN then site-first makes sense. And, I reckon, because its the Clustered Index the selectivity of first column is irrelevant.

Whereas if you tried a Non-clustered index (and maybe had a clustered index on an IDENTITY column) then perhaps the Non-clustered index would be ignored as not seelctive enough?

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 11:33:15

I have to do all the work myself?

N 56°04'39.26"
E 12°55'05.63"

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-03 : 11:37:11

With either index, the query does a clustered index seek.

N 56°04'39.26"
E 12°55'05.63"

Kristen
Test

22859 Posts

Posted - 2010-08-03 : 11:46:12

quote:
Originally posted by Peso

I have to do all the work myself?

Sure. You're the optimisation specialist. I read it on the web somewhere

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2010-08-03 : 12:40:28

It makes sense if you think about how indexes work. If you can seek right to the SiteID, then get the date range is/should be much faster than looking for a date range over all the records, then filter by the SiteId precicate. But, who knows what the actual queries are like.. ;)

Kristen
Test

22859 Posts

Posted - 2010-08-03 : 13:00:55

Yeah, that makes sense provided that the first column is sufficiently selective - which I would not expect SiteID to be - and SQL doesn't (as far as I know) keep statistics for the combination of columns.

But maybe it is smart enough to work out that single-value on first key fields and range-on-second must be worthwhile using the index? What if the index was Non-clustered and the query was going to select all rows? that would not be worth using an index - hence, surely, Statistics are required even then?

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2010-08-04 : 11:41:12