| Author |
Topic |
|
memnon
Starting Member
1 Post |
Posted - 2010-08-03 : 06:30:23
|
| Hi,I am not an sql-expert so i need you guys to help me with a tabledesign. The table is going to contain statistics over visitors in a shop. And the information i want to get out of this is the number of visitors between two dates.example table:(tiny-int, smalldatetime)------------------------------site_id site_datetime------------------------------4 2010-07-13 10:45:216 2010-07-14 11:09:368 2010-07-14 09:28:547 2010-07-16 08:14:114 2010-07-17 16:32:07So the only column in the table that i need is id and datetime. For each visitor that enters the shop a new row is inserted into the table, so the size of the table is growing very fast. Millions of rows eventually i guess. The sql query i use to get the statistics i want looks something like this:example get the visitour count for a specific store between two dates:select count(id) AS visitors from tbl_statistics where site_datetime >= '2009-01-01' AND site_datetime < '2010-01-01' AND site_id = 8;Now here comes my questions:How do i design this table for optimal performance considering the amount of rows it's going to contain and the select statements that are to be used?Must i have an extra column with a primary key?Which type of indexes shall i have and on which colums (clustered or non clustered)?Stored procedures??I really need your help on this one cause as i said im a beginner in this field. |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 08:36:50
|
Which version of SQL Server are you using?If you are using SQL Server 2008, you can create a filtered index per site_id which will make your queries very fast.Another suggestion is to make the clustered index over {site_id, site_datetime}. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 08:38:43
|
If you don't need the time part in any of your calculations, use DATE datatype instead. That will give you a total of 4 bytes per row instead of 9. This will make your queries almost twice as fast. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-08-03 : 08:45:12
|
"Another suggestion is to make the clustered index over {site_id, site_datetime}."Would you prefer that to {site_datetime, site_id}?Will SQL use an index that has a low-selective first column? (well, I suppose it will have to if its the Clustered Index - and maybe that's your point?) |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 08:52:53
|
The only reason right now, is that I know nothing about the data and certainly nothing about the queries.I just assume the site_id's are evenly distributed and the site_datetime is ever increasing.The caveat is that we don't know about the ranges used for the queries.Are they always small ranges, such as a day? Two days?Are the ranges more often larger, such as a month? A quarter? Or as in the example above, a year?I think I will have to do a test with a million record table. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 09:25:10
|
Here are the resultsCREATE TABLE #Sites ( SiteID TINYINT IDENTITY(1, 1) PRIMARY KEY CLUSTERED )GOINSERT #SitesDEFAULT VALUESGO 100CREATE TABLE #Dates ( RowID INT IDENTITY(1, 1) NOT NULL, dt AS (DATEADD(DAY, RowID, 36523)) PERSISTED )GOINSERT #DatesDEFAULT VALUESGO 10000CREATE TABLE #Sample ( SiteID TINYINT NOT NULL, dt DATETIME NOT NULL )GOCREATE CLUSTERED INDEX CX_Sample ON #Sample (SiteID, dt)GOINSERT #Sample ( SiteID, dt )SELECT SiteID, dtFROM #SitesCROSS JOIN #DatesGOSELECT COUNT(*) AS VisitorsFROM #SampleWHERE dt >= '20090101' AND dt < '20100101' AND SiteID = 8 /* Clustered index over SiteID, dtTable '#Sample'. Scan count 1, logical reads 4.*/GODROP INDEX #Sample.CX_SampleGOCREATE CLUSTERED INDEX CX_Sample ON #Sample (dt, SiteID)GOSELECT COUNT(*) AS VisitorsFROM #SampleWHERE dt >= '20090101' AND dt < '20100101' AND SiteID = 8/* Clustered index over dt, SiteIDTable '#Sample'. Scan count 1, logical reads 85.*/GODROP TABLE #Dates, #Sites, #SampleGO N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-08-03 : 09:29:34
|
You got some timings? (to save me having to rerun your test ). Thanks. |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 09:51:21
|
First query, 0 ms.Second query, 6 ms. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
X002548
Not Just a Number
15586 Posts |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 10:20:58
|
I increased the sitecount to 1000, and the dates to 100,000.First query - 0 ms, 1 scan and 5 logical reads.Second query - CPU 47, Duration 35, scan 1 and logical reads 864. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-08-03 : 10:38:18
|
| Interesting, I would have expected it to go the other way.But if its doing a SCAN then site-first makes sense. And, I reckon, because its the Clustered Index the selectivity of first column is irrelevant.Whereas if you tried a Non-clustered index (and maybe had a clustered index on an IDENTITY column) then perhaps the Non-clustered index would be ignored as not seelctive enough? |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 11:33:15
|
I have to do all the work myself?  N 56°04'39.26"E 12°55'05.63" |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-03 : 11:37:11
|
With either index, the query does a clustered index seek. N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-08-03 : 11:46:12
|
quote: Originally posted by Peso I have to do all the work myself? 
Sure. You're the optimisation specialist. I read it on the web somewhere |
 |
|
|
Lamprey
Master Smack Fu Yak Hacker
4614 Posts |
Posted - 2010-08-03 : 12:40:28
|
| It makes sense if you think about how indexes work. If you can seek right to the SiteID, then get the date range is/should be much faster than looking for a date range over all the records, then filter by the SiteId precicate. But, who knows what the actual queries are like.. ;) |
 |
|
|
Kristen
Test
22859 Posts |
Posted - 2010-08-03 : 13:00:55
|
| Yeah, that makes sense provided that the first column is sufficiently selective - which I would not expect SiteID to be - and SQL doesn't (as far as I know) keep statistics for the combination of columns.But maybe it is smart enough to work out that single-value on first key fields and range-on-second must be worthwhile using the index? What if the index was Non-clustered and the query was going to select all rows? that would not be worth using an index - hence, surely, Statistics are required even then? |
 |
|
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
Posted - 2010-08-04 : 11:41:12
|
quote: Originally posted by Kristen Sure. You're the optimisation specialist. I read it on the web somewhere 
I'll have a talk with Spirit1 later  N 56°04'39.26"E 12°55'05.63" |
 |
|
|
Kristen
Test
22859 Posts |
|
|
|