Clustering Cheat or Good Idea - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

SQL Server Development (2000)

Clustering Cheat or Good Idea

Author

Topic

SamC
White Water Yakist

3467 Posts

Posted - 2004-12-21 : 08:52:02

If I have a table

create table Mytable (
AdminID INT ... ,
EmailID INT )

There's no information in the table that's any good for clustering. The table gets large, and it would be good to cluster the information by "CourseID" - reports on the information are generally drawn on the course, but retrieving the CourseID through EmailID is a couple of tables away.

I want to cheat, and stuff CourseID into this table even though it won't be referenced just so I can rest easily clustering the data...

create table Mytable (
MytabID INT NOT NULL UNIQUE IDENTITY (1,1),
AdminID INT ... ,
EmailID INT ,
CourseID INT ,
constraint [MyCluster] UNIQUE CLUSTERED
CourseID,
MytabID )
)

This CLUSTER should collect all the data that will be queried in one convenient spot in the table. Probably all the data in under a few 10's of Kb of disk space.

Is this reasonable for the sake of performance improvement or should I let the data be scattered accross 10s of megabytes and use an INDEX?

X002548
Not Just a Number

15586 Posts

Posted - 2004-12-21 : 10:41:06

I don't get it...are you going after the data through courseid?

Why the IDENTITY?

Sample Data would shed some light I would imagine

And do you do Ice boat racing when the Kayak is put away?

Damn it's cold here

Brett

8-)

SamC
White Water Yakist

3467 Posts

Posted - 2004-12-21 : 11:29:43

quote:
Originally posted by X002548

I don't get it...are you going after the data through courseid?

Why the IDENTITY?

Course ID is a great cluster value, but the cluster has to be unique, so I added an IDENTITY.

It's a kludge but it does create a great cluster, at the expense of doubling the columns with data that's not needed.

I'd wondered if anyone else had done such a backbend to get a good cluster going..

Sam

No iceboating here. We've got hills to ski however.

Bustaz Kool
Master Smack Fu Yak Hacker

1834 Posts

Posted - 2004-12-21 : 13:04:12

Stray thoughts....

1) What is your table supposed to be? A set of Courses? Administrators? Classrooms? I can't see how a Course could have an EMail but you know the specifics of the situation better than I ever will.

2) "Course ID is a great cluster value, but the cluster has to be unique, so I added an IDENTITY." - Would you ever have two courses that had the same Course ID? If not, why do you need to make it unique since it already is? Even for the case where there are several instances of the same course, there is usually a means of distinguishing one from the others. This might result in a multi-valued key but who cares?

3) The clustering index is usually the primary key (and is by default) but this doesn't mean that it needs to be. The clustered index doesn't even need to be unique.

4) We usually cluster the data with the intention of using that index to SELECT or JOIN the data in the table. How do you envision getting the data? By AdminID? By EMailID?

5) "There's no information in the table that's any good for clustering." - Even if no convenient column or group of columns is a natural choice, it is useful to apply a clustered index to the table. Since all tables have a primary key, you can always build the clustered index to that.

6) "retrieving the CourseID through EmailID is a couple of tables away." - Joining together a couple of tables is not a real burden on the processing. It that is your biggest concern, I would not denormalize the schema. Other would though.

7) "We've got hills to ski however." - In Gaithersburg? I don't think so...

HTH

=================================================================

The best things in life are nearest: Breath in your nostrils, light in your eyes, flowers at your feet, duties at your hand, the path of right just before you. -Robert Louis Stevenson, novelist, essayist, and poet
(1850-1894)

X002548
Not Just a Number

15586 Posts

Posted - 2004-12-21 : 16:04:52

To be fair, Sam did say Hills, not Mountains

And Sam, you still got me stumped

Is there some performance thing you're seeing?

Brett

8-)

SamC
White Water Yakist

3467 Posts

Posted - 2004-12-22 : 08:52:58

I wish I hadn't asked this question. It's messy.

Nevermind the facts

lemme post the question generically.

It's my understanding that clustering data is a good thing because it can take related data and make it physically sequential in a segment of a table.

As a worst case, a 200 Mb table of "Company" data that is not clustered around the CompanyID could take a long time to query. Records are distributed randomly thru the 200 Mb translates into a lot of disk sector reads no matter how well it is indexed.

If the same table were Clustered around CompanyID, the theory is that retrieving all data form "Company A" is physically sequential). Saving a lot of disk sector reads.

For example consider a table of two pointers: EmployeeID and EmployeeCommentID). This is a common structure that builds a relationship between two other tables which may hold CompanyID and other data. The EmployeeID itself doesn't have a clustering benefit, but we could throw in another column "CompanyID" even though it isn't needed for querying and cluster on CompanyID, EmployeeID.

Queries could then be modified to JOIN on CompanyID, EmployeeID and yield the benefit of having all EmployeeIDs of a given company sequentially located in the 200 Mb table.

I thought someone might have a clear reason why this is a bad idea, useless idea... Comments welcomed here.

Taken to an extreme, I could see every table of "Company" data having CompanyID added for clustering purposes. Sounds like it isn't a good idea in this respect.

Bustaz Kool
Master Smack Fu Yak Hacker

1834 Posts

Posted - 2004-12-22 : 09:33:25

"The EmployeeID itself doesn't have a clustering benefit" - This is not true. Clustering on anything will provide some benefit. Clustering on the keys in your junction table will certianly be beneficial.

Now, if you had a database that held this type of data for a number of different companies, adding the CompanyID to the table and using it in the clustered index could be of benefit. But if you required the CompanyID in order to differentiate employee 12345 at IBM from employee 12345 at Micrsoft, the CompanyID would already be part of the key.

HTH

=================================================================

Sometimes I wonder whether the world is being run by smart people who are putting us on or by imbeciles who really mean it. -Mark Twain, author and humorist (1835-1910)

X002548
Not Just a Number

15586 Posts

Posted - 2004-12-22 : 10:24:00

quote:
Originally posted by Bustaz Kool

"The EmployeeID itself doesn't have a clustering benefit" - This is not true. Clustering on anything will provide some benefit. Clustering on the keys in your junction table will certianly be beneficial.

And the benefit depends on the cardinality of the data...

Are you talking about cascading keys down through the generations?

Brett

8-)

SamC
White Water Yakist

3467 Posts

Posted - 2004-12-22 : 10:30:09

I was running with the idea that EmployeeID was an IDENTITY and had no unique characteristic which tied it to a speicific company.

Bustaz Kool
Master Smack Fu Yak Hacker

1834 Posts

Posted - 2004-12-22 : 12:34:35

"I was running with the idea that EmployeeID was an IDENTITY and had no unique characteristic which tied it to a speicific company." - This is precisely the argument AGAINST using surrogate keys but that's another long, excruciatingly painful and usually pointless tangent. ;-)

HTH

=================================================================

Sometimes I wonder whether the world is being run by smart people who are putting us on or by imbeciles who really mean it. -Mark Twain, author and humorist (1835-1910)

SamC
White Water Yakist

3467 Posts

Posted - 2004-12-22 : 18:58:36

Thanks for pointing that out... again...

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2004-12-26 : 16:43:35

Denormalization to gain performance.... with a clustering twist

.
IMO it is always better to keep normalization.

Try using an indexed view, and see what perf improvements You get.
Build a reporting schema, where data is physically laid out to optimize reporting requirements.

rockmoose

SamC
White Water Yakist

3467 Posts

Posted - 2004-12-26 : 18:03:50

I decided against it. I don't have a performance issue that justifies this kind of mess'in around.

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2004-12-26 : 18:29:18

Your successor will thank You, and will speak well of You!
>>I don't have a performance issue that justifies this kind of mess'in around
Very seldom have.

rockmoose

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources