Please start any new threads on our new
site at https://forums.sqlteam.com. We've got lots of great SQL Server
experts to answer whatever question you can come up with.
| Author |
Topic |
|
Antonio
Posting Yak Master
168 Posts |
Posted - 2006-08-14 : 06:14:42
|
| Greetings my SQL friends.Could somebody give me a brief and sweet explanation of the difference between a clustered and a non clustered index and when they should be used.I have read several articles but I am after the opinions of the developers who are members on this forum.I hope this becomes a good discussion about this topic.Thanks in advance._________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will. |
|
|
Wanderer
Master Smack Fu Yak Hacker
1168 Posts |
Posted - 2006-08-14 : 06:42:38
|
Clustered indexes form the data structure that contains the data of a table, and order that data per their key.Non-clustered indexes are structures that use their key to point to thr RID of the data they represent - that RID may be in a clustered index, or in a heap (if there are no clustered indexes on a table).There can only be one clustered index on a table, whereas there may be 'n' non-clustered indexes.Both clustered indexes and non-clustered indexes may be etiher unique or non-unqiue.note per BOL re: clustered indexes:quote: Note: When you create a PRIMARY KEY constraint, a unique index on the column, or columns, is automatically created. By default, this index is clustered; however, you can specify a nonclustered index when you create the constraint.
usage.... this is a big field.Indexes can have a few uses - the list below is NOT exhaustive:uniqueness of a column, or columns, is enforced via unique indexes. You may thus see the need to define several unique indexes on a table that has several candidate keys.access paths - the most common use of an index is to improve the performace of queries. This typically requires some decent usage data, or a good understanding of the data, and the transaction/statement requirement - ideally both. This is the most common area of discussion re: indexes - I'll give a little more detail here.Sorting. Ideally, you should know how your data is most likely to be accessed, and sorted. If you know this, you would typically want to define your clustering index based on that sort order, since this will give you the data stored in the order it is most likely to be sorted in, and improve your access path. This is the primary (afaik) use for clustered indexes (from the access path point of view).Non-clustered indexes may come in several 'sub-categories- - these are logical groupsing, since there is no real difference in syntax, or to SQL.- foreign key indexes. Since most joins will happen on foreign keys, it is usually a good starting point to ensure that foreign keys are supported by a index. Naturally, you need to review to see if they are being used, but they often will be.- covering indexes. particularly important, or slow, statements/procedures/transactions may require the addition of what are known as covering indexes. These indexes will include several columns, with the aim being to satisfy an access requirement completely, so that no table access is required. This typically improves performance, often significantly. NOTE - be careful. Do not start creating multiple covering indexes on the same table, as the likelyihood is that you will introduce performance problems on any insert/update/delete operations, when columns appear in many indexes.- lookup indexes. commonly used when the means of access for a table is not the primary key, but is used a lot.Remember - indexes improve access performance, and can even improve modify operations where self-referencing occurs, but over-indexing of a table can lead to negative performance where modification can have a worsened performance, which in extreme cases may introduce contention due to the length of the modify operations, and can thus negatively impact overall system performance.When indexing, make sure the index added is necessary, used, and the only one of that design (it is possible to create the same index, for the same columns, with different names, if you aren't careful. That is redundant, and gives no advantages).Finally, remember - frequently modified/updated columns hsould be avoided in indexes, if possible, since each change to the column require a change on teh data and the index(es) - so 1 IO operation becomes 2 (or more).HTH - I look forward to hearing, and learning, more from other responses here.*##* *##* *##* *##* Chaos, Disorder and Panic ... my work is done here! |
 |
|
|
Antonio
Posting Yak Master
168 Posts |
Posted - 2006-08-14 : 07:35:29
|
quote: Clustered indexes form the data structure that contains the data of a table, and order that data per their key.
Okay that's easy to understand. So if I have a the following table :Table AColumn1 Column21 Data12 Data23 Data3a clustered index on Column1 would physically store my data in the order above right?quote: Non-clustered indexes are structures that use their key to point to thr RID of the data they represent - that RID may be in a clustered index, or in a heap (if there are no clustered indexes on a table).
What is a RID? So how would a non clustered index look like for my simple table?quote: There can only be one clustered index on a table, whereas there may be 'n' non-clustered indexes.
Yes, that is very logical as a clustered index physically sorts the data so you can't have more than one as the order of the data would have to change. I hope I am right with my understanding.quote: uniqueness of a column, or columns, is enforced via unique indexes. You may thus see the need to define several unique indexes on a table that has several candidate keys.
is that the only way to enforce uniqueness?By the way, thank you for the lengthy reply. Much appreciated my friend.I am trying to learn as much as possible from you guys. So the more this thread grows with opinions the better._________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will. |
 |
|
|
Wanderer
Master Smack Fu Yak Hacker
1168 Posts |
Posted - 2006-08-14 : 08:15:07
|
quote: Originally posted by Antonio
quote: Clustered indexes form the data structure that contains the data of a table, and order that data per their key.
Okay that's easy to understand. So if I have a the following table :Table AColumn1 Column21 Data12 Data23 Data3a clustered index on Column1 would physically store my data in the order above right?
yes. As more data get's inserted, deleted etc. over an extended period of time, it is possible for data to become 'un-clustered', but regular maintenace jobs (DBCC Indexdefrag and.or dbcc dbreindexes) will resolve that.quote: Originally posted by Antonioquote: Non-clustered indexes are structures that use their key to point to thr RID of the data they represent - that RID may be in a clustered index, or in a heap (if there are no clustered indexes on a table).
What is a RID? So how would a non clustered index look like for my simple table?
A RID = Row Identifier. It is the internal reference num ber SQL server allocated to rows of data to be able to point to, and find, rows. You will virtually never deal with RIDs per se (I haven't in 6 years of working with SQL).quote: Originally posted by Antonioquote: There can only be one clustered index on a table, whereas there may be 'n' non-clustered indexes.
Yes, that is very logical as a clustered index physically sorts the data so you can't have more than one as the order of the data would have to change. I hope I am right with my understanding.
spot onquote: Originally posted by Antonioquote: uniqueness of a column, or columns, is enforced via unique indexes. You may thus see the need to define several unique indexes on a table that has several candidate keys.
is that the only way to enforce uniqueness?
Inside SQL server - yes. Uniquesness constraints create unique indexes to enforce. tEchnically, you could probably write a trigger to check a row that had been inserted, and reject the insert if there was a duplicate, but I suspect that would be in-efficient, and certainly (imho) inadvisablequote: Originally posted by AntonioBy the way, thank you for the lengthy reply. Much appreciated my friend.I am trying to learn as much as possible from you guys. So the more this thread grows with opinions the better._________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; the best companion is submission to Divine Will.
np - I hope to learn as well. I haven't delved into selectivity and how you decide on what keys should be where in the index. Also, not my comments above about index maintenance. Without regular maintenance, indexes become less and less useful, and eventually, SQL may even not use them.*##* *##* *##* *##* Chaos, Disorder and Panic ... my work is done here! |
 |
|
|
Antonio
Posting Yak Master
168 Posts |
Posted - 2006-08-14 : 08:21:19
|
quote: A RID = Row Identifier. It is the internal reference num ber SQL server allocated to rows of data to be able to point to, and find, rows. You will virtually never deal with RIDs per se (I haven't in 6 years of working with SQL).
So how does SQL Server hold these RIDs?!If we go back to my table of 2 columns and 3 rows. If I create an index on Column 2, does sql server create some sort of internal table containing 3 RIDs which tells sql server which page and whereabouts in the page that particular record is held?Sorry if I sound too simplistic but I really need to get this clear in my head._________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will. |
 |
|
|
nr
SQLTeam MVY
12543 Posts |
Posted - 2006-08-14 : 08:32:24
|
| I would suggest you get a copy of inside sql server - it covers this sort of thing quite well.In a clustered index the leaf level is the data row (hence you can only have one on a table).A non clustered holds a copy of the columns being indexed and a reference to the data row.Common mistake - the clustered index is nothing to do with the PK - sql server just makes a PK clustered by default if there isn't already a clustered index on the table.An index needs to be able to reference the row it is pointing at.If the clustered index is not unique then a unique dentifier (4 bytes) is added to the clustered index.If the table has no clustered index then the indexes will reference the rows in the data page via the rid (row identifier) (8 bytes - page address, file id, slot within page)If the table has a clustered index the non clustered indexed references the clustered index - i.e. contains a copy of the values for the clustered index.So if your clustered index is wide then all non-clustered indexes are also wide.Note the clustered index values are only added to the non clustered index if they are not already included.So the effect of having a wide clustered index is to make all other indexes wide.Making a clustered index non unique will add the 4 byte unique identifier to all indexes.Usually this sort of thing doesn't matter and it is best to go with the clustered index most applicable for the queries used.SQL Server efficiency is all about usage of memory. The fewer disk reads the faster things will run. Indexes are a means of reducing the number of disk reads. You also need to look at the code being executed as often it can be rewritten so that an index is not needed. Same goes for the schema design.==========================================Cursors are useful if you don't know sql.DTS can be used in a similar way.Beer is not cold and it isn't fizzy. |
 |
|
|
Wanderer
Master Smack Fu Yak Hacker
1168 Posts |
Posted - 2006-08-14 : 08:40:28
|
quote: Originally posted by Antonio
quote: A RID = Row Identifier. It is the internal reference num ber SQL server allocated to rows of data to be able to point to, and find, rows. You will virtually never deal with RIDs per se (I haven't in 6 years of working with SQL).
So how does SQL Server hold these RIDs?!
they are maintained by SQL on the indexes pages. As I tried to clarify, they are the means that the non-clustered index can 'point' to the actual data in the clustered index (or heap if there is no clustered index). RID may confuse, so it would be best to forget about them - they are an internal method sql uses to point to the right data.quote: Originally posted by AntonioIf we go back to my table of 2 columns and 3 rows. If I create an index on Column 2, does sql server create some sort of internal table containing 3 RIDs which tells sql server which page and whereabouts in the page that particular record is held?
effectively, yes. SQL create a b-tree structure (a seperate physical structre). Per BOL:quote: A nonclustered index contains the index key values and row locators that point to the storage location of the table data.
Those row locators are the rids.so, for example, based on your data, an index on column 2 might have:col2 RIDData1 RID2342Data2 RID2343Data3 RID8835which might point to your clustered index:RID colRID2342 1RID2343 2RID8835 3so for a query like:SELECT col1,col2 from TablaA where COL2 = 'Data2'sql might access the non-clusterd index , and find the :Data2 RID2343then use RID2343 to access the clustered index:RID2343 2, and then retrieve the data.That is a made-up example. This may not be exactly how SQL would evaluate the select, but hopefully it clarifies the example?quote: Originally posted by AntonioSorry if I sound too simplistic but I really need to get this clear in my head._________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will.
RIDs are used by SQL to identify page*##* *##* *##* *##* Chaos, Disorder and Panic ... my work is done here! |
 |
|
|
Antonio
Posting Yak Master
168 Posts |
Posted - 2006-08-14 : 09:01:42
|
quote: In a clustered index the leaf level is the data row (hence you can only have one on a table).
Okay this is where I get confused. What is the leaf level?! I mean, when a clustered index is created, is all that happens is the physical re-ordering of the data on disk? am I being stupid here?quote: A non clustered holds a copy of the columns being indexed and a reference to the data row.
Okay that's fair enough, but where is this stored? A system table that SQL Server uses?quote: An index needs to be able to reference the row it is pointing at.If the clustered index is not unique then a unique dentifier (4 bytes) is added to the clustered index.
Understood.quote: If the table has no clustered index then the indexes will reference the rows in the data page via the rid (row identifier) (8 bytes - page address, file id, slot within page)
So what happens if you apply a covering index? Will 8 bytes be enough to hold the information?quote: SQL Server efficiency is all about usage of memory. The fewer disk reads the faster things will run. Indexes are a means of reducing the number of disk reads. You also need to look at the code being executed as often it can be rewritten so that an index is not needed. Same goes for the schema design.
As that possible?! If you're writing queries which reference tables that have millions of records surely indexes on the columns you join on are vital no?_________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will. |
 |
|
|
nr
SQLTeam MVY
12543 Posts |
Posted - 2006-08-14 : 09:20:09
|
| >> What is the leaf level?! Leaf level is the lowest level of the index.Think of a tree. The highest level is the root and the lowest the leaf.For a clustered index this is the data row, for a non-clustered a pointer to it or the clustered index entry.>> when a clustered index is created, is all that happens is the physical re-ordering of the data on disk?No - the data is ordered but also a b-tree is created to access the data. Othewise you would have to index scan to access the data.>> but where is this stored? A system table that SQL Server uses?It's stored in internal data - sysindexes gives a view of what structures are there and you can see the physical layout via dbcc page but it's not something that most people need to worry about. Get Inside SQL Server for more details.>> So what happens if you apply a covering index? Will 8 bytes be enough to hold the information?A clustered index holds a copy of the columns and a reference to the clustered index entry or rid in it's leaf level. A covering index refers to the query being used not to the index reference to the data row. It just means that the data can be retrieved from the index without recourse to the data row (for that query).>> As that possible?!Depends on the data and queries needed. Can be ok just to get the keys needed then get the data from those keys.I see a lot of systems that apply covering indexes rather than use the above method because sql server often table scans in that situation (can be due to out of date statistics). In v2005 you can include columns in the leaf level which don't expand the size of the b-tree to make indexes covering with less impact on efficiency.==========================================Cursors are useful if you don't know sql.DTS can be used in a similar way.Beer is not cold and it isn't fizzy. |
 |
|
|
Wanderer
Master Smack Fu Yak Hacker
1168 Posts |
Posted - 2006-08-14 : 09:37:40
|
quote: Originally posted by Antonio
quote: In a clustered index the leaf level is the data row (hence you can only have one on a table).
Okay this is where I get confused. What is the leaf level?! I mean, when a clustered index is created, is all that happens is the physical re-ordering of the data on disk? am I being stupid here?
NR's point about getting the inside SQL Server book would probably be a good way to go. However, I'll try and explain further.for a point of reference - SQL Server creates, and stores, all of it's data in PAGES (8K is size) ON THE FILESYSTEM. For example, the default database will have 2 files - an 'DatabaseName.MDF' and a 'DatabaseName.ldf' file. The ldf is the Log file, and for our purposes is not relevant. The MDF is the data file (I think MDF = Master Data File, but I could be wrong, and it isn't really relevant). The important thing is - that OS file is where SQL stores the data for a database.Inside that file, SQL defines the file structure. Inside that physical file, SQL manages structures of 8K pages that contain all the data. This includes the clustered indexes for tables with a clustered index. Now, SQL naturally needs to know where it's pages are, which pages are related, etc. This is maintained inside the page structure - the 8K page contains information about what strucutre it is part of etc.For a clustered index, the structure is several 8K pages, These pages create the B-tree strucutre that SQL uses to traverse to get to the actual data. An index typically has 3 'types' of pages: ROOT, Non-leaf (not sure if that is the correct technical term), and LEAF pages. This is what NR was referring to. In the B-Tree structure, the LEAF pages are the 'end pages' - there are no pages at a lower level. Because of that, they are the places where the data is actually held. The metaphor of B-Trees is a Tree. The root is the 'beginning point' of a tree, and the leaves are the 'end points'.For example: [RootPage] / | [non-leaf] [non-leaf] / | \ / |[leaf] [leaf] [leaf] [leaf] [leaf] The diagram I'm showing roughly represents a B-Tree. The root and non-leaf pages are used to determine which leaf page to get to - they contain pointers to those pages. When you get to the correct leaf page, the search inside that pages returns the cirrect RID, and - since it is the leaf page - the actual data, as inserted in the INSERT statem ent, resides there, and SQL returns it.A non-clustered index is similar, EXCEPT that the 'end nodes' of the B-Tree do not have all the data - they only have the data for the columns in the key that defines the index. They also contain the RID to be able to jump directly to the correct leaf node(s) of the clustered index. I hope that doesn't add to the confusion.quote: Originally posted by Antonioquote: A non clustered holds a copy of the columns being indexed and a reference to the data row.
Okay that's fair enough, but where is this stored? A system table that SQL Server uses?
see above explanation. All data is held inside that database file(s) (.mdf, or any additional files created - typically named .ndf's). SQL manages the internal structure of those files, and understands how to get to the correct pages to retrieve the data required.quote: Originally posted by Antonioquote: An index needs to be able to reference the row it is pointing at.If the clustered index is not unique then a unique dentifier (4 bytes) is added to the clustered index.
Understood.quote: If the table has no clustered index then the indexes will reference the rows in the data page via the rid (row identifier) (8 bytes - page address, file id, slot within page)
So what happens if you apply a covering index? Will 8 bytes be enough to hold the information?
8 bytes isn't the limit to the size of the covering index. the bytes NR refer's to are extra, to be used to get the the rest of the data not stored in the covering index keys. if a covering index has 3 of 5 columns of a table, then the 4 bytes are used to go to the leaf node of the clustering index IFF (if and only if) SQL decides to (a) use the covering index, and (b) the covering index doesn't satisfy all the data needs (let's say you wrote a selecty *) - it uses the reference to the clustering index to get the remaining 2 columns of data to satify the query.quote: Originally posted by Antonioquote: SQL Server efficiency is all about usage of memory. The fewer disk reads the faster things will run. Indexes are a means of reducing the number of disk reads. You also need to look at the code being executed as often it can be rewritten so that an index is not needed. Same goes for the schema design.
As that possible?! If you're writing queries which reference tables that have millions of records surely indexes on the columns you join on are vital no?_________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will.
yes - indexes are vitaly on any large table accesses, especially joins. I think NR is trying to say that indexes are not the only means of improving performance. for example, if you needs 2 columns from a 10 column table, but the current select is a select *, then a covering index on all 10 column is not appropriate. Change the SQL to select on the 2 columns, and then the index on those 2 columns would get used....*##* *##* *##* *##* Chaos, Disorder and Panic ... my work is done here!--I wonder if my 100'th post will be in this thread? |
 |
|
|
Antonio
Posting Yak Master
168 Posts |
Posted - 2006-08-17 : 11:13:43
|
quote: Originally posted by Wanderer
quote: Originally posted by Antonio
quote: In a clustered index the leaf level is the data row (hence you can only have one on a table).
Okay this is where I get confused. What is the leaf level?! I mean, when a clustered index is created, is all that happens is the physical re-ordering of the data on disk? am I being stupid here?
NR's point about getting the inside SQL Server book would probably be a good way to go. However, I'll try and explain further.for a point of reference - SQL Server creates, and stores, all of it's data in PAGES (8K is size) ON THE FILESYSTEM. For example, the default database will have 2 files - an 'DatabaseName.MDF' and a 'DatabaseName.ldf' file. The ldf is the Log file, and for our purposes is not relevant. The MDF is the data file (I think MDF = Master Data File, but I could be wrong, and it isn't really relevant). The important thing is - that OS file is where SQL stores the data for a database.Inside that file, SQL defines the file structure. Inside that physical file, SQL manages structures of 8K pages that contain all the data. This includes the clustered indexes for tables with a clustered index. Now, SQL naturally needs to know where it's pages are, which pages are related, etc. This is maintained inside the page structure - the 8K page contains information about what strucutre it is part of etc.For a clustered index, the structure is several 8K pages, These pages create the B-tree strucutre that SQL uses to traverse to get to the actual data. An index typically has 3 'types' of pages: ROOT, Non-leaf (not sure if that is the correct technical term), and LEAF pages. This is what NR was referring to. In the B-Tree structure, the LEAF pages are the 'end pages' - there are no pages at a lower level. Because of that, they are the places where the data is actually held. The metaphor of B-Trees is a Tree. The root is the 'beginning point' of a tree, and the leaves are the 'end points'.For example: [RootPage] / | [non-leaf] [non-leaf] / | \ / |[leaf] [leaf] [leaf] [leaf] [leaf] The diagram I'm showing roughly represents a B-Tree. The root and non-leaf pages are used to determine which leaf page to get to - they contain pointers to those pages. When you get to the correct leaf page, the search inside that pages returns the cirrect RID, and - since it is the leaf page - the actual data, as inserted in the INSERT statem ent, resides there, and SQL returns it.A non-clustered index is similar, EXCEPT that the 'end nodes' of the B-Tree do not have all the data - they only have the data for the columns in the key that defines the index. They also contain the RID to be able to jump directly to the correct leaf node(s) of the clustered index. I hope that doesn't add to the confusion.quote: Originally posted by Antonioquote: A non clustered holds a copy of the columns being indexed and a reference to the data row.
Okay that's fair enough, but where is this stored? A system table that SQL Server uses?
see above explanation. All data is held inside that database file(s) (.mdf, or any additional files created - typically named .ndf's). SQL manages the internal structure of those files, and understands how to get to the correct pages to retrieve the data required.quote: Originally posted by Antonioquote: An index needs to be able to reference the row it is pointing at.If the clustered index is not unique then a unique dentifier (4 bytes) is added to the clustered index.
Understood.quote: If the table has no clustered index then the indexes will reference the rows in the data page via the rid (row identifier) (8 bytes - page address, file id, slot within page)
So what happens if you apply a covering index? Will 8 bytes be enough to hold the information?
8 bytes isn't the limit to the size of the covering index. the bytes NR refer's to are extra, to be used to get the the rest of the data not stored in the covering index keys. if a covering index has 3 of 5 columns of a table, then the 4 bytes are used to go to the leaf node of the clustering index IFF (if and only if) SQL decides to (a) use the covering index, and (b) the covering index doesn't satisfy all the data needs (let's say you wrote a selecty *) - it uses the reference to the clustering index to get the remaining 2 columns of data to satify the query.quote: Originally posted by Antonioquote: SQL Server efficiency is all about usage of memory. The fewer disk reads the faster things will run. Indexes are a means of reducing the number of disk reads. You also need to look at the code being executed as often it can be rewritten so that an index is not needed. Same goes for the schema design.
As that possible?! If you're writing queries which reference tables that have millions of records surely indexes on the columns you join on are vital no?_________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will.
yes - indexes are vitaly on any large table accesses, especially joins. I think NR is trying to say that indexes are not the only means of improving performance. for example, if you needs 2 columns from a 10 column table, but the current select is a select *, then a covering index on all 10 column is not appropriate. Change the SQL to select on the 2 columns, and then the index on those 2 columns would get used....*##* *##* *##* *##* Chaos, Disorder and Panic ... my work is done here!--I wonder if my 100'th post will be in this thread? Thanks to you both. Now I understand the difference between clustered and non-clustered indexes. _________________________________________________________________________________________________________________________Inability is a disaster; patience is bravery; abstinence is a treasure, self-restraint is a shield; and the best companion is submission to Divine Will. |
 |
|
|
nr
SQLTeam MVY
12543 Posts |
Posted - 2006-08-17 : 12:56:52
|
| >> Thanks to you both. Now I understand the difference between clustered and non-clustered indexes. And probably now understand a lot more than most people that think they know the difference.==========================================Cursors are useful if you don't know sql.DTS can be used in a similar way.Beer is not cold and it isn't fizzy. |
 |
|
|
|
|
|
|
|