Clustered Index vs. Full Text Index

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

General SQL Server Forums

New to SQL Server Programming

Clustered Index vs. Full Text Index

Author

Topic

Goldmember
Starting Member

9 Posts

Posted - 2008-06-18 : 21:18:58

Quick question about the primary purpose of Full Text Index vs. Clustered Index.

The Full Text Index has the purpose of being accessible outside of the database so users can query the tables and columns it needs while being linked to other databases and tables within the SQL Server instance.
Is the Full Text Index similar to the global variable in programming where the scope lies outside of the tables and database itself?

I understand the clustered index is created for each table and most likely accessed within the user schema who have access to the database.

Is this correct?

I am kind of confused on why you would use full text index as opposed to clustered index.

Thank you
Goldmember

GilaMonster
Master Smack Fu Yak Hacker

4507 Posts

Posted - 2008-06-19 : 03:14:00

Full text indexes enable you do do partial string matches very quickly. They allow things like searching for words or phrases, searching for inflectional terms (search for drive returns 'drive', 'drove', 'drives'), searching for synonyms (if you have a thesaurus file). It allows you to do weighted searching like search engines do.

--
Gail Shaw

jggtz
Starting Member

32 Posts

Posted - 2008-06-19 : 13:07:40

Also, something about Clustered Index (From : Brian Sherwin's Blog )

When you create a new table in SQL Server, you allocate some space to begin storing your data. This data goes on to 8K data pages. This means that our records will fill as much as they can the 8000 bytes, and then we will allocate another data page. Now this becomes important because when SQL has to read data pages from the hard drive, it becomes expensive in terms of I/O.

Now as you begin adding records, we will start grabbing individual pages (until you have 8 total data pages and then we start grabbing by extents-64k at a time). At this point, your data is stored in what is called a heap. Your records are stored where they fit in the data pages and in no particular order. Therefore, when your query requests data, SQL performs a table scan to retrieve your data. This table scan simply reads every single row, of every single data page for any record that matches your criteria. If you have a million rows of data, this might take a while.

Enter the index.

So now you decide you are going to create an index on your table. At this point, you look over your data and try to find something that will limit the rows of data that SQL Server will have to read in order to find your record. Several strategies should come into play (though most people just take a “hunt and peck” type approach or a “gut feel”) when determining your indexes. Some questions that should be asked:

How selective is the data I am searching--how unique is the data I am looking for

How dense is the data I am searching--how many times does the data repeat (duplicates)

Depending on the answers to these two questions, will determine which index SQL Server will use, if any.

So when we create an index, hopefully we have a fairly selective column that will help us find the exact record we are looking for. Since we will talk about clustered indexes in a minute, we will first assume a non-clustered index. With a non-clustered index, all we are really creating is a B-Tree structure on top of our heap. The pages in our index are the same size (8k) pages we have for our data. The leaf level pages in our non-clustered index are actually pointers into the heap. If we are looking for a single row, we can walk through the B-Tree structure to find the exact data page we need. By doing this, we reduce the number of data pages we have to read, and we see some savings in the time to produce the results of our query.

However, what if we are searching a range of data with a "BETWEEN" operator in SQL. In this case, SQL Server will find the lower bound item by using the B-Tree and use the pointer to get to the actual data page. Subsequently, SQL Server will walk through the index's leaf pages to the upper bound of the query. The problem here is that because our data may be spread out over lots of different data pages, the physical I/O could still be quite high. Depending on the selectivity of our query (as determined by the statistics that SQL Server gathers), the query optimizer may not decide it is cheaper to still do a table scan to answer the query.

Enter the Clustered Index.

OK, so now we are doing range type searching using the "BETWEEN" operator. What if we could order the actual data pages by the data we are performing the range search on? This way, we could use the B-Tree to get to the first record, and from there read sequentially through the data pages to the last record. Interestingly enough, this is exactly what the Clustered Index does. If gives us an Ordered Heap. The data is physically stored in the order of the clustering key. Finally, SQL Server builds a B-Tree structure on top of the newly structured data with pointers to the data page.

Now, the thing to keep in mind (that I alluded to in an earlier post) is that the clustering key should be something that builds sequentially. It should also have something to do with how you go after the data.

A last name makes a horrible clustering key. The reason is that as the data builds, it will keep the data in order. Now imaging you've created a telephone directory. Assume that every page is full from top to bottom. At the last minute, someone finds a contact card that fell under the desk. If you have to insert this person with the last name "O'Hara", this will affect all of the records that come after it. Now you have to re-layout the data on all of the following pages. In a similar fashion, SQL Server moves some data around to accommodate the change. Since SQL Server is interested in performance, it doesn't change all the data pages, but it does split the data pages into two an inserts the record in the correct place. Now, if this happens over the course of say one million records, you may see a lot of page splits which leads to database fragmentation. Now rather than looking at a data page that is full of data, we look at a lot of pages that are half full. For OLTP systems this fragmentation is good. For reporting or analytics, this fragmentation can be bad.

A couple examples of good clustering keys would be dates (if frequently searched), Identity columns (auto-incrementing). In this case the data is always appended to the end of the database, and the pages with add data until they are full. Page splits will be very rare.

Implications of Using a Clustered Index.

First, if we are ordering the physical storage of the data based on the clustering key, we can now understand the rule that there can only be one Clustered Index on any table (it can only be ordered one way, right?).

Second, since we have now ordered the heap, all of the pointers in any existing non-clustered index must be changed. Rather than creating a pointer to the data page, the non-clustered index will point to the clustering key. Now when you perform a query that uses a non-clustered index, SQL Server will first walk the B-Tree of the non-clustered index to find the clustering key value. SQL Server will then walk the B-Tree of the clustered index to find the physical data page.

Implications of Changing the Clustered Index

With the above in mind, some things to keep in mind:

When you build a clustered index, always build it first. Since your non-clustered indexes will be built on the clustered, if you build a non-clustered first, SQL Server will end up building it twice.

If you change a clustered index, the data will be reordered to match the clustering key.

When you build a clustered index, generally, you will want one and a half times your data available in free space. Why? First, SQL Server performs the indexing as an online operation--meaning the users can continue to use the data while indexing is being performed. Therefore, SQL Server will make a copy of the data in the new structure (the one). Second, since you are building a B-Tree structure on top of that, you will need some additional space (and a half) to accommodate this. Once the index is built, SQL Server will free up the space used by the original data.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources