Return to Identity and Primary Keys
Identity and Primary Keys
Written by Rob Volk on 28 February 2001
If you are an Access user, or migrated from Access to SQL Server, you may think that a primary key and an Identity/Autonumber column are the same thing. Well, they’re not, and there are some very good reasons NOT to use an Identity column for the primary key on your tables.
I'll cover what's bad about Identity and show some examples of alternate key candidates. In Part II I'll go over instances where Identity isn't a bad choice--meaning there is really no other choice. I'll be using the term Identity in these articles, but all of these arguments apply to GUID columns as well.
A key is a value or combination of values that is unique for every row in a table. The difference between primary and ordinary keys is that there can be multiple keys, but only one primary key (think of all the things you like about your girlfriend/boyfriend, then pick the one you like most). You can use any data type or combination in a primary key, so long as the column(s) involved are declared NOT NULL (an index is automatically created on the primary key). Keys are also used to create and enforce relationships between tables; these are known as foreign keys. Foreign keys in the referenced (parent) table must be unique, and therefore are superb candidates as primary key.
Identity (or AutoNumber) is a column that automatically generates numeric values. You can set a start and increment value, but most people leave these at 1. A GUID column also generates numbers, but you cannot control its values. Identity/GUID columns do not need to be indexed.
An index is a structure in a table that orders the data. It allows the database to access data quickly. The two major kinds of index are clustered and non-clustered; clustered indexes physically store the data in the same order as the index. There can only be one clustered index per table, because the data can only be physically ordered one way (think of page numbers in a book, or names in the phone book). Non-clustered indexes maintain order on the index page only (just like the index in the back of a book), and provide a pointer to the page where the data resides. Indexes will be covered in more detail in Part II; for now they won't affect the choice of primary key.
Identity columns are bad because...
- They're not standard SQL. Most products have it but there's no consistent implementation.
- They can't be updated. This violates the relational data model (not fatal, but not good either). Duplicates can be accidentally inserted (fatal).
- They only create numeric values. GUID/NewID() are also numeric only, and are hard to read.
- Numeric values are not meaningful in many tables, and adding them complicates relationships between other tables.
Just to be fair...
- Plug-and-play key generation, much easier than writing an algorithm to do it.
- Small data space for a wide range of values.
- When ALL other key candidates are exhausted, Identity can be a lifesaver, especially when relating tables.
A quick rant about Access: I like it and I think it is a great program (hold on, the rant is coming), but it tends to do things for you that you should do for yourself. If you have not created a primary key on your table, DON'T let the computer add one. I think this is Access' worst feature; it scares newbies and misinforms them about primary keys. They get into the habit of adding an AutoNumber when they don't need it (I was one of them).
Many people have tables with some kind of ID column, and yet they still have unwanted duplicates in the table (removing dupes is a VERY POPULAR post in the SQLTeam forums). Lots of people forget that you can have more than one column in a key. If you have a table like this, use this "natural key" as the primary key instead of adding an Identity column. If Identity can't prevent dupes, don't use it.
By the way, you don't absolutely NEED a primary key. This sounds contrary to Joe Celko's statement "If you don’t have a key, you don’t have a table", but it isn't. He's more concerned about the relational definition of a table -- it requires every row to be uniquely identified by its primary key. If the data in your table doesn't guarantee uniqueness, you should add a column that will.
Duhhhh, then why not add an Identity column?
All it does is add a column of arbitrary and useless numbers, and there are plenty of data models that don't need 'em – imagine...Shakespeare picks up a flower catalog, and writes the immortal phrase, "A #154783 by any other name, uh, number, smells just as sweet." For something like this catalog, use Name AND Color to distinguish between roses. Or add their Latin classification. These may seem like arbitrary values, but they're data with MEANING, especially outside of this one catalog.
How about a real-world example?
This post is a good example of tables that don't really need an Identity column. It beautifully illustrates how Identity actually causes more problems than it solves. As another example, I have a simple Employees table at my job (it incorporates an old Excel phone list but adds more people to it):
|1||26||555-555-1212||Rob||Volk||SQL Stuff||Aged SQL Warrior||rvolk||jsmith|
|3||107||555-555-2112||Neil||Peart||Rhythm Section||Drum God||npeart||brich|
|5||NULL||000-000-0000||Jason||Newsted||Rhythm Section||Damaged Bass Player||jnewsted||NULL|
|6||99||NULL||Jodi Ann||Paterson||Fantasy||Playmate of the Year||jpaterso||hhefner|
|9||1||NULL||Ashley||Judd||Fantasy||Rob’s Dream Girl||yummygal||rvolk|
This second table (WebPages) is for employee web bookmarks; they can store these and access them from any computer:
Back to the Employees table. The only numeric column in it is the phone number (the ID columns are here for illustration only). Why not use this as primary key? Not everyone has a phone, sadly -- I wish they did! :) Names won't work: Mr. Newsted didn't leave Metallica to be an accountant, it's really some other guy. In the real table, there actually are two guys named Kevin Jones -- AND they’re both in Sales, so adding Department to the key wouldn’t help. Let's assume for now that I added the ID Identity column as my primary key; I'll come back to it in a second.
Let's do a quick exercise using these two tables. First, in the Employees table, using only the ID and BossID columns, identify each person's boss. Can't do it, huh? That was deliberate. Now in both tables, using the ID column only, identify which bookmark belongs to whom. Possible, but not easy. My point is to reiterate the meaningless nature of an Identity column's values -- imagine doing this with GUID's instead!
When I first put the Employees table together, there was no good primary key candidate, and I WAS going to use Identity until someone mentioned our company's Novell LAN user names. I didn’t need it in my table, but it turned out to be handy: they had to be unique for each user, everyone in the company gets one, and it's small (8 characters) and works as a single column. Now try those exercises using the Lan and BossLan columns. You'll notice a pattern to the values in the Lan column. It's descriptive enough for you to figure out who's who in WebPages without joining it to Employees.
Just for fun, try to figure out the primary key of the WebPages table. It's not a big challenge.
OK Teach, I will never use Identity again...
As promised, I will show some examples where Identity DOES make sense as a primary key, in Part II of this article. I can't stress enough that this should be your last option when designing your tables. It may not seem worth the effort to go to such lengths, but to paraphrase Ron Soukup, former SQL Server project manager, “Identity primary keys are for people who believe there’s never time to design a table right but there’s always time to do it over.”
Addendum - There will not be a Part II to this article. There are numerous discussions about using Identity as a primary key in the Comments section of this article here, if you are interested I recommend you read them by clicking the Comments link below.