An Argument Against IDENTITY - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

Old Forums

CLOSED - General SQL Server

An Argument Against IDENTITY

Author

Topic

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2003-03-21 : 11:24:34

The natural key / unnatural key debate has gone around and around. My intention here is not to regurgitate old arguments. Instead, I would like to present a new argument against unnatural keys and debate it only.

First, I want to clear up some terms. Joe Celko defines different types of keys in his article, Keys and History. By Joe’s definitions, SQL Server’s IDENTITY column would be a "uniqueifier" key. I’ve always call IDENTITY a surrogate. For this discussion, lets just call it identity.

The argument that I would like to make is this: an identity primary key can cause update anomalies and data inconsistencies. Specifically, by adding an identity to a table, you can falsely normalize a poorly designed table and mask data integrity issues.

Consider this database.


create table dbo.parts (
    part varchar(10) not null,
    constraint pk_parts primary key clustered (part) )

create table dbo.warehouses (
    warehouse varchar(10) not null,
    constraint pk_warehouse primary key clustered (warehouse) )

create table dbo.partlocations (
    part varchar(10) not null,
    warehouse varchar(10) not null,
    quantity int not null,
    fax varchar(20) not null,
    constraint pk_partlocatoins primary key clustered (part,warehouse),
    constraint fk_partlocations_part foreign key (part) references parts(part),
    constraint fk_partlocations_warehouse foreign key (part) references warehouses(warehouse) )
go

In what normal form is dbo.partlocations? It is in 1st normal form. It does have a primary key, but because of the partial dependency it does not satisfy 2nd normal form. Fax is a fact about a subset (warehouse) of the primary key. For completeness to my arguement, the fundamental problem with this is:

The fax number for the warehouse is repeated in every row that refers to a part stored in that warehouse.

If the fax number of the warehouse changes, every row referring to a part stored in that warehouse must be updated.

Because of the redundancy, the data might become inconsistent, with different records showing different fax numbers for the same warehouse.

If at some point in time there are not parts stored in the warehouse, there may be no row in which to keep the warehouse’s fax number

Now, consider this database.


create table dbo.parts (
    part varchar(10) not null,
    constraint pk_parts primary key clustered (part) )

create table dbo.warehouses (
    warehouse varchar(10) not null,
    constraint pk_warehouse primary key clustered (warehouse) )

create table dbo.partlocations (
    rowid int identity(1,1) not null,
    part varchar(10) not null,
    warehouse varchar(10) not null,
    quantity int not null,
    fax varchar(20) not null,
    constraint pk_partlocatoins primary key clustered (rowid),
    constraint fk_partlocations_part foreign key (part) references parts(part),
    constraint fk_partlocations_warehouse foreign key (part) references warehouses(warehouse) )
go

In what normal form is dbo.partlocations? It is in 3rd normal form*. By definition, there is a primary key and there are no partial or transitive dependencies. A database in 3rd normal form should minimize update anomalies and data inconsistencies. But in realitiy, we have added a contrived column to our data and obscured the data integrity issues inherent in our design.

So my argument is not that every identity column compromises data integrity, but rather that the use of identity columns can falsely normalize a table and give a false sense of protection against update anomalies and data inconsistencies.

*Note: those paying attention will argue that the identity is not part of the logical design and the normal form of the database is determined on the logical level. Therefore partlocation is in 1NF in the first version and is not a table in the second version because it has no primary key. I will submit to that argument in agreement. My post here is directed at those who would consider my second version of dbo.partlocatoins to be in 3NF. We could debate whether or not an identity can be a primary key, but that’s another thread.

Jay White
{0}

chadmat
The Chadinator

1974 Posts

Posted - 2003-03-21 : 15:24:31

Actually,

The second version is in 2NF, not 3rd. Fax (Non Key attribute) is dependent on Warehouse (Non Key Attribute).

-Chad

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2003-03-21 : 15:27:27

Well, that makes a good point.

Is it possible to have a table with an identity primary key in anything higher than 2nd?

Since its doubtful that a number (like 47) has any facts about it being stored, everything in the table will have a transitive dependency on the candidate key ...

Jay White
{0}

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-03-21 : 15:35:23

we use the IDENTITY feature of a column to apply a business rule

CREATE TABLE PROJECTS (
PROJECT_ID INT NOT NULL IDENTITY(50000,1) PRIMARY KEY,
REQUESTER_ID INT NOT NULL REFERENCES USERS(USER_ID),
NAME VARCHAR(256) NOT NULL CHECK (LEN(NAME) > 0),
SCOPE TEXT NOT NULL
)

that is a business entity ... the project id is an incremental number starting at 50,000 ... it must be unique as not to confuse this project with another project ...

then a project has a requester (the person that created it) ... a title/name ... and a scope ...

how can you normalize this more than that ?

chadmat
The Chadinator

1974 Posts

Posted - 2003-03-21 : 15:35:52

Well,

It's been a while since I have had "academic" discussions like this, so I am liable to get my Normal Forms wrong. However, If you put fax in the warehouse table is it not then in (at least) 3NF?

Because quantity is not dependant on anything (except the combination of warehouse and part).

-Chad

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-03-21 : 15:45:17

quote:

Well,

It's been a while since I have had "academic" discussions like this, so I am liable to get my Normal Forms wrong. However, If you put fax in the warehouse table is it not then in (at least) 3NF?

Because quantity is not dependant on anything (except the combination of warehouse and part).

-Chad

Chad -- you spoke my mind word for word ... the true problem in either case is the "fax number" field in that table ...

- Jeff

aiken
Aged Yak Warrior

525 Posts

Posted - 2003-03-21 : 16:53:38

In the examples given, putting the identity key in the partlocations table is certainly a bad idea, but it would take a real newbie to do that.

However, I would personally use the following schema:


create table dbo.parts (
    i int identity,
    part varchar(10) not null,
    constraint pk_parts primary key clustered (i),
    constraint uk_parts unique (part) )

create table dbo.warehouses ( 
    i int identity,
    warehouse varchar(10) not null,
    fax varchar(20) not null,
    constraint pk_warehouses primary key clustered (i),
    constraint uk_warehouses unique (warehouse) )

create table dbo.partlocations (
    i_parts int not null,
    i_warehouses int not null,
    quantity int not null,
    constraint pk_partlocatoins primary key clustered (i_parts,i_warehouses),
    constraint fk_partlocations_part foreign key (i_parts) references parts(i),
    constraint fk_partlocations_warehouse foreign key (i_warehouses) references warehouses(i) )

...that way, part names and warehouse names can change without forcing an update of every partlocation row. Also, the fax number is properly associated with the warehouse and easy to get with a join.

So, while I guess I agree that it might be possible for someone to think that just adding an identity column to a table somehow normalizes it, I can't believe that anyone with real experience would make that mistake.

There are also times where an identity-based approach is mandatory. Consider this scenario: a web advertising system places banner ads on a wide variety of websites, and promotes a wide variety of websites. In order to track sales and clicks, the central system has to know which website provided a click, where the click went to, and when it happened. The promoted sites have to know the site the banner was run on in order to pay comissions.


create table banner_clicks (i int identity,
                            i_advertisers int NOT NULL,
                            i_sites int NOT NULL,
                            when datetime DEFAULT(GETDATE()))

...let's see *that* work without an identity (or other "uniqueifier" like a GUID).

I agree that blindly using identities everywhere is silly; if there's a natural primary key, that's the thing to use. However, especially for logging purposes, identities are often critical because no other field is guaranteed to be unique, and passing whole combinations of fields in and between applications is error-prone and inefficient (and sometimes there's no combination of fields guaranteed to be unique, anyways).

Cheers
-b

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2003-03-21 : 19:26:54

Jeff and Chad ... you two are obviously too good at solving problems

. Obviously, the designer should move fax to the warehouse table ... My point is simply that this is more obivious when using natural keys than it is when using identities. With the natural key, it is easy to identify the partial dependency. With the identity, as chadmat points out. The rules of normalization are tossed out the window anyway, and the partial dependency becomes obscured.

Onamuji, your table is at best only in 2nd normal form, as chadmat points out. You have a transitive dependency between scope and name. Name should be your primary key because of that. But you say, "but I could have two projects with the same name", to which I respont, "well, then, you don't have a table, because you don't have a primary key." But this is exactly the debate I didn't want to get into. Let's discuss my post and theory ... we can discuss yours in another thread.

The most valid arguement here is from aiken basically saying, 'Someone with experience wouldn't make that mistake.' I agree with you. But why is that? How do you teach that? The rules of normalization are documented, proven and teachable. Your rules are folklore and guesswork.

Plus, I'll never believe you can't find a natural key. If two rows are truely identical, then either you have poorly designed your database or you don't have two things to store ... you only have one.

Don't ask the question, if you know what my answer is going to be already

...

Jay White
{0}

aiken
Aged Yak Warrior

525 Posts

Posted - 2003-03-22 : 00:08:09

quote:
Plus, I'll never believe you can't find a natural key. If two rows are truely identical, then either you have poorly designed your database or you don't have two things to store ... you only have one.

I've got to disagree with you here. 90% of the time, that's true -- it's better to have some kind of "counter" column.

However, especially for logging applications, there are times when you get identical rows. For instance, a high volume website may very well get two pageloads of the same page from users with the same web browser at the same millisecond from the same AOL proxy server IP address. It's not accurate to throw one of them out.

Even without getting that extreme, who'd want a primary key that's on a 15 byte varchar IP address, 8 byte datetime, 255 byte HTTP_AGENT, and 40 byte varchar for pagename? Especially if, for some reason, you had to refer to that row elsewhere -- like, perhaps, for error logging. Why duplicate 318 bytes as a key?

I'll stick by my guns: identity keys can be abused, but they are often necessary, depending on the type of data being stored.

Cheers
-b

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-03-22 : 00:26:08

Just a quick comment:

quote:
that way, part names and warehouse names can change without forcing an update of every partlocation row

The problem with that statement is that the ability to change a non-key attribute like warehouse name is considered to be a convenience or necessity, and the usual screaming argument I'd make is "but it's NOT THE SAME WAREHOUSE THEN!!!!" While that may not strictly be true, the greater problem is illustrated in the OTHER thread recently dredging up identity, about having an engineer or consultant be able to change their name too.

Well, here's a problem: Elizabeth Taylor can change her name to Captain Kangaroo, but doing so will not MAKE HER Captain Kangaroo. That's a huge problem! Is it just a name change, or is Captain Kangaroo taking over Liz's sales of White Diamonds? These are two completely different situations, and the database NOT ONLY ALLOWS IT but CAN'T TELL THE DIFFERENCE BETWEEN THEM. It's one thing to permit the changing of address, phone number, email address and such, but there needs to be some greater consideration of which attributes/columns SHOULD BE ALLOWED to change.

An even greater illustration of how that could really FUBAR your data: suppose Captain Kangaroo already was in the database under another ID? What's the scoop then? Did he clone himself? Is Liz still Liz but calling herself Captain? Did they both achieve such a level of spiritual union that they are now blissfully one entity, one mind, one soul, one spirit, and the stupid DBA just forgot to delete the dupe row? You can't tell, and it's all because you allowed the name column to be updated. And let's not talk about what happens to Liz's sales and commission history if cascade update is turned on...hell, even if it's off...the commissions now go to Bob Keeshan, and always HAVE.

If anyone thinks that these are minor concerns that rarely happen, I offer the immortal words of Santa Claus: HO HO HO! Stick around the database world another 15 minutes or so, and you'll see how often it really DOES happen, and how hard it is to fix it.

Joe Celko (I think it was him) had quoted an old saying in one of his books about the family ax: it was handed down from one generation to another, and while the handle had been changed several times, and the head replaced too, it was still the same old family ax!

It begs the question: if everything else except the key can be changed, what exactly is that key identifying anyway?

Well, so much for the QUICK comment.

aiken
Aged Yak Warrior

525 Posts

Posted - 2003-03-22 : 01:29:54

quote:
suppose Captain Kangaroo already was in the database under another ID? What's the scoop then? Did he clone himself? Is Liz still Liz but calling herself Captain?

In this scenario, yes, Liz is still Liz but *is* calling herself captain.

If you're ever worked with a personnel database, you know it's not at all uncommon for peoples' names to change. Would you really want to update every sales record, payroll record, phone record, etc... just because someone in the company got married?

That's a perfect case for identity records. That way you've got "employee #1234", and they are *always* employee #1234, regardless of whether they change their name to Captain Kangaroo, move to another city, or have a sex change and then get married.

As for the "well, is it still the same _____?" argument: I cannot *tell* you how many times I've seen branch office names or part numbers change because of merger / acquisition / business change. They are absolutely the same thing, they just need to be called something else from now on.

There are other ways to do it -- keep the names the same and use a simple table to map "warehouse name" to "current warehouse name" -- but there's a lot to be said for *not* tying the primary key to something like a name that can (and will, eventually) change.

Cheers
-b

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-03-22 : 08:35:59

quote:
In this scenario, yes, Liz is still Liz but *is* calling herself captain.

Unfortunately, I wasn't actually asking the question, I was pointing out that as illogical as it seems, Captain Kangaroo CAN clone himself, and the CAN merge with Liz into one being, and there's nothing in your data that indicates what happened. I'm not debating whether using a surrogate key makes it easier to do that, I'm questioning the wisdom of using that as a validation to use a surrogate key.

YOU know that when an employee gets married and changes their name that they're the same person, BUT some degenerate database corrupter could simply change the data indiscriminantly, and what looks like a name change is really a personnel reassignment (which may or may not be permitted, according to the "business rules"...they do exist outside of the computer too!)

It has nothing to do with making it easy to change someone's name, that's not the problem; it's that allowing such a feature allows someone to corrupt the data as well. Simple test:

UPDATE Personnel SET FirstName='John', LastName='Smith'

You can't tell me that being able to run that is a desired feature. Everyone has run at least one unwanted UPDATE without a WHERE clause and couldn't restore a backup to fix it.

Since name is not part of a key anymore:

quote:
So my argument is not that every identity column compromises data integrity, but rather that the use of identity columns can falsely normalize a table and give a false sense of protection against update anomalies and data inconsistencies

This last example demonstrates perfectly what Jay is talking about. Most designers simply add the identity column, sing the praises of its ability to allow changes without having to redo everything, and then rip their hair out months or years later when they have tons of dupes and garbage data. Take a look at all the posts on SQL Team for "how can I removed duplicates"? You'll find NOT ONE of them used some kind of natural key, and 99% used identity or GUID.

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2003-03-22 : 10:10:01

quote:
but there's a lot to be said for *not* tying the primary key to something like a name that can (and will, eventually) change.

Couldn't agree more. A fact that is likely to change is not a good choice for a natural key. A fact that might change may be an acceptable choice if the change can be managed and integrity can be enforce.

Key selection is probably the most important part of database design (physical or logical). That's why we have normalization. There are specific documented rules for design. Using an identity column is tantamount to saying, "Codd be damned ... I don't need these stinking rules ... I too smart to make mistakes." That may be true for some, but I think as a field, we should push people to learn and practice normalizatin first, and cowboy design once they understand and can apply the theory.

Jay White
{0}

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-03-22 : 11:56:19

isn't the IDENTITY property of an INT data field just a SQL 2k thing for the SEQUENCE feature or SQL-92(? i think that's the one) ... I believe its going to be usable in yukon ... sometimes business processes define sequential numbering of items ... project numbers, appropriation request number, purchase order number ... same thing so what if a number increments at some given interval ... if the number is part of the business process and not something you make up to keep track of rows in a database ... that's fine ... ... morning caffine craze...

aiken
Aged Yak Warrior

525 Posts

Posted - 2003-03-22 : 13:24:59

quote:
UPDATE Personnel SET FirstName='John', LastName='Smith'

You can't tell me that being able to run that is a desired feature. Everyone has run at least one unwanted UPDATE without a WHERE clause and couldn't restore a backup to fix it.

Oh, I hear ya. However, this is just as strong an argument for only allowign data access through stored procedures as it is for your actual point

. Besides, even using the natural key, unwanted updates can happen, just not on the primary key. If you say they key is FirstName,LastName, there's nothing to keep someone from running

update personnel set phone_number='2125551212'

...so I'm not sure that the "natural key as a protection against bad updates" argument holds water

quote:
Key selection is probably the most important part of database design (physical or logical). That's why we have normalization. There are specific documented rules for design. Using an identity column is tantamount to saying, "Codd be damned ... I don't need these stinking rules ... I too smart to make mistakes." That may be true for some, but I think as a field, we should push people to learn and practice normalizatin first, and cowboy design once they understand and can apply the theory.

Ok, I strongly agree with your statement about key selection, but I still disagree with you conclusion you get from it. But, hey, I've got an open mind. So please help me see the light by explaining how you'd use a natural key in the banner logging scenario, where the table (minus identity) looks like this:

create table banner_clicks (when datetime defailt(getdate()),
              banner id int not null,
              ip_address varchar(15) not null)

Feel free to add other common logging columns (HTTP_USER_AGENT, etc). The idea is that you need to log banner clicks and have a way to refer back to the click that generated a sale for analysis purposes.

Remember that, especially with AOL, it's completely possible that two different people will click on the same banner at the same millisecond from the same IP address (and potentially the same browser, etc). Most of the time that won't be the case, of course, but on a busy site, it *will* happen.

Even if you change the scenario and add some kind of user ID ("the person has to be logged in"), it's *still* possible that two people will be using the same user ID to do the same thing at the same time. They shouldn't be, of course, but you can't let that kind of funky external behavior cause DB problems or corrupt sales data.

I'm definitely open to seeing another way here, but so far, it just doesn't seem right... at least, not for every case.

Cheers
-b

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-03-22 : 14:02:01

I'll throw even more fuel on this fire: why have a key on that table AT ALL???? Disregarding the relational heresy of that statemet for a moment

lets look at the facts. It's logging events and storing certain attributes of those events. The events happened, even if you can't identify them specifically. The fact that 2 or more AOL'ers hit the same banner at the same time routed through the same server doesn't negate the fact that 2 clicks occurred.

I'm trying to see what value an added identifying column would have for that table. You could say they're dupes, but they really aren't. In any case I wouldn't care if I can uniquely identify them, and I can't do it anyway, even with an identity or GUID! I'll never join that identifier to anything else, or use it to populate another table. All of the the attributes I'm interested in are in non-key columns.

This is a classic heap table, otherwise known as a file. Technically you could use a SQL table for this, but ultimately it's just a file, and if you used fixed-length columns and removed the data page header info, you could do a flat sector copy from the hard disk and get the same result as a flat text file of the same data. You'd get some benefit by indexing it for searching and querying, but it's pretty pointless to insist that every row of this table be uniquely identifiable. Adding an identifier to this table will only waste space. Take a look at application log files and see (SQL Server log file, perfect example); there's nothing in it to uniquely identify anything, and you don't feel the loss.

In fact, this table would be SECONDARY to the indexes on it! It would literally only serve to keep the index key(s) up to date and searchable. The optimizer would be better off using only the indexes and never accessing this "table" in any way, and it's no great trick to get SQL Server to do so.

How about that? An index card that contains more useful information than the book it's cataloging?

Edited by - robvolk on 03/22/2003 14:05:46

nr
SQLTeam MVY

12543 Posts

Posted - 2003-03-22 : 21:24:43

I'm too inebreated to read this thread.

There are cases where the tables have data which form candidate keys. These should usually be defined as unique indexes. There are cases when this is a bad idea but ...
There are also cases where tables do not have data which form candidate keys. In this situation an artificial key should be defined (identity or whatever) - otherwise you will probably have access problems - and it isn't a database (by some definitions).

Now pointing out a situation where identities are a good idea or a bad idea does not advance the argument - it just covers that situation.

There is also a difference between the logical and physical model. You can have a candidate key which can't be implemented as a unique index due to performance constraints. Also you can have a situation where an identity could not have a unique index and a guid would slow donwn processing too much. OK the latter sentence is unlikely to be true but is theoretically possible.
If you think about it I am sure you can think of examples.

Anyway to sum up.
Identities/artificial keys - sometimes necessary, sometimes bad.
Natural keys - usually possible, sometimes bad.

A database on which every table has an identity would be a concern - but it would just complicate queries - the entity relationships could still be valid. And it could just be a valid representation of the analysis.
You could say the analysis is inadequate or the domain poorly defined - but these are implementational constraints.

Bottom line - can the system work.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-03-22 : 21:46:57

I have one gripe here...

The notion of a non-changing Primary Key. It is obvious that this started from the fact that early implementations of RDBMS had no notion of transition (Cascading) transactions. It is actually a requirement that all attributes MUST be updateable to be considered a relation. Mind you, you may a business rule that forbids a particular attribute to be updated, but that is an extra constraint that should be added to your business model.

Since SQL 2K, this argument (except in some cases ie circular references), IMHO is wrong. In most real-world cases, a Primary Key update is usually considered an extra-ordinary circumstance and hopefully, you have will some kind of confirmation step...

"Are you really sure you want to update this Licence Plate Number"..YES... "Are you really, really sure?" YES Please

Personally, when it comes to key selection, I only consider uniqueness*

*Actually implementation subject to change with further requirements/consideration

DavidM

"SQL-3 is an abomination.."

nr
SQLTeam MVY

12543 Posts

Posted - 2003-03-22 : 22:24:04

Still under the influence but...

A primary key identifies a row. That's what differentiates a alternate key from a primary key.
Therefore to update a primary key you must delete the row and insert a new one.

This is a personel definition of a primary key but if you allow them to update then it can complicate things.
How would you write an audit trail trigger if you can't join the inserted to deleted tables on the primay key? Suspect you would have to audit trail every update as an delete and insert and allow for similar operations in triggers.

==========================================
Cursors are useful if you don't know sql.
DTS can be used in a similar way.
Beer is not cold and it isn't fizzy.

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-03-22 : 23:56:35

quote:

Therefore to update a primary key you must delete the row and insert a new one.

That is the only way to update a Primary Key. Thankfully, SQL Server does this automagically without firing the respective insert and delete triggers (If they exist)

Auditing only becomes a problem in multi-row Updates, but it is easy enough to stop multi-row PK updates while still allowing multi-row non-PK updates.

DavidM

"SQL-3 is an abomination.."

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-03-23 : 09:42:21

shouldn't the IDENTITY property of a column be looked at in the same regards as a CHECK constraint ... its a method to constrain the database to business rules and processes ... though people abuse it ... that's their fault and problem, not the existance of the IDENTITY property of the column ... though I will agree there are times it shouldn't be used, however, no logical design should contain a sequence of numbers (IDENTITY) unless it is in support of a business rule and/or process (my companies service request numbers are sequential and unique, it is a rule and a process, so in my logical design i used the IDENTITY to constraint the data to this)

if everything was logical, women (my wife) would be in lots of trouble

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources