Article: The Case for the Surrogate Key

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

Site Related Forums

Article Discussion

Article: The Case for the Surrogate Key

Author

Topic

AskSQLTeam
Ask SQLTeam Question

0 Posts

Posted - 2002-08-09 : 12:17:58

There is a simple, inexpensive, uniform element of database design that you are likely avoiding in favor of a complex, costly, and inconsistent one. This element is the surrogate or substitute primary key. It seems that designers avoid these independent keys like the plague. Instead, all but the most basic business entities are given keys made up of some series of attributes, borrowed keys, and sequence numbers . It was only after experiencing many problems at the hand of intelligent keys that I, too, became a believer in the flexibility and stability afforded by the substitute key. This is a pretty good article on when you can use a surrogate key. And I thought we didn't get enough discussion on this from Rob's article.

Article Link.

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-08-09 : 13:09:19

okay....

quote:
Proper primary key design recognizes that there are two unique keys to be considered. Database designers make a critical mistake when they fail to recognize the distinction between these keys. The logical unique key distinguishes entity occurrences from others to avoid duplicates before entry into the database. That is why this key is typically composed of attributes and relationships that the business can recognize as distinguishing occurrences logically from others. The primary key, however, is a physical design element that uniquely identifies rows after they are in the database. It is used in creating foreign keys in other tables in order to establish the logical relationships in the data model. In fact, it is best to consider a primary key's primary purpose to be establishing relationships in the database!

This paragraphs quite eloquently explains the biggest reason why to NOT use surrogate keys. First and foremost, the horse puck about PKs existing in the physical design layer only can be quickly resolved by looking at rules 1 and 2 of Codd's 12...

quote:
Rule 1: The Information Rule

All information in a relational database is represented explicitly at the logical level in exactly one way—by values in tables.

Rule 2: Guaranteed Access Rule

Each and every datum (atomic value) in a relational database is guaranteed to be logically accessible by resorting to a table name, primary key value, and column name.

But that falsehood aside, if you already have a column (or composite) that uniquely identifies a member in your logical design (the first key mentioned), why do you need to contrive/dream up/invent a second surrogate key to do the same job? Reading this article, doesn't answer this question. Additionally, viewing the PK as a tool or attribute of a relationship has absolutely no relations theory ground to stand on. This entire article is "derive[d] from the definition of the primary key" which deviates quite radically from established Relational Theory.

No look at Figure three.

I have never developed in DB2, but it seems to me that foreign key relationship pictured Figure 3B between Dependent and Table A and Dependent and Table B would be impossible to inforce with a physical foreign key constraint. Maybe this isn't true for DB2, but sure is for SQL Server. Can you image the trigger you would need to write to enforce this? Can you image if the biz rules dicated adding a Table C later?

quote:
We would do well to remember that a smart key is a dumb thing to do, while a dumb key is a smart thing to do.

I have no idea what this means.

This article would have been much better if it has said word one about performance, because on a logical level, I think it hold no water. And on a physical levle, I think the decision rule must be performance.

IMHO, there is no such thing as a Surrogate Key on the logical layer. In fact, a Surrogate Key is purely a non-relational tool used in a physical implementation to make up for shortcomings in the DBMS software's ability to implement the Relational Model in a way that performs (...been reading a lot of Pascal lately, can you tell

). You never NEED a surrogate key. If you DO, you aren't normalized. If you get better performance out of a surrogate than a natural key, then feel free to adulterate the relational model. But by all means, try the natural key first. Don't believe this article and don't believe me. Investigate the issue for yourself. Test different options and see what works best for your physical implementation.

Jay White
{0}

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2002-08-22 : 14:37:51

OK, time to wade in on the subject... First, let me say that I used to think the relational theory was great for the academic world, but that I was doing pretty darn well in the corporate world without having a complete understanding of it. I still think I'm doing pretty well without fully understanding the underlying theory, but I have since come to appreciate the value it can provide thanks in large part to Page47 (Jay), setbasedisthetruepath (Jonathan), and Arnold Fribble. I have to say that Arnold's solution to the Yak Problem (RC3) REALLY opened my eyes.

And trying not to rehash the grand debate, let me just say that I am in favor of surrogate/meaningless keys in many situations. I cannot necessarily justify it based on performance, as Jay mentions here, but more out of convenience (or lack of knowledge). But, I have begun to move more toward the middle ground than I used to be, and am using meaningful keys more often.

With that said, I have trouble with this article, perhaps because of the examples used. First of all, I would rarely if ever use a Date field as part of a Primary Key, except for a calendar. I'd have to analyze the needs long and hard before using it in something like the customer orders table because it just seems so likely to me that somewhere down the road you're going to get more than one order per day from the same customer. Now, whether order should have its own single surrogate key, or a combination key of customer number and something like sequence number is debatable. Is one any more meaningful than the other? Is Customer ID actually meaningful? I'd argue against a pseudo-meaningful customer ID like "first four letters of customer name..." or something like that because the data that the rule is based on may change, and then what?

In fact, here's a real-life scenario where I'd argue a surrogate key may be very useful. I'm doing contract work for a city government. The government is broken up into numerous departments. For various systems we need to identify which department that records relate to. So, a table of Departments is needed. What should the key to this table be? One might first suggest the department's acronym (you know they LOVE acronyms here...) Well, if you look at recent history here, you'll see that departments often change names (and thus acronyms) with the change in mayor. In fact they may change names multiple times with the same mayor. So, because the department name changes so often, I'd argue that a meaningless key would be really useful here. This solves one problem, but still leaves another challenge which is the massive re-orgs that merge and divest departments causing additional headaches...

So, in the end, I am using meaningful keys more and more, but still believe strongly in the value of meaningless "surrogate" keys in many situations. And as Jay said...

quote:
Don't believe this article and don't believe me. Investigate the issue for yourself. Test different options and see what works best for your physical implementation.

Sitka
Aged Yak Warrior

571 Posts

Posted - 2002-08-24 : 17:36:29

Contemplating the business rule, workflow aspects of this
Figure(s) 1x and surrounding argument seemed valid to me.
I'm working in ERP/MRP and the concept of building a Work Order, Job,
Sales Order etc. seem to fall victim to getting out of sequence. Often is the case where peoples work get focused on getting a jump on what will be required in the near future. The model being that a Work Order for example needs a spot to begin, like maybe a 'ghost' customer on a guessed at date with a yet to be confirmed Work Order number. The implication is that there can be a body of work that goes into a unit (or set) that "IS" this future work order but then the logical (fabricated up to this point) key would get ripped out from under that set and reassigned. It seems in this case the surrogate key is the most natural fit.

Voted best SQL forum nickname...."Tutorial-D"

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2002-08-24 : 20:14:42

Jay,

Pascal and Date really produce some good memes..
One being that the difference between physical and logical is not trivial.

IMHO only use a surrogate when there is NO natural key.

The performance side for natural keys in production seems to be a hindrance when I would argue the opposite.

Here is a subset of the schema (Just the keys) I have for my farm.
Foriegn Key definitions omitted for brevity.


Create table Paddocks (Paddock varchar(50) Primary Key)

Create table Trees(Paddock varchar(50), Row int not null, Position
 int not null Contraint PK_Tree Primary Key (Paddock, Row, Position))

Create table Trees_History(Paddock varchar(50) not null, Row int not
 null, Position int not null, HistoryDate Datetime not null  
Contraint PK_Trees_History Primary Key (Paddock, Row, Position, HistoryDate))

Create table Trees_Work(Paddock varchar(50) not null, Row int not
 null, Position int not null, HistoryDate Datetime not null, Work 
varchar(50) not null  Contraint PK_Trees_Work 
Primary Key (Paddock, Row, Position, HistoryDate, Work))

On first look it looks like a lot of redundant data. The eky is getting bigger and bigger..Which leads to "needs more space, integers join faster.." thinking.

I hope it is not too hard to picture the surrogate version for this schema...

It needs more space physically there is no doubt, but which schema can answer the following question WITHOUT a single join?
Example.
"What Paddock has the most work done per tree.?"

Integers may join faster but not as fast as no joining at all.

quote:

Investigate the issue for yourself.

"Accept no argument from authority"
Carl Sagan

DavidM

"SQL-3 is an abomination.."

Edited by - byrmol on 08/24/2002 20:28:47

stevenbeales
Starting Member

1 Post

Posted - 2003-01-05 : 04:58:46

>>You never NEED a surrogate key. If you DO, you aren't normalized.

I believe that surrogate keys are ALWAYS the best choice for the following reasons - and that some form of GUID/UUID key (tuned for performance) is USUALLY the best choice for that key.

(i) In a physical database, it is best to consider the primary purpose of the primary key to be establishing relationships in the database.
(ii) Natural keys are exposed to end-users - this is the problem. What is a unique natural key? - let's take a company employee number for an employee table as a possibility - even when you've overcome the problem that some new employees might not have an employee number yet, there are still issues about people changing numbers and worse numbers being incorrectly entered and then updated at a later date. If you have joined all your tables based on this user entered data (the employee number), you will have to update all child tables that join using this number as a foreign key if the employee number is changed in the parent table and this is obviously not ideal - it is much preferable to have an ID field that is not exposed to the end user and have the employee number as an attribute - then if the employee number is updated, no other tables are changed.
(iii) Using GUIDs as keys means that if your application/database is successful and is used by many companies/users - all that data can be aggregated and put in a data warehouse with no extra data conversion -because all keys are globally unique.

Steven

Merkin
Funky Drop Bear Fearing SQL Dude!

4970 Posts

Posted - 2003-01-05 : 05:57:33

*sigh*

This is a popular topic lately.

Damian

Edited by - merkin on 01/05/2003 05:58:36

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-01-06 : 10:11:46

scrap that, long tiring, junk. (i tried to model it that way and it was 20 tables, maybe I did something wrong, but I broke out everything into natural keys like everyone suggested)

what I really want to know is how to model a tree structure using logical normalization rules.

ROOT_01
 : CHILD_01
    : LEAF_01
 : CHILD_LEVEL_02
    : LEAF_01
 : LEAF_01
ROOT_02
 : LEAF_01

where each node has a name, and that name could be the same as another node in a different sub-branch. The biggest problem that I have so far is the ROOT nodes. They have no parent node.

Does the tree structure, a common structure found all over the real world, violate the rules of a natural key, or make it impossible to choose a natural key for this real world entity?

Examples of a tree structure: organization, table of contents, family tree.

Edited by - onamuji on 01/06/2003 10:58:29

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-06 : 11:01:38

Usually a root node will have a Null parent, or a placeholder value (like zero) that points to a non-existent parent. Another approach is to put the node's value as the parent; it's its own parent then. In either case you need to write logic into the code to process the root nodes properly, since they are essentially exceptions amongst nodes.

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2003-01-06 : 11:17:27

quote:

family tree.

Family trees aren't trees, they're labelled directed acyclic (barring accidents with time machines) graphs with outdegree 2 (barring Raelian loonies).

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-06 : 11:45:35

quote:

barring accidents with time machines

LOL

Those things mess everything up !!

- Jeff

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-06 : 12:09:47

Who needs a time machine? Inbreeding will FUBAR a family tree in no time!

Faye Dunaway: She's my daughter.
SLAP!
She's my sister.
SLAP!
She's my sister, my daughter...
SLAP!
Jack Nicholson: I SAID I WANT THE TRUTH!
Faye Dunaway: SHE'S MY SISTER ANNNNNND MY DAUGHTER!

How'd you like to model THAT ONE in a "relational" database?

Arnold Fribble
Yak-finder General

1961 Posts

Posted - 2003-01-06 : 12:30:43

Chinatown?

You're being fooled by the ambiguity of 'sister'. It could mean either full-sister (both parents the same) or half-sister (one parent the same).

In this case, Faye being her own daughter's full-sister would indeed give a loop... and imply that Faye is her own mother.
On the other hand, if Faye's daughter's father is her own father than she's her daughter's half-sister -- no loops.
Simple.

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-01-06 : 15:14:32

Onamuji,

A "proper" tree has only 1 ROOT. So if the thing that you are modelling has many roots, then you will have to "make up" a "Super Root" (Sound like a title to a porno film....)

If we take a look at 2 ways to model the "part explosion" problem, I think the natural key speaks for itself.

1) Adjacency List Method
Parts table
Part (Key)

PartsHierarchy table
Parent_Part (Key)
Child_Part (Key)

2) Nested Set Method
Parts table
Part (Key)

PartsHierarchy table
Part
Lft (Key)
Rgt (Key)

Both methods describe the position of the part in the hierarchy and I would have thought the natural key is its position in the hierarchy.

The Part and its immediate parent (Adjacency) or its position Lft and Rgt (Nested set)

In the Adjacency method, the ROOT part appears only in the Parent_Part column and never in the Child_Part column.

This is Celko territory.. you can find plenty of info on the web or in his books..

DavidM

"SQL-3 is an abomination.."

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-01-06 : 16:55:55

but then if i have to have a super root (which isn't a problem) my foreign key on the parent still doesn't work :( and what if i'm the 5th level down in the tree (or up depending on which way your gravity goes) using the name of the parent when the names could be in two different sub branches but the structure only allows for one... argh... i guess relational dbs aren't perfect... i withdraw my bid for president...

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-06 : 17:19:25

I think you were referring to your earlier example (that you removed) where you had Computer->Server->Hardware->Server as a tree, correct? The thing is, although the name "Server" is the same, they are NOT the same node because they do not have the same parent in the tree. Additionally, that kind of lineage is sort of strange, unless you're using it strictly for illustration purposes. I'd question any kind of data hierarchy that has that kind of loop in it, unless you can provide a specific example that demonstrates it. I think that you might be able to categorize it differently to work around that too.

You can create a root node that has itself as its parent, even with a foreign key. If it doesn't INSERT properly, disable the constraint, insert the row and update the parent, then re-enable the foreign key. Everything you INSERT after that would work off the root.

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-01-06 : 17:40:25

I take it you are using the adjacency method..

The sample data you gave won't break the PK constraint. And I can't see how that would affect the FK constraints.

But I think you mean that you have repeating Parent-Child elements that appear in different trees. So the "Super Root" won't work and you will have to add an extra attribute like Tree_ID or something to distinguish individual trees...

Could you eleaborate?

I glad you said "relational dbs" and not "relational model"!

DavidM

"SQL-3 is an abomination.."

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-01-06 : 19:41:12

the relational model is a solid fondation for data relationships. i'm once read something about an object-oriented model for database modeling.

but to get to the point here we go

Category
--Desktop
----Hardware
------Server
--------SUN
----Software
------Server
--------SUN

this is typical for our category listing, broken out by hardware and software then the type of part then another category, specifically (SUN hardware equipment and SUN software) (NT hardware base and NT software)

does that help?

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-01-06 : 21:03:06

As Rob pointed out the "SERVER" has a different parent so that should still be OK.

quote:
i'm once read something about an object-oriented model for database modeling.

At the moment there is no formal OO model of data so do some research first on the software in question. Date et al in "The Third Manifesto" keep asking what exactly these implementations really want. They come to the conclusion that they want type inheritance. They have devoted a substantial amount of ink to the topic and it is staggeringly complex. Believe it or not they go one to show that the relational model could handle this without any major revision. I hope I am alive when that product hits the market..

DavidM

"SQL-3 is an abomination.."

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-06 : 21:09:58

quote:
I hope I am alive when that product hits the market...

Well, if you can live until Yukon comes out, it should pretty much handle it.

If you believe the MS hype, naturally...

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-01-07 : 08:52:32

one more thing... in another post I gave the code to make a Country, State, City, Zip type address set using natural keys. By the time you got to the Zip table you had the entire Country, State, and City names in the table. Now say I went one further:

CREATE TABLE StreetType ( -- This is things like Road, Street, Crescent, Drive, etc
Name NVARCHAR(32) NOT NULL PRIMARY KEY CLUSTERED,
Abbreviation NVARCHAR(4) NOT NULL
)

CREATE TABLE Street (
Type NVARCHAR(32) NOT NULL REFERENCES StreetType(Name),
Name NVARCHAR(128) NOT NULL,
PRIMARY KEY CLUSTERED( Type, Name )
)

Now that I have that structure... and now I need to compose a structure that combines everything... I'm concerned the further I go into the natural key depth, the larger my primary key is going to become since most of these things are between 32-128 NVARCHAR characters ... and the primary key can only be 900 bytes... I know this is a problem with the database engine, and not the model... so how do you overcome this, by using surrogate keys?

CREATE TABLE Address (
Line TINYINT NOT NULL, -- Used for multiple address lines (like PO box, apartment, etc)
Address NVARCHAR(32) NOT NULL,
Street_Type NVARCHAR(32) NOT NULL,
Street_Name NVARCHAR(128) NOT NULL,
City NVARCHAR(128) NOT NULL,
State NVARCHAR(128) NOT NULL,
ZipCode NCHAR(5) NOT NULL,
ZipCode_Extension NCHAR(4) NOT NULL,
Country NVARCHAR(128) NOT NULL,
PRIMARY KEY CLUSTERED (Line, Address, Street_Type, Street_Name, City, State, ZipCode, ZipCode_Extension, Country),
FOREIGN KEY (Street_Type, Street_Name) REFERENCES Street(Type, Name),
FOREIGN KEY (City, State, ZipCode, ZipCode_Extension, Country) REFERENCES ZipCode(City, State, ZipCode, Extension, Country)
)

Which i think is way over the 900 byte limit... just want to know your thoughts on this... just trying to explore this topic fully...

... and I have so far gained a boot full of wisdom thanks to you guys

-- Update
Oh and I took the time to find a complete listing of all countries and I know that country names have a maximum length of 32 characters currently (assuming the list I have is complete) as I'm sure state/province and city names are much shorter than 128 characters ... so maybe its not an issue but I could go one level deeper...

CREATE TABLE Vendor (
Name NVARCHAR(128) NOT NULL PRIMARY KEY CLUSTERED
)

CREATE TABLE VendorAddress (
Vendor NVARCHAR(128) NOT NULL REFERENCES Vendor(Name),
Line TINYINT NOT NULL,
Address NVARCHAR(32) NOT NULL,
Street_Type NVARCHAR(32) NOT NULL,
Street_Name NVARCHAR(128) NOT NULL,
City NVARCHAR(128) NOT NULL,
State NVARCHAR(128) NOT NULL,
ZipCode NCHAR(5) NOT NULL,
ZipCode_Extension NCHAR(4) NOT NULL,
Country NVARCHAR(128) NOT NULL,
PRIMARY KEY CLUSTERED (Vendor, Line, Address, Street_Type, Street_Name, City, State, ZipCode, ZipCode_Extension),
FOREIGN KEY (Line, Address, Street_Type, Street_Name, City, State, ZipCode, ZipCode_Extension, Country) REFERENCES Address(Line, Address, Street_Type, Street_Name, City, State, ZipCode, Extension, Country)
)

Edited by - onamuji on 01/07/2003 09:22:17

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources