Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2000 Forums
 SQL Server Development (2000)
 Simple Data modelling question....

Author  Topic 

JozzaTheWick
Starting Member

16 Posts

Posted - 2002-12-30 : 15:41:55

Hey everyone. I'm in the process of designing a database system that will log client tranascations (~ 1 a week per client, about 10 clients). I'd like some input on my data model so far.

Clients make transactions (e.g. buy & sell beer). Some client tranasctions (let's say about 25%) are made for specific locations that they own (e.g. client buys beer for the Slug & Lettuce), and the for the rest a location is not specified (e.g. client buys beer for all of their pubs). My transaction table currently looks something like this:

transaction_id
client_id
location_id
<transaction details>

I also have a clients table:

client_id
client_name

and a locations table:

location_id
client_id
location_name

I don't want to have any nulls in the 'location_id' in the transactions table, so I am considering adding an 'Unspecified' location to the locations table for each client and entering this as the location whenever it is not specified. This seems to solve the problem, but it does feel a little awkward, like I'm using the 'location' for two purposes. What do people think? How have others addressed this problem? Thanks in advance.

cheers,

J.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2002-12-30 : 16:03:47
I think it's a great idea to use "unspecified".

- Jeff
Go to Top of Page

GreatInca
Posting Yak Master

102 Posts

Posted - 2002-12-30 : 20:10:19
I like 'unspecified' I use PK value 0 for unspecified as blanks convert to 0. Keeps them nasty nulls out of the DB. Greatly reduces poor performing pouter joins / IS NULL .

Go to Top of Page

r937
Posting Yak Master

112 Posts

Posted - 2002-12-31 : 15:14:48
JozzaTheWick, can you please clarify why you don't want to have any nulls in the location_id?

GreatInca, can you please explain how outer joins perform poorly when the left table contains nulls in the join column?

rudy
http://rudy.ca/
Go to Top of Page

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2002-12-31 : 21:50:24
J,

A slightly different angle....

A few suggestions...(based on some assumptions)

1) Is the Client_Name unique? If so, lose the identity column.
2) Remove the Location_ID identity column and make the Client_Name and Location_Name the key.
3) Split the transaction table into 2 tables.

TransactionDetail table
TransactionID (Key)
Client_Name
<transaction details>

TransactionLocation table
Transaction_ID (Key)
Client_Name
Location_Name

No nulls, no joins, less disk space, not "awkward" and no middle tier interpretation needed.

DavidM

"SQL-3 is an abomination.."
Go to Top of Page

steelkilt
Constraint Violating Yak Guru

255 Posts

Posted - 2003-01-02 : 16:39:52
I wonder same about NULLs in this case. To me, NULL functions as "unspecified", i.e. the person entering the data did not have valid data to enter for this field, thus NULL works.

Go to Top of Page

X002548
Not Just a Number

15586 Posts

Posted - 2003-01-02 : 16:52:19
What you're describing is something called a non-identifying relationship...you make the location id null

Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-02 : 17:02:42
Note the differences between:

Null -- the data entry person had no idea what was correct so left it blank

Null -- the data entry person forgot to enter data in the field

Null -- the transaction truly has no location

Null -- a left outer join in a query is not returning a match from the table

Thus, you can see why you may wish to include a location_id that is "unspecified". It forces a meaniful value into the field and eliminates the need for outer joins.

It makes your reporting and querying easier; no OUTER JOINS, no dropping out records accidentally because you used INNER JOINS, no ISNULL(LocationDescription,'Other') as LocationDescription type formulas, etc.

The only time I really like to use Nulls as a valid entry are description-type fields or Date fields.

- Jeff
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-02 : 17:11:50
quote:

1) Is the Client_Name unique? If so, lose the identity column.



Yikes. You really shouldn't use a long varchar() field as a primary key; any links to the client table will have to include the entire client name! Also, what if the client name changes? What if donwn the road two clients have the same name? Sorry to sound like i am bashing you, but it is really a very bad idea to use client_name as a primary key of a table.

quote:

2) Remove the Location_ID identity column and make the Client_Name and Location_Name the key.



Likewise for the above. If only for data storage and querying speed alone! An int takes 4 bytes, a varchar(50) takes 50 bytes. Never mind storing the location name as well, which is probably even longer!

quote:

3) Split the transaction table into 2 tables.

TransactionDetail table
TransactionID (Key)
Client_Name
<transaction details>

TransactionLocation table
Transaction_ID (Key)
Client_Name
Location_Name

No nulls, no joins, less disk space, not "awkward" and no middle tier interpretation needed.



No joins? How do you get the location of a transaction along with the rest of the transaction's data?

If a transaction has a location, it belongs with the transaction record unless something OTHER than that particular transaction determines the location. For example, if it is based on the client, it should be in the client table.

Also, each transaction has the full client name (most likely twice per transaction because you split it into two tables) as well as a copy of the location name?

Again, I sincerely apologize for bashing your post, but it really isn't very sound advice.

- Jeff

Edited by - jsmith8858 on 01/02/2003 17:13:38
Go to Top of Page

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-02 : 17:51:47
Well Jeff, I don't mean to bash your post either, but your arguments aren't entirely persuasive either. You're concerned with using extra space in a varchar column that would belong in the table anyway, and then suggest ADDING another column that uses more space (???) as well as its corresponding index(es).

What if a client changes their name, you ask? Well, what if someone inserts the same client twice, or three times, or updates another client to have the same name as an existing client? You'd need to have a unique index/constraint on the client name anyway, in addition to another int column that is primary key. Now you've got two unique constraints, and one of them is completely wasted. Ditto for multi-columns keys. Without a unique constraint on that they'll get dupes very easily, and good luck cleaning them up afterwards.

Splitting the locations out into another table is a great idea for this situation. The 4 instances you listed would not be problems at all. The first two are data entry issues anyway and do not compromise the data integrity of either table. And the last two are really the same: no row in the Locations table means there is no specific location, period, LEFT JOINs or not, doesn't matter.

As far as joining smaller int columns for performance reasons, what about designing the tables so that joins are not necessary at all, or rarely? Natural keys make that a lot easier, and no matter how small the joining column is, if two or more tables are involved you end up having multiple extents involved, therefore more I/O (and less sequential and efficient I/O as well) It is a serious overgeneralization to say that smaller keys lead to better performance, especially if smaller keys require more tables. I'm wondering how many times you've actually used natural keys in order to say for certain they perform worse.

Go to Top of Page

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-01-02 : 18:32:48
Thanks for that Rob.

I don't want to go down the surrogate key debate here. But would like to add that this post is about logical design and not physical.

To clarify:
I should have said "one join".

Data integrity and meaning is your FIRST responsibility as a DBA, then worry about performance....

You can say what ever you like Jeff and honestly I was a touch offended but I got over it. I am always willing to learn and change my opinion but only when the competing advice/evidence is overwhelming. Your post has done nothing to change my position in this instance.

Jeff, much to the chargrin of some fellow SQL Teamers. I would strongly recommend that you visit [url]ww.dbdebunk.com[/url]. It is updated weekly and the authors involved are the best in the business.

DavidM

"SQL-3 is an abomination.."


Edited by - byrmol on 01/02/2003 18:41:48
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-02 : 19:54:57
A long varchar() -- a client name -- as a primary key ???

Seriously?

If I have 1000 clients and I add that "expensive" ID for the primary key -- I just added 4000 bytes - 4K. Yes, there is an index too, feel free to calculate that and add it in. Let's say its 10,000 bytes -- 14K.

If I have 100,000 transactions for those clients, and of course they reference the client_id, so we have 400,000 bytes or 400K.

Thus, in addition to the data stored in these tables, we have use 414KB of memory. Joins are done by comparing 4 bytes to each other.

So, lets save the 14K and leave out the client_Id field. Lets say the client_name is 40 characters long -- a varchar(40) field.

So now, for those transactions, we nedd to store 100,000 X 40 = 4,000,000 bytes or 4MB of space just for the transaction table.

10 times as much storage.

Now, for joins, do a test for me: Join 100 to 100,000 records on indexed varchar(40) fields, and then do a join of 100 to 100,000 records on indexed int fields. Let me know the results of which is more efficient.

Next, do some data entry. Make sure that when you enter transactions, you type in the customer name EXACTLY character for character each time -- off by 1 letter, you violate the primary key.

Then, later on, change the name of a few clients and let the triggers do all the work to update all of the transactions and compare that doing nothing when using an ID.

Finally, lets create some budget tables, a table of client addresses, a revenue transaction table, invoices, cash receipts, contracts, timesheet transactions, and bring it all together and decide if the best way to do our linking and data storage is by storing the entire client name in every record all these tables.

Of course, we would then store employee names, invoice types, contract types, office locations, department names, and everything else listed out a varchar's as well I suppose?

Breaking out the transaction table to have a transaction location table is fine, no biggie with that, except that it contradicts with the other things you are saying. You are now requiring a JOIN where you don't need one, but then you say store varchar()'s with client names in transactions so you don't need to join to get their names???

whew .. had to vent a little. Sorry to offend everyone, and I am not trying to open up a debate about natural keys like our good friend quazi bubble (I was on Rob's side on that debate for the most part), but ... come on !!!!!

I may be totally alone on all this, oh well, everyone can make up their own mind. I don't think you'll bring SQL to it's knees by or tax your programmers to much by making them join an integer client_Id to the clients table to get the client name!



- Jeff
Go to Top of Page

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-02 : 21:58:06
quote:
Now, for joins, do a test for me: Join 100 to 100,000 records on indexed varchar(40) fields, and then do a join of 100 to 100,000 records on indexed int fields. Let me know the results of which is more efficient.
That was the point I was trying to make earlier: properly designed and used, you would not NEED to join two tables at all in order to find the transactions for a customer. And it's a meaningless test anyway; tables can be designed in every possible way to make either method perform better or worse than the other. If you haven't come across that yet, you will.

Here's the performance problem I alluded to in more detail: you have a customer ID that you store in your transaction table. Naturally (unnaturally? who knows?) you'll also have a transaction ID as your primary key in the transactions table, and that will act as your clustered index as well (you'd have to make an effort to make it otherwise, and most people won't...myself included) So in joining and/or querying for a specific customer ID, the query will have to do all kinds of I/O in order to return the rows, quite possibly even a table scan. Whereas a composite key on customer ID and transaction date, for instance, would very likely improve performance while maintaining the uniqueness required for the transaction table's data (or whatever other attribute(s) make up the key for that table)

And even if integer keys are smaller and more efficient, I don't see how much more efficient it is to have to read an entire index, or even one data page, into cache when only one key is needed for a particular query. On the other hand, if a natural key in one table reads index pages that contain actual, useful data for the query (covered indexes) that's always a plus. Surrogate keys don't work as covering indexes.
quote:
Next, do some data entry. Make sure that when you enter transactions, you type in the customer name EXACTLY character for character each time -- off by 1 letter, you violate the primary key.
OK, and how is it better to require the DE operator to look up an account number, WRITE IT DOWN or memorize it, and still potentially mistype it later? Oh yeah, if they're off by one digit, they can easily choose a legitimate account number, it just happens to be the wrong customer. How do you go about fixing that one? By having the name as the key, mistyping it would throw an error and FORCE them to fix it right away (and using a surrogate key to avoid throwing such errors is dumb; you'll waste more time cleaning up dupes later on than you'd ever lose in writing proper error handling...I know, I've done it myself and seen plenty others do it too)

And how many times has a customer had multiple accounts, and transactions are matched to the wrong account number and need to be fixed? And if a customer DOES change their name, are they really the same customer? SHOULD they be allowed to change the name on their account? Or should they be required to open a new account? This is a serious question that could very well have legal ramifications outside of your particular database (it did in several of mine) An account number key would not prevent these kinds of changes, but a natural key with no cascade actions would. Your particular situation may not require it, but someone else's might.

There's arguments for both sides, and if surrogates work for you, then use them. But blindly using them will bite you in the ass at some point. They should NEVER be considered as the first key candidate, ever. You've at least made the effort to explain your reasoning for using them. Lots of people do not, and it hurts them when a common and accepted practice is expounded without explaining any downfalls, misconceptions, or alternatives. This was essentially the point of my article, which unfortunately I did not make clear enough.

And don't forget David's point: data integrity is primary, performance secondary at best. The world's fastest database is worthless if its data is shit. Surrogate keys may improve performance in some circumstances but they NEVER do anything for data integrity, unless there is literally no other possible key available in the data being modeled.

As far as having two tables for transactions and locations, I don't see a major problem in having one table with a nullable column for location, except in the case where the location is not the only attribute that needs to be recorded. If there is a delivery date, for example, or other columns that must be filled in, keeping them in the same table would require all of these extra columns to be nullable, and you've got a hell of a CHECK clause or trigger to write to maintain them properly. Splitting them out into a separate location table permits all those columns to be required with no extra overhead for integrity checks. And the locations table can be used to track multiple locations for the same transaction, if such a feature were needed in the future. I'd go with two tables in this instance UNLESS they performed far worse in comparison to one table (demonstrated through tests of both methods, naturally)

Whew! How's that for keeping the debate alive?

Go to Top of Page

byrmol
Shed Building SQL Farmer

1591 Posts

Posted - 2003-01-03 : 03:22:57
Jeff,

I repeat.. This is about LOGICAL design not a physical one. This is one of the great confusions in the DBMS industry. It is in the same vain as "De-normalise for performance". Normalisation is a LOGICAL issue and thus the statement is a crock of shit. (Now that will add some spice to the debate)

quote:
Breaking out the transaction table to have a transaction location table is fine, no biggie with that, except that it contradicts with the other things you are saying. You are now requiring a JOIN where you don't need one, but then you say store varchar()'s with client names in transactions so you don't need to join to get their names???


What other things? It seems that you don't understand the model here.

Jozza has said that about 25% of Transactions are Location dependant.
Inversely 75% don't care about the Location.

So we split the table into 2 parts. All Transactions are entered into the TransactionDetail table. Only those with a location have a corresponding row in the TransactionLocation table.

Now the whole point of you using "Unspecified" as the Location was to differentiate between the different "types" of Transactions. So instead of letting the front end figure out which is which, the model I propose already "knows".

A nice view joining the TransactionLocation and TransactionDetail is the ONLY join I will need given the requirements by Jozza.

quote:

Now, for joins, do a test for me: Join 100 to 100,000 records on indexed varchar(40) fields, and then do a join of 100 to 100,000 records on indexed int fields. Let me know the results of which is more efficient.


Oh please... You really think I don't know which will join faster? Now you are pissing me off.
No joins are faster than joins. Let me know when you can prove otherwise.

To summarise you post...disk space is more important than integrity and speed.
I really hope you see how I came to that conclusion.

DavidM

"SQL-3 is an abomination.."
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-03 : 09:23:53
Guys -- a client name means nothing. If you are writing extremely simple reports or queries that show a client name and don't show any client info, with just detail or totals from the transactions table, and that is all you require ever in your database, then yes -- storing the client name with each transacion will save you that "expensive" join.

But real useful reporting requires clients locations, client types, addresses, notes, primary contacts, salesman, or any other info -- YOU HAVE TO JOIN TO THE CLIENTS TABLE OR OTHER TABLES MANY TIMES for comprehensive reporting. Or do we store all that info as well in each transaction?

So, I guess the problem is we are discussing different things. If you have simple databases that just list a few transactions and complexe reporting is not a requirement and there is no other data that needs to be incorporated with your transaction table -- yes, go ahead and store client names in each transaction. YOu could also use Excel for this type of storage, might be easier (that is not sarcasm).

But if you need other client info in your reporting or modelling, have other tables that need to link to your client table (and that can be large), it does not make sense to store client names in all of these tables.

Most importantly : I don't recall ever saying that client_ID should be an identity or blindly assigned. Completely different discussion that has nothing to do with any of my points! If that is what you guys are refering to (using client_name vs an Identity as the primary key of the table) than that is a completely different debate. This is a key point to understand throughout my posts in this thread.

About the transaction locations -- you keep saying you don't need joins???? Write me a little SQL to show me some transaction detail and whether or not it has a location. Even if 99% of transactions don't have a location, you still need to join to the transaction location table to determine this for a given transaction.

Hopefully that clears some stuff up. And I am 100% talking about physical design; I also don't care much about theory. I personally do what works best for both speed and performance. I rarely use identities, don't blindly normalize, and test various configurations before deciding on a final one.

And finally: The size of your DB isn't just storage; it affects performance as I hope we all know.

- Jeff
Go to Top of Page

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2003-01-03 : 09:35:23
quote:
And I am 100% talking about physical design; I also don't care much about theory. I personally do what works best for both speed and performance.


Hum....that's a bold statement for a database professional.

So tell us, if you are so keyed in on speed, how much faster is an int typed key than a char typed key?

Jay White
{0}
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-03 : 09:47:24
I am amazed I'm being picked apart on this one.

People read the posts, ignore what they don't understand or have no response for, and then quote back 1 line that has nothing to do with anything and then personally try to attack you.

I agree, then -- storing a client_id is a small int or short char() primary key is a bad idea. Leave a varchar() as your primary key which is the client name, store it in all transactions, and all tables that need to refernce your client table also should store the complete client name in their records as well.

Could've been a nice fruitful discussion about different design techniques, but I give up.

I am amazed not 1 person on this forum has any clue as to what I am talking about.

You may want to ban me from the forums, with my crazy ideas I am probably polluting the minds of those i am trying to help, doing much more harm than good!

Edited by - jsmith8858 on 01/03/2003 09:48:09
Go to Top of Page

Onamuji
Aged Yak Warrior

504 Posts

Posted - 2003-01-03 : 10:27:53
the sickness is spreading ... ever since this stupid discussion ... http://www.sqlteam.com/Forums/topic.asp?TOPIC_ID=6136 ... everyone seems to be going bonkers around here ...

Go to Top of Page

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2003-01-03 : 10:37:30
Now, looky here ... I'm just a country boy without too much good book learnin'. I ain't gots me one o 'dem fancy gen-u-wine MCDBAs. Now, I tried real good to work through your postin' and I gots me a question rattlin around in my noggin. You seem to be saying that using a varchar datatype as a primary key slows down a system, especially as the varchar gets longer. And I'm just sittin her a'wondering "How much slower?"

Come on, man ... no one is picking you apart. This natural key vs. surrogate key banter is as old as the relational model itself. I'm just wondering if you actually done any testing ... gathered any metrics ... to back up your claim that a natural key is slower. I did. Yeah, it can be slower on joins ... not by much. Weight that against the number of queries that no longer require a join and it can pretty much come out in the wash. I'm going to ask anybody that takes a stand on the surrogate key side of the fence for performance reasons this same question. I'm going to keep doing it until I get a good answer. By all means ... do some tests and write an article.

Also, I think you do care about theory and I think you do care about data integrity before speed and performance. I don't know why you would say that you don't.

Jay White
{0}
Go to Top of Page

robvolk
Most Valuable Yak

15732 Posts

Posted - 2003-01-03 : 11:11:53
Jeff-

You don't have crazy ideas, you're not "infecting" anyone here (eeeeeewwwwww, I'm thinking of something else now), and we understand very well what you're talking about. We simply don't agree that it's BETTER to add surrogate key columns when existing columns are naturally unique. To be fair, you haven't addressed that part of OUR arguments, and you've given as well as you've gotten (good for you!) And I gotta say, if you think YOU'RE being attacked, re-read the comments for the identity article. You'd think I was Hitler in a past life or something.

The problem is, without the kind of detail we've gone into here, if someone followed your advice Jeff, they could easily design a table like this:
ID(PK) State   StateName
1 AK Alaska
2 AL Alabama
3 AR Arizona
...
And they would never realize that they've wasted space and compromised their data's integrity. Again, a primary key's purpose is maintaining the uniqueness of the data, not the performance of the database, and it does no good to have a high-performance DB that has crap data or dupes or anything else, things that surrogate keys don't guard against without additional (and unnecessary) effort.

You're also tailoring an example that is specifically and obviously a bad choice for a key, and INSISTING on a design situation that hampers performance, and then making broad statements about how the design/theory is the sole culprit and should be avoided. That's not what I'm defending, and that does not mean natural keys ALWAYS lessen performance. That also does not mean a surrogate key would be a better choice, and again, I can't see how adding another column to a table addresses that problem. If you end up designing tables that need a join anyway, why add a surrogate key column if it's not needed? And there are such things as tables that aren't referenced by other tables; look in Northwind and you'll find a few. They even use multi-column keys (fancy that!)

And there are considerations beyond performance...as you said, Excel would be preferable for some of the things described here. The point is, Excel does not provide the features needed for data integrity; otherwise, you'd use it instead of SQL Server, right?

For the record, I've never said your designs are bad, I just don't agree that they're always better. That's the heart of this discussion. They work fine in your situation, but your situation is NOT typical of everyone else's...nor is mine. I'm not forcing anyone to always use natural keys; (shut up Femenella!) I DO insist that they consider a natural key first, and make the effort to defend why they CHOSE a surrogate key instead. Everyone knows that that rarely happens, and we've all failed to do that ourselves at some time. If people come out of that process with a surrogate key, that's fine...as long as they go through the PROCESS.

I get the feeling that JozzaTheWick will never post another question again.

Edited by - robvolk on 01/03/2003 11:12:41
Go to Top of Page

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2003-01-03 : 11:18:24
First, can I re-emphasis that my discussion has nothing to do with natural vs. surrogate keys? OK.

Can someone please address all of my points, not just one; and can people stop making up points (like the natural key thing) that I am not even saying?

So, can someone address these points directly:

1) useful reporting will not just show a client's name, but primary contact, address, salesman, location, client_type or something similiar, etc. Should all of this data be stored in each transaction? if not, we need to join to the client table and other tables for this data. We do not elimate joins by storing the client name in every transaction, unless it is extremely simple reporting.

2) Other tables may need to reference clients, such as contracts, budgets, invoices, labor transactions, etc. Should the entire client name be stored in all transactions in all of these tables?

3) We also have employees, departments, divisions, offices, employee_types, invoice_types, etc. Should we store the full text description of all these in all transaction tables? Each of these components, like clients, also has other attributes (other than names) that we will need to reference for reporting, calculating, etc -- joins to these tables will also be required.

I also have no certifications, never went to a 2-day "learn to be a SQL DBA" course. Just have a CS degree and what I consider a good understand of programming, computers and logic. There are many aspects of database theory i follow, because I have found it works, and many others I don't follow, know about, or even care about.

- Jeff
Go to Top of Page
  Previous Page&nsp;  Next Page

- Advertisement -