When to JOIN - SQL Server Forums

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2000 Forums

Transact-SQL (2000)

When to JOIN

Author

Topic

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-03-31 : 03:04:51

http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=47738

So along comes PW and gets my Topic locked because he arrogantly jumps to the wrong conclusion!!!

Thats the first bad experience on this site and hopefully the last!

THIS IS A COMPLETELY DIFFERENT ISSUE
I am not talking about comma seperated values here!

Perhaps Rockmoose will be allowed to continue with his CONSTRUCTIVE attempt:
Table1
------
OldID NewGUID
1 {8E2F9FE5-F382-49B5-B439-81A0258E5251}
2 .....

Table2
------
OldID NewGUID
1 {8E2F9FE5-F382-49B5-B439-81A0258E5252}
2 .....

Table3
------
OldID NewGUID
1 {8E2F9FE5-F382-49B5-B439-81A0258E5253}
2 .....

Table4
------
OldID NewGUID
1 {8E2F9FE5-F382-49B5-B439-81A0258E5254}
2 .....

Table5
------
OldID NewGUID
1 {8E2F9FE5-F382-49B5-B439-81A0258E5255}
2 .....

OldHistoryTable
---------------
ID TheDate Tbl1ID Tbl2ID Tbl3ID Tbl4ID Tbl5ID
1 13/5/89 1 3 2 1 20
2 14/5/89 4 5 6 8 2
3 ...

NewHistoryTable
---------------
ID TheDate Tbl1NewGUID Tbl2NewGUID Tbl3NewGUID Tbl4NewGUID Tbl5NewGUID

Does that help?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2005-03-31 : 07:47:02

Bob, relax. I locked the topic so that people follow the original thread and not the new one, we do it all the time. Unless you are substantially changing the topic, please post follow-ups in the original thread, don't start another. It's too easy for readers to get lost. Ignore PW.

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-03-31 : 08:43:37

I still say this is a seperate issue and merits a discussion in its own right.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-03-31 : 09:41:41

What are you trying to here? What is the NewHistory table storing? I mentioned before about normalization, and you indicated you were a big believer/follower of it, so what are you trying to do here?

All you need to do is SELECT from your old history table, and do a join to each of your OldID/NEwGuidID tables on each column, and insert the results into your new table:

INSERT INTO NewHistoryTable (ID, TheDate, Tbl1NewGUID, Tbl2NewGUID, Tbl3NewGUID, Tbl4NewGUID, TblNewGUID)
SELECT Old.ID, Old.TheDate, T1.NewGUID, T2.NewGUID, T3.NewGUID, T4.NewGUID, T5.NewGUID
FROM OldHistoryTable Old
LEFT OUTER JOIN Table1 T1 ON Old.Tbl1ID = T1.OldID
LEFT OUTER JOIN Table2 T2 ON Old.Tbl2ID = T2.OldID
LEFT OUTER JOIN Table3 T3 ON Old.Tbl3ID = T3.OldID
LEFT OUTER JOIN Table4 T4 ON Old.Tbl4ID = T4.OldID
LEFT OUTER JOIN Table5 T5 ON Old.Tbl5ID = T5.OldID

But, again, I really question your table structure and what you are tyring to accomplish here ...

- Jeff

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-03-31 : 11:08:19

Thanks for answering Jeff.
Tell me if this is not 3rd normal form:
I have a history table that stores records for things that took place.
Those 5 ID's represent 5 different objects that were involved in that event eg if I dream a few entities/tables up for you: Companies, People, Contracts, Locations,...

What other way is there to do this? Whats wrong with it?

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-03-31 : 11:36:43

can an event ever occur to more than 1 of those entities? I.e., can an event happen in more than 1 location or with more than 1 person?

Typically, you'd have the master table of Events with only the data that is completely dependant on the event and is 1-1. (i.e., datetime like you mentioned, or description, or category, or whatever is fully dependant on that particular event and nothing else). Then you'd have related tables like EventPlaces or EventCompanies which store many-to-many relations between Events and the things you wish to link to the events.

So, it an event can only take place in 1 location and for 1 company, put those values in your Events table. But if it can include many people or many departments, break those out into separate many-to-many tables, with primary keys of EventID/PersonID or EventID/DepartmentID for example.

- Jeff

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-03-31 : 12:53:44

quote:
Originally posted by coolerbob

Thanks for answering Jeff.
Tell me if this is not 3rd normal form:
I have a history table that stores records for things that took place.
Those 5 ID's represent 5 different objects that were involved in that event eg if I dream a few entities/tables up for you: Companies, People, Contracts, Locations,...

What other way is there to do this? Whats wrong with it?

I would like to elaborate a little on the points that Jeff raised on the design.
And your question about 3NF.

You are basically storing an event involving 5 entities; (EVENT,A,B,C,D,E)
You table would be in 3rd normal form, but it would probably contain "multi-valued" dependencies.
I say probably, because I don't know exactly what you are doing with this table.

Consider this:

If the event can involve say 2 "A"s

EVENT  A    B  C  D  E
e1     a0   b  c  d  e
e1     a1   b  c  d  e

Then the information that b,c,d,e was involved with e1 is stored twice!

The solution is to further normalize the database (4NF)
Like Jeff suggested:

EVENT A
e1    a0
e1    a1

EVENT B
e1    b

EVENT C
e1    c

EVENT D
e1    d

EVENT E
e1    e

If You join those 5 tables on e1, You will in fact get this resultset:

EVENT  A    B  C  D  E
e1     a0   b  c  d  e
e1     a1   b  c  d  e

Now a note on "missing" information:
-------------------------------------------------------------------------------------
Furthermore, in the (EVENT,A,B,C,D,E) table it is not possible to store the information that
for example b,c,d,e were involved with EVENT e2, but no A was present:
(unless we introduce NULLS)

EVENT  A    B  C  D  E
e2     NULL b  c  d  e

But now we could theoretically get this in the database (due to NULLS, and what would the key be?):

EVENT  A    B    C    D     E
e3     NULL b    c    NULL  e
e3     a    NULL c    d     NULL

Which will make it really ugly to get the whole picture for e3.

rockmoose

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-04-01 : 02:36:15

OK, but isn't there quite a big trade-off in performance with all those extra tables?

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-04-01 : 04:09:59

I'm getting battered continueously on how things like 3rd and 4th normal impact performance. The line here is that "JOINS cost a lot. Sometimes things can't be pretty and you have to duplicate data or commit some sin to spare yourself a performance costly JOIN." I'm getting told to go and read the book on the core SQL engine to find out how it uses buffers and everything else before I dare suggest that the peformance hit of JOINs generally are neligable and worthwhile.
That's another reason why the table is the way it is at the mo. Any ammo for me anyone? Or are they right?

robvolk
Most Valuable Yak

15732 Posts

Posted - 2005-04-01 : 07:45:35

LOL, I love the irony of how THEY are telling YOU to go read a book on how SQL Server works. As if anyone there ever read Books Online (everything you need to know about buffers is in there with about 5 minutes' research) Perhaps you should point them to SQL Team. If they're so sure of their knowledge they'd thoroughly enjoy discussing it with us.

Your coworkers are extraordinarily misinformed. I still don't understand why all of this crap is going on. If there's something wrong with this app that you are trying to fix, it's pretty obvious that their earlier approach doesn't work, or else why would you fix it? Seriously, what is the purpose of your involvement? So far all of the arguments seem to be political rather than technical, basically they don't want to admit that they might have made a mistake in developing it.

I'd suggest avoiding all of the arguments. Do your research on your own, don't make suggestions for changes to anyone, don't ask their advice on anything except simple answers to simple questions. If you can test new techniques without other's involvement, do so. No one at your company seems to want to help you, so avoid them. If you can demonstrate clearly that your method is better, that's the best you can do. They'll either accept it or not.

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-04-01 : 09:16:32

>> The line here is that "JOINS cost a lot
No. SQL Server handles JOINS very well, with proper indexing, there is not much of a performance hit.

>> Sometimes things can't be pretty and you have to duplicate data or commit some sin to sapre yourself a performance costly JOIN.
You might gain som performance for some SELECT operations that can draw benefit from the denormalisation. Granted.
For practically every other operation You will lose performance.

Here is a testbed mimicking your situation as I understood it.
I create a "broad" Event table,
And a 4NF normalized schema with the same information.
The test data is 100000 events:

On my machine, the results were the following;
Retrieving all data for 10 random Events;
For 100000 events the normalized schema is faster(~3x)
For 200000 events the broad table is faster(~3x)

You can try the example if You want, just copy paste the code
(the cleanup is commented away, so run that after You are done)

-- "denormalized" event table
CREATE TABLE Event3NF
(
	EventID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT(NEWID())
	,EventDate DATETIME DEFAULT GETDATE()
	,AID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,BID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,CID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,DID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,EID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,FID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,GID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
	,HID UNIQUEIDENTIFIER NOT NULL DEFAULT(NEWID())
)

-- "fully normalized" event tables
CREATE TABLE Event4NF
(
	EventID UNIQUEIDENTIFIER PRIMARY KEY DEFAULT(NEWID())
	,EventDate DATETIME NOT NULL DEFAULT GETDATE()
)

CREATE TABLE EventA4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),AID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,AID) )
CREATE TABLE EventB4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),BID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,BID) )
CREATE TABLE EventC4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),CID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,CID) )
CREATE TABLE EventD4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),DID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,DID) )
CREATE TABLE EventE4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),EID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,EID) )
CREATE TABLE EventF4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),FID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,FID) )
CREATE TABLE EventG4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),GID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,GID) )
CREATE TABLE EventH4NF( EventID UNIQUEIDENTIFIER REFERENCES Event4NF(EventID),HID UNIQUEIDENTIFIER DEFAULT(NEWID()), PRIMARY KEY(EventID,HID) )
GO

-- insert test data
-- create 200000 random events...
PRINT 'Generating 100000 random events, (same info in "broad" table and in fully normalized schema) ...'; PRINT ''
INSERT Event3NF(EventID) SELECT TOP 100000 NEWID() FROM master..spt_values t1 CROSS JOIN master..spt_values t2
SET NOCOUNT ON
GO
INSERT Event4NF(EventID) SELECT EventID FROM Event3NF
INSERT EventA4NF(EventID,AID) SELECT EventID,AID FROM Event3NF
INSERT EventB4NF(EventID,BID) SELECT EventID,AID FROM Event3NF
INSERT EventC4NF(EventID,CID) SELECT EventID,AID FROM Event3NF
INSERT EventD4NF(EventID,DID) SELECT EventID,AID FROM Event3NF
INSERT EventE4NF(EventID,EID) SELECT EventID,AID FROM Event3NF
INSERT EventF4NF(EventID,FID) SELECT EventID,AID FROM Event3NF
INSERT EventG4NF(EventID,GID) SELECT EventID,AID FROM Event3NF
INSERT EventH4NF(EventID,HID) SELECT EventID,AID FROM Event3NF
GO

-- create view over the "fully normalized" schema, to mimic the "denormalized" schema
-- this is just to make retrieval easy to present
CREATE VIEW vEvent4NF WITH SCHEMABINDING AS
SELECT
	nf4.EventID
	,nf4.EventDate
	,anf4.AID
	,bnf4.BID
	,cnf4.CID
	,dnf4.DID
	,enf4.EID
	,fnf4.FID
	,gnf4.GID
	,hnf4.HID
FROM
	dbo.Event4NF nf4
	JOIN dbo.EventA4NF anf4 ON nf4.EventID = anf4.EventID
	JOIN dbo.EventB4NF bnf4 ON nf4.EventID = bnf4.EventID
	JOIN dbo.EventC4NF cnf4 ON nf4.EventID = cnf4.EventID
	JOIN dbo.EventD4NF dnf4 ON nf4.EventID = dnf4.EventID
	JOIN dbo.EventE4NF enf4 ON nf4.EventID = enf4.EventID
	JOIN dbo.EventF4NF fnf4 ON nf4.EventID = fnf4.EventID
	JOIN dbo.EventG4NF gnf4 ON nf4.EventID = gnf4.EventID
	JOIN dbo.Eventh4NF hnf4 ON nf4.EventID = hnf4.EventID
GO

-- create small search table to retrieve data for 10 randomly chosen EventID's
-- search these 10 random events from the 3NF, 4Nf tables;
CREATE TABLE #searchevents(EventID UNIQUEIDENTIFIER)
INSERT #searchevents(EventID) SELECT TOP 10 EventId FROM Event3NF ORDER BY NEWID()
GO

-- Start performance testing
-- Over the "denormalized" table and the "fully normalized" tables
--CREATE UNIQUE CLUSTERED INDEX ixXXX ON vEvent4NF(EventID,EventDate,AID,BID,CID,DID,EID,FID,GID,HID)
DECLARE @d DATETIME

DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE; PRINT '' -- Clear cached data
PRINT 'Counting total no of events from "broad" table...'
SET @d = GETDATE()
SELECT COUNT(*) AS CntEvents3Nf FROM Event3NF
SELECT DATEDIFF(ms,@d,GETDATE()) AS elapsed_time3nf_Count

/*DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE -- Clear cached data
SET @d = GETDATE()
SELECT COUNT(*) AS CntEvents4Nf FROM vEvent4NF
SELECT DATEDIFF(ms,@d,GETDATE()) AS elapsed_time4nf_Count*/

DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE; PRINT '' -- Clear cached data
PRINT 'Counting total no of events from "4nf" table...'
SET @d = GETDATE()
SELECT COUNT(*) AS CntEvents4Nfb FROM Event4NF
SELECT DATEDIFF(ms,@d,GETDATE()) AS elapsed_time4nf_Count

DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE; PRINT '' -- Clear cached data
PRINT 'Retrieving 10 random events from "broad" table...'
SET @d = GETDATE()
SELECT * FROM Event3NF WHERE EventID IN(SELECT EventId FROM #searchevents) ORDER BY 1,2
SELECT DATEDIFF(ms,@d,GETDATE()) AS elapsed_time3nf_Select10

DBCC DROPCLEANBUFFERS; DBCC FREEPROCCACHE; PRINT '' -- Clear cached data
PRINT 'Retrieving 10 random events from 9-table join of 4nf normalized tables table...'
SET @d = GETDATE()
SELECT * FROM vEvent4NF WHERE EventID IN(SELECT EventId FROM #searchevents) ORDER BY 1,2
SELECT DATEDIFF(ms,@d,GETDATE()) AS elapsed_time4nf_Select10
GO

/*
-- Clean up
DROP TABLE #searchevents
DROP VIEW vEvent4NF
DROP TABLE EventA4NF
DROP TABLE EventB4NF
DROP TABLE EventC4NF
DROP TABLE EventD4NF
DROP TABLE EventE4NF
DROP TABLE EventF4NF
DROP TABLE EventG4NF
DROP TABLE EventH4NF
DROP TABLE Event3NF
DROP TABLE Event4NF
*/
GO

rockmoose

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-04-01 : 09:48:54

As rockmoose states, remember that SQL Server and other relational databases are truly optimized for one primary goal: to efficiently process joins. That's the whole point of a RDMS, the reason it exists!

- Jeff

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-04-01 : 10:20:20

Some more rantings...

You should create a fully normalized logical model of your business data.
And not initially worry about any (technical) performance issues.
Once You have the model, then You are better prepared to design the physical implementation of it.

IMC a fully normalized phyical implementation is best.
If there arises a situation where it makes sense to precalculate and store information
for faster retrieval, then that will be a physical implementation issue, and it can be solved in several ways.
If You start by implementing a denormalized structure, the db will be much less flexible.

rockmoose

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-04-01 : 10:30:10

I know I am badly paraphrases a common quote or saying, but it goes something like this: premature optimization is just about the worst thing a developer or database modeller can do.

- Jeff

rockmoose
SQL Natt Alfen

3279 Posts

Posted - 2005-04-01 : 11:30:17

quote:
Originally posted by jsmith8858

I know I am badly paraphrases a common quote or saying, but it goes something like this: premature optimization is just about the worst thing a developer or database modeller can do.
- Jeff

That was an optimization of my rant I dare say.
I read that quote somewhere...
[url]http://c2.com/cgi/wiki?PrematureOptimization[/url]

rockmoose

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-04-01 : 11:33:29

good link!

- Jeff

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-04-05 : 09:16:47

rockmoose,
i ran that test. The 4nf way was about 6 times slower on my system - just like it was slower on yours. So that proves that breaking the rules will make it faster. But how big a deal is it that the 4nf way is slower? Will the users notice it? That's my main concern. The old database/system uses a denormalized db. I have one shot at the new history table and I have to make the right decision. I think it will be a nightmare to go and change it later. It will be too much work and will affect to many other parts of the project.
How well does this 4nf approach scale? Used it on a big project with millions of history records and ten of these 4nf tables coming off the history table before?

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-04-05 : 10:59:18

>>So that proves that breaking the rules will make it faster

Faster SELECTs, sometimes, for very simple basic "list" queries w/o any math or complicated sorting or criteria/joins. A good example is: finding max value in a set of rows versus sets of columns -- that is much, much faster if your data is normalized and indexed properly. But just blindly joining everything there is and returning all rows, then denormalized might be faster. But with joins, filters, sorts, aggregates, etc -- normalized will just about always be faster using proper indexes and querying techniques.

As for UPDATES, INSERTS, DELTES, sorts, data maintainence, imports, exports, data verification, and virtual every other operation a database can perform: Normalized will always be faster, simplier, and more efficient overall.

The main point is, and this is important: never compare the efficiency of algorithms or programs that return different results. The one that returns correct results is ALWAYS "faster". Be sure to read the examples about data integrity issues of not normalizing in this case, and things like what I mentioned about more than 1 person attached to an event and how you will store it.

Long story short: If using a particular technique requires that you can't store the data correctly or in an efficient manner, then there's no point in measuring the efficiency of that techinique.

- Jeff

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-04-05 : 12:15:32

So in your opinion guys, which is a better design:
1. A Central History Table Strategy
So you would have:
EventHistory Table with columns: EventGUID, EvDesc, EvNotes, EvDate
EventTypes Table with columns: LinkGUID, EventGUID, Type

And then all the link tables like you suggested:
EventCompanies
EventContracts
EventSites
and so on... I could end up with 10 of these tables

OR
2. A History Table for each Entity
So you would have:
Companies
CompanyHistory
Contracts
ContractHistory
Sites
SiteHistory

Each of these history tables would have columns for the other entities that this specific table could have interacted with.

jsmith8858
Dr. Cross Join

7423 Posts

Posted - 2005-04-05 : 12:54:41

It's hard to choose the best physical model for your database if we don't know the logical model. You need to be able to express, in regular English sentences, all of the relationships between the entities you are dealing with.

I also don't understand you EventTypes table, with the LinkGUID and EventGUID columns ... if it is a table of types, in which you can classify each event, you wouldn't need a GUID or a link back to the Events table. It would work the other way around, Events link to an EventType.

Again, answering the specific questions posed in this thread (can an Event have more than 1 person? What else is fully dependant on each event? etc) and more information is required.

If you find as you go through this process you cannot fully state the relations between the various entities you are modelling very precisely from a logical standpoint, it is time to go back to the source and make sure you understand the requirements of the users and the data they need to store.

- Jeff

coolerbob
Aged Yak Warrior

841 Posts

Posted - 2005-04-06 : 03:46:35

Wise words. I will do that. Thankyou very much.

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources