Generalized questions about Joins and Distinct

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2005 Forums

Transact-SQL (2005)

Generalized questions about Joins and Distinct

Author

Topic

Awreilly
Starting Member

2 Posts

Posted - 2010-01-16 : 01:48:05

Hi,

I'm attempting to join two tables, and have discovered that for no apparent reason, one of the tables has appeared to have doubled about 50% of the rows in the table. I don't have much experience in SqlServer 2005, but I was wondering if there was some way to insert Distinct into a join so that when the tables join (They have the same primary key), the data duplication in one of the tables doesn't carry over to the join. Throwing the word distinct after the ON in the join just brings up a syntax error.

A few notes:
--These tables are quite large, and they were automatically set up and automatically derived data from a program that is integral to our company. Deleting the extra rows in not an option, because not only are the tables large, but I've already seen that messing with the data in SqlServer directly can cause that particular program to stop functioning in unexpected and non-sensical ways.

The basic structure is
SELECT <horde of fields from both tables>
FROM emdbuser.Summary as s
LEFT OUTER JOIN emdbuser.Borrower as B
ON s.Guid = b.Guid

(b.Guid being the one where the table has 50% more data than it should)

Thanks.

sql-programmers
Posting Yak Master

190 Posts

Posted - 2010-01-16 : 03:56:06

Hi,

You need to add DISTINCT in the follolwing place,

SELECT DISTINCT <horde of fields from both tables>
FROM emdbuser.Summary as s
LEFT OUTER JOIN emdbuser.Borrower as B
ON s.Guid = b.Guid

If you still get the problem, give the example records and column names. It will be easier to answer.

SQL Server Programmers and Consultants
http://www.sql-programmers.com/

Kristen
Test

22859 Posts

Posted - 2010-01-16 : 05:09:42

Personally I would not add a DISTINCT as that will just hide the problem (and give SQL 50% more work to do!)

I think your problem is that for each GUID in your [Summary] table there are two matching rows in [Borrower] with the same GUID value. (Actually there may be some GUIDs that match 1:1 and others that match 1:many, giving the impression of 50%??)

If the multiple records, for a given GUID value, in [Borrower] are identical you presumably need to remove the duplicates in that table, and fix whatever allowed them to get there. (If the figure is exactly 50% this is the likely scenario)

If the records are not identical maybe you need to add an extra column in your join - perhaps [Borrower] has one Active records for each GUID and some Inactive / Archive records for the same GUID? Thus adding "AND B.IsActive = 1" to the JOIN would fix that

Awreilly
Starting Member

2 Posts

Posted - 2010-01-21 : 18:56:53

This is what I've tried writing:

select distinct b.Guid, BorrowerFirstName, BorrowerLastName,
LoanNumber, LoanFolder, LoanAmount, Address1, Zip, City,
Lender, EscrowVendor, TitleVendor, CoBorrowerFirstName,
CoBorrowerLastName, AppraisalVendor, LoanPurpose,
LienPosition, LastModified,DateFileOpened, LoanProcessorName,
HomePhone, WorkPhone, CellPhone
from emdbuser.LoanSummary s
left outer join emdbuser.LoanBorrowers b
ON s.Guid = b.Guid
where LoanFolder != '(Trash)'
order by s.BorrowerLastName

The result pulls 8643 rows (Note without the distinct it pulls 9012 rows). However, if I run

SELECT distinct Guid from emdbuser.LoanBorrowers there are only 6096 rows (There are also 6096 items in LoanSummary, and I'm sure that this number is correct). So using just distinct for some reason isn't fully working.

Deleting out the duplicated Guids isn't really feasible because there's 3000 of them, and the program that runs off of SQLServer (our loan origination software) has a tendency to glitch up badly if I start deleting things directly in Sqlserver.

Kristen
Test

22859 Posts

Posted - 2010-01-22 : 01:13:49

"I'm sure that this number is correct). So using just distinct for some reason isn't fully working."

I expect that is because the rows with duplicate GUIDs are not identical in the other fields you are using.

If your GUIDs are supposed to be unique then the problem is that something (e.g. your loan origination software) is creating multiple rows for a given GUID.

Either that or there are supposed to be duplciates, and your query needs to choose the correct one, of the duplciates, to join to - maybe some are marked as Archived or somesuch?

This will show you the number of rows, and the number of different guids

SELECT COUNT(*), COUNT(DISTINCT Guid) from emdbuser.LoanSummary

SELECT COUNT(*), COUNT(DISTINCT Guid) from emdbuser.LoanBorrower

and then to find some rows to compare:


SELECT TOP 100 [T_GUID]=guid, *
FROM emdbuser.LoanBorrowers
WHERE guid IN
(
    SELECT guid
    FROM emdbuser.LoanBorrowers
    GROUP BY guid
    HAVING COUNT(*) > 1
)
ORDER BY guid

and compare adjacent rows to see which columns are different for rows with the same GUID

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources