Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2012 Forums
 Transact-SQL (2012)
 Compare two tables with large size

Author  Topic 

WAmin
Starting Member

16 Posts

Posted - 2014-04-15 : 08:35:59
Hi,

I have 2 tables, first table have 65 million records and second table have 20 million records.
I want to compare if text field "InventoryItem" in TABLE01 is different from TABLE02 base on Field "ItemCode".

Structure of table is below



CREATE TABLE TABLE01
(
ID INT IDENTITY(1,1) PRIMARY KEY,
FirtNAME VARCHAR(50),
LastNAME VARCHAR(50),
PhoneNumber VARCHAR(20),
FaxNumber VARCHAR(20),
Department INT,
City INT,
State INT,
Country INT,
ItemCode VARCHAR(13),
InventoryItem VARCHAR(50))


CREATE TABLE TABLE02
(
ID INT IDENTITY(1,1) PRIMARY KEY,
FirtNAME VARCHAR(50),
LastNAME VARCHAR(50),
PhoneNumber VARCHAR(20),
FaxNumber VARCHAR(20),
Department INT,
City INT,
State INT,
Country INT,
ItemCode VARCHAR(13),
InventoryItem VARCHAR(50))


As this is on live system, to process records i copied specific column to Staging database to delete irrelevant records before updating in live system.



CREATE TABLE TABLE01
(
ID INT,
ItemCode VARCHAR(13),
InventoryItem VARCHAR(50))


CREATE TABLE TABLE02
(
ID INT,
ItemCode VARCHAR(13),
InventoryItem VARCHAR(50))


To staging server
Now i created INDEX over ItemCode and InventoryITEM on both tables (which itself take long time),



CREATE INDEX IDX01 ON TABLE01 (ItemCode ASC,InventoryItem ASC)
CREATE INDEX IDX01 ON TABLE02 (ItemCode ASC,InventoryItem ASC)



And then execute following query



DELETE FROM TABLE02 WHERE EXISTS (SELECT * FROM TABLE01 WHERE TABLE01.ItemCode=TABLE02.ItemCode AND TABLE01.InventoryItem=TABLE02.InventoryItem)



This one is another query taking very long.

Is there another easy/faster way to achieve this goal without spending hours.
I did tried creating hash for inventoryITEM column by using CHECKSUM but later read CHECKSUM can return same CHECKSUM for different value, so i cannot use that.

Any hint would be very helpful.

TG
Master Smack Fu Yak Hacker

6065 Posts

Posted - 2014-04-15 : 11:53:16
how many distinct ItemCode values are there across these 65m and 20m row tables?

if Table01 has:
ID = 1, ItemCode = 'ItemCode_01', InventoryItem = 'InventoryItem_01'
ID = 2, ItemCode = 'ItemCode_01', InventoryItem = 'InventoryItem_02'

and table02 has the same two rows:
ID = 1, ItemCode = 'ItemCode_01', InventoryItem = 'InventoryItem_01'
ID = 2, ItemCode = 'ItemCode_01', InventoryItem = 'InventoryItem_02'

Both rows be deleted from table02?

Be One with the Optimizer
TG
Go to Top of Page

stepson
Aged Yak Warrior

545 Posts

Posted - 2014-04-15 : 15:05:27
I don't have your environment(so I can't test it)... but an idee

;with aCTE
AS
(SELECT
B.ID
FROM
TABLE01 A
INNER JOIN
TABLE02 B
ON A.ItemCode=B.ItemCode
AND A.InventoryItem<>B.InventoryItem)


DELETE A
FROM TABLE02 A
INNER JOIN aCTE B
ON A.ID=B.ID






sabinWeb MCP
Go to Top of Page

Lamprey
Master Smack Fu Yak Hacker

4614 Posts

Posted - 2014-04-15 : 16:11:33
Hard to say since we don't know the number of deletes happening. But, you might want to batch the deletes...?
Go to Top of Page

WAmin
Starting Member

16 Posts

Posted - 2014-04-15 : 23:51:12
Just a bit more clarification,
-> TABLE02 would refresh every month and i have to update information in TABLE01 base on new records in TABLE02. Everything was ok till customer complain they also update description field which is InventoryItem.
-> Thing don't end here, sometimes in new file customer send us partial information.Now my job is not only to find if there is change in description but also to check if new information is partially available and i must not update those information.
Lets say my current InventoryITEM is "SCREWDRIVER,FLAT TIP" but new information came as "SCREWDRIVER,FLAT TI". Doing <> mean i have new information available but thats not the case but new information is truncated so i have to ignore new info.

@TG, There can be multiple same ItemCode with different InventoryITEM records. So yah it is not distinct.
For same itemCode there will be same InventoryITEM.
For your second question yes, we need to remove those records from table02.

@stepson, thank you. but again coping and indexing text fields taking ages and deletion seems taking forever.

@Lamprey, I did finish process and from 20 million records 18.5 mil were deleted as information is same in those records.

At the end i have 2 scenarios
1- DELETE ALL THOSE Where "TABLE01.ItemCode=TABL02.ItemCode AND TABLE01.InventoryITEM=TABLE02.InventoryITEM"
2- DELETE ALL Those WHERE "TABLE01.ItemCode=TABL02.ItemCode AND SUBSTRING(TABLE01.InventoryITEM,1,LEN(TABLE02.InventoryITEM))=TABLE02.InventoryITEM"

I can do deletion by creating INDEX on ITEMCode and InventoryITEM but,
1- Copying 65+25 million records across take sometime.
2- Creating Index even take longer.
3- Deletion on scenario 1 straight forward.
4- Deletion on scenario 2 also taking forever because of partial checks.

I know it can't be done in minutes but i really don't want to wait for like 12 hours to finish this job (unless thats the only way).
Go to Top of Page

stepson
Aged Yak Warrior

545 Posts

Posted - 2014-04-16 : 01:08:31
try another approch:
-STEP1:copy the informations that you want to a new table : TABLE02_NEW (no index on it yet, and with identity set SET IDENTITY_INSERT TABLE02_NEW ON)
-STEP2:you can drop table TABLE02 (the infos are now in TABLE02_NEW)
-STEP3:rename TABLE02_NEW to TABLE02
-STEP4: add index and SET IDENTITY_INSERT TABLE02_NEW OFF





sabinWeb MCP
Go to Top of Page

WAmin
Starting Member

16 Posts

Posted - 2014-04-16 : 01:47:14
quote:
Originally posted by stepson

try another approch:
-STEP1:copy the informations that you want to a new table : TABLE02_NEW (no index on it yet, and with identity set SET IDENTITY_INSERT TABLE02_NEW ON)
-STEP2:you can drop table TABLE02 (the infos are now in TABLE02_NEW)
-STEP3:rename TABLE02_NEW to TABLE02
-STEP4: add index and SET IDENTITY_INSERT TABLE02_NEW OFF
sabinWeb MCP



Sabin, thank you.
For TABLE02 ID Column doesn't matter (would heap data would be an issue while indexing and deleting? Should i create cluster index on a column?). Because the main column which i need is ItemCode. Anyway my main issue is matching text which itself copying around, indexing and then matching and deleting takes way longer.
I am trying a new approach and will update once i finish that (if successful). Meanwhile all ideas are welcome (more heads are better than one ).
Go to Top of Page

WAmin
Starting Member

16 Posts

Posted - 2014-04-21 : 02:57:50
Just to update,
Was able to finish entire process from start to end in 24 minutes (first attempt to match text fields took more than 17 hours without any avail).

Trick was to convert text filed into Hash of integer type, in current version of sql server there is no option to convert huge data without getting duplicate due to design of checksum function.
So what i did, i convert the text into CONVERT(BIGINT,HASHBYTES('SHA1',ItemCode)) AND CONVERT(BIGINT,HASHBYTES('SHA1',InventoryItem)) and then put index on respective fields.


CREATE TABLE TABLE01
(
ID INT,
ItemCode BIGINT,
InventoryItem BIGINT)


CREATE TABLE TABLE02
(
ID INT,
ItemCode BIGINT,
InventoryItem BIGINT)



I also worked on partial description match. After that matched with live database and transfer records to live system (after match of 66mil to 30mil records, i got 125k updates).
I hope this could help someone else who is facing same problem with large database.
Go to Top of Page
   

- Advertisement -