
 Denormalization to speed keyword search?


nathans
Aged Yak Warrior

938 Posts

Posted - 2005-02-22 : 14:22:07
Guys,
I have a wide table [customer] that holds info such as phone, fax, social_security_number, etc.

When creating a new customer, I want to cross-reference all existing customers on certain fields, say phone, fax, and ssn and insert any matches into a separate table [data_match] with the employee_id, matched_column, matched_value.

I have written the procedure to search the [customer] table for these matches, but with 1 million+ rows in the [customer] table this query takes a while, and since we add 1,000 customers per day this solution is not feasible.

So I have moved the cross-referenced data out into a name/value table that stores the column name and column value.
Account 1234 would have the following rows for the three searched columns:


1234 phone 1234567890
1234 fax 0987654321
1234 ssn 123456789


I indexed the value column, and of course this makes the query lightning fast compared to the other solution... but I don't feel great about duplicating the data. Am I going about this wrong? I could index the other columns in the primary table, but that would really be eight columns I'd be indexing (I have included only three in this example).


Thoughts?
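[Editor's note: a minimal sketch of the name/value table described above, with illustrative table and column names and types, since none are given in the post:]

```sql
-- Hypothetical name/value lookup table: one row per searched column
-- per account. All names and sizes here are illustrative.
CREATE TABLE CUSTOMER_SEARCH_VALUES (
    ACCOUNT_ID   int          NOT NULL,
    COLUMN_NAME  varchar(30)  NOT NULL,  -- e.g. 'phone', 'fax', 'ssn'
    COLUMN_VALUE varchar(100) NOT NULL
)

-- The index on the value column is what makes exact-match lookups fast.
CREATE INDEX IX_CUSTOMER_SEARCH_VALUES
    ON CUSTOMER_SEARCH_VALUES (COLUMN_VALUE)
```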

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-02-22 : 17:13:33
I think it would be simpler to have the indexes on the customer table directly. The other approach doesn't save anything in terms of disk, and you have a lot of additional work to maintain the lookup table.



Codo Ergo Sum

nathans
Aged Yak Warrior

938 Posts

Posted - 2005-02-22 : 17:56:04
I am somewhat hesitant to index those eight columns and perform the search on the primary transactional table rather than the lookup table. The primary table holds all the data for all customers... I am only interested in searching the last 45 days of data. The lookup table would only hold this 45-day subset (<1 million rows), whereas the transactional table is ~10 million rows.

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-02-22 : 17:59:03
OK, but in your original post, you said you wanted to search all existing customers.

If you are going to maintain another table, you can use your approach or use a flat structure. With only 45,000 rows (1,000/day × 45 days), it should be fast either way.

The trick will be maintaining your lookup data with current information so that you don't miss any additions or updates. That may be easier with a flat structure.

Codo Ergo Sum

nathans
Aged Yak Warrior

938 Posts

Posted - 2005-02-22 : 18:04:11
Thanks Michael, I should have reviewed my original post :)

Yes, I am working with a subset of data here, but it's still a rather large chunk. I'll be searching email_address, phone, fax, and ssn for both customer and co-customer, so eight columns total.

Any new ideas, now that you have the full story?

- Nathan


Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-02-22 : 18:27:15
One thing you might try is storing CHECKSUM columns for each of the eight items and putting indexes on them. Then you are doing really fast index lookups on INTEGER columns, which scales a lot better than doing searches on wide VARCHAR columns.


SELECT
   CUSTOMER_ID
FROM
   CUSTOMER_LOOKUP
WHERE
   SSN_CHECKSUM = CHECKSUM(@SSN) AND
   SSN = @SSN
UNION
SELECT
   CUSTOMER_ID
FROM
   CUSTOMER_LOOKUP
WHERE
   EMAIL_CHECKSUM = CHECKSUM(@EMAIL) AND
   EMAIL = @EMAIL
...and so on...
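[Editor's note: a sketch of the lookup table this query assumes, with illustrative names. CHECKSUM is deterministic, so the checksum columns can be computed columns with indexes on them, sparing you from maintaining the values by hand:]

```sql
-- Hypothetical CUSTOMER_LOOKUP table (SQL Server 2000).
-- Names and sizes are illustrative.
CREATE TABLE CUSTOMER_LOOKUP (
    CUSTOMER_ID    int          NOT NULL PRIMARY KEY,
    SSN            varchar(11)  NULL,
    EMAIL          varchar(100) NULL,
    -- Computed columns; CHECKSUM is deterministic, so they are indexable.
    SSN_CHECKSUM   AS CHECKSUM(SSN),
    EMAIL_CHECKSUM AS CHECKSUM(EMAIL)
)

CREATE INDEX IX_LOOKUP_SSN   ON CUSTOMER_LOOKUP (SSN_CHECKSUM)
CREATE INDEX IX_LOOKUP_EMAIL ON CUSTOMER_LOOKUP (EMAIL_CHECKSUM)
```

Note the query still compares the actual value (`SSN = @SSN`) after the checksum match, because CHECKSUM can collide: different values may hash to the same integer.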



Codo Ergo Sum

robvolk
Most Valuable Yak

15732 Posts

Posted - 2005-02-23 : 08:13:08
One thing to keep in mind if you use CHECKSUM(): it won't work for LIKE or other partial matches, since the checksum only matches exact values.

How about using an indexed view? This would reduce the amount of duplicated data, as it would only exist in the index, not as a table unto itself. There would be no work in keeping it up-to-date either.
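[Editor's note: a minimal sketch of what such an indexed view might look like, assuming illustrative object names. SQL Server 2000 indexed views require SCHEMABINDING, two-part table names, a unique clustered index, and specific SET options at creation time:]

```sql
-- Hypothetical indexed view over the searchable columns.
CREATE VIEW dbo.CUSTOMER_SEARCH
WITH SCHEMABINDING
AS
SELECT CUSTOMER_ID, PHONE, FAX, SSN, EMAIL
FROM dbo.CUSTOMER
GO

-- The unique clustered index is what materializes the view's data.
CREATE UNIQUE CLUSTERED INDEX IX_CUSTOMER_SEARCH
    ON dbo.CUSTOMER_SEARCH (CUSTOMER_ID)
```

A view like this keeps each searched column in its own column rather than pivoting them into name/value rows, which matters for the UNION restriction on indexed views.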

nathans
Aged Yak Warrior

938 Posts

Posted - 2005-02-23 : 12:49:28
Thanks Michael. I will definitely look into that solution.

Rob, will using indexed views have any performance impact on the table itself? I need to read up on those, but it seems the view would be recompiled every time a row is inserted or modified in the transactional table?

Thanks again, guys.


robvolk
Most Valuable Yak

15732 Posts

Posted - 2005-02-23 : 14:04:20
I don't see anything in Books Online that says it would recompile whenever something is added (if it did, it would make no sense to use them). This seems to nail the details:
quote:
Creating a clustered index on a view stores the data as it exists at the time the index is created. An indexed view also automatically reflects modifications made to the data in the base tables after the index is created, the same way an index created on a base table does. As modifications are made to the data in the base tables, the data modifications are also reflected in the data stored in the indexed view. The requirement that the clustered index of the view be unique improves the efficiency with which SQL Server can find the rows in the index that are affected by any data modification.

Indexed views can be more complex to maintain than indexes on base tables. You should create indexes only on views where the improved speed in retrieving results outweighs the increased overhead of making modifications. This usually occurs for views that are mapped over relatively static data, process many rows, and are referenced by many queries.
If you're going to be making lots of updates to existing data, on top of inserting the new stuff, then it might not make sense to use an indexed view for this function.

I'd also envisioned the view using a UNION operator to build it, which is not allowed in an indexed view. If there's no other way of transforming the data then indexed views are a dead end.

nathans
Aged Yak Warrior

938 Posts

Posted - 2005-02-23 : 14:47:11
The existing data is not modified often, though new rows are added fairly frequently (1,000 rows/day). But like you said, how can we transform the data without the use of the UNION? Are indexed views out?

I'm thinking of going with the original solution. I can just create a job to run during off-peak hours to populate the table with 45 days' worth of data, then cross-reference against that during new customer inserts. At least this solution will allow me to add or remove searched columns and increase or decrease the day interval just by modifying the job... but I still feel like there is a better way.

thanks again!

Michael Valentine Jones
Yak DBA Kernel (pronounced Colonel)

7020 Posts

Posted - 2005-02-23 : 18:16:23
You could use a trigger to populate your lookup table with adds, changes, and deletes and just schedule a job once per day to remove old data. There is some overhead with this, but it would keep your lookup table completely up to date.

You could also maintain the lookup table within the stored procedures you are using to maintain the main table, if you are using stored procedures. If the main table is being maintained by many different procedures, the trigger may be easier to implement.
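[Editor's note: a sketch of such a trigger under assumed names. SQL Server 2000 triggers fire once per statement, so the `inserted` and `deleted` pseudo-tables may contain many rows and the trigger must handle sets, not single rows:]

```sql
-- Hypothetical trigger keeping CUSTOMER_LOOKUP in sync with the main table.
-- All object and column names are illustrative.
CREATE TRIGGER trg_customer_sync ON dbo.CUSTOMER
FOR INSERT, UPDATE, DELETE
AS
BEGIN
    -- Remove lookup rows for customers that were deleted or changed...
    DELETE CL
    FROM CUSTOMER_LOOKUP CL
    JOIN deleted d ON CL.CUSTOMER_ID = d.CUSTOMER_ID

    -- ...then re-add the current values for inserted/updated customers.
    INSERT INTO CUSTOMER_LOOKUP (CUSTOMER_ID, PHONE, FAX, SSN, EMAIL)
    SELECT i.CUSTOMER_ID, i.PHONE, i.FAX, i.SSN, i.EMAIL
    FROM inserted i
END
```

A scheduled job can then simply delete lookup rows older than 45 days each night.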


Codo Ergo Sum
   
