Automatic column normalization and dummy variables

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

SQL Server 2012 Forums

Transact-SQL (2012)

Automatic column normalization and dummy variables

Author

Topic

analyst79
Starting Member

1 Post

Posted - 2013-02-07 : 15:21:50

A friend recommended I post to this community, saying it was both fast an excellent. So here I am.

I am struggling with a business process of transforming data for analysis that I would like to handle in SQL. I have a CSV file that contains numerical and categorical information. The categorical columns would ideally be identified automatically by containing non-numeric data, with an optional list of columns that would be interpreted as categorical even if a numeric interpretation was possible (such as a list of accounting codes). Before loading the CSV, I will not know how many unique values exist in a column. I may have to list out those columns manually, however. Once the columns are identified as categorical, a new table is created for each column, containing that column's distinct values and an ID column. The categorical column in the original data set is then replaced by a series of columns containing Boolean values for each original value in the column. I will also end up scaling numerical columns so that the largest absolute value is 1. Any help as to code samples or a strategy for solving this would be greatly appreciated. Thanks!

Example:
data.csv:
Category,Balance
A,1000
B,2400
C,800

Raw Data table:
Category,Balance
A,1000
B,-2400
C,800

Category table:
Category_id,Category_value
1,A
2,B
3,C

Normalization table:
Column,Factor
Category,2400

Processed Data table:
Category__A,Category__B,Category__C,Sales__norm
1,0,0,0.4166666666666667
0,1,0,-1
0,0,1,0.3333333333333333

LoztInSpace
Aged Yak Warrior

940 Posts

Posted - 2013-02-07 : 20:22:45

You typically don't want to be doing that. You already have all the data you need to work it all out. The rest is a presentation issue, not a data issue.
If you're using SSRS, there are matrix reports that pretty much cover what you are tring to do. A relational database is not the tool for this job. It's just not how they work.

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources