malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-18 : 18:07:25
What I'm trying to do is develop a process for importing large amounts of data from various sources (flat files, Excel, etc.) into staging SQL tables where all of the columns will be varchars, and then to profile that data and determine what type of data is in each column, i.e. is the data numeric, dates, strings, etc.

I know how to do this on a small scale, but I'm not sure that any of my approaches will scale to the large data sets we will be working with, so is there some kind of standard process or solution for this?

For example, if we import a comma-delimited file with headers and just have SQL build the table automatically with everything as varchar(1000), and the data set has 100 columns and 5 million rows, is there a way to quickly determine the type of data in each column? For example: column one contains all values that can be converted to int, column two contains chars so will be varchar with a max length of 53, column three contains all values that can be converted to datetime, column four contains decimal values, etc., etc.?
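[Editor's note: on SQL Server 2012 and later, TRY_CONVERT returns NULL instead of raising an error on a failed conversion, so this kind of per-column profiling can be done in a single scan with no TRY/CATCH tricks. A minimal sketch of that idea — the staging table and column name here are hypothetical:]

```sql
-- Profile one varchar staging column in a single pass (SQL Server 2012+).
-- TRY_CONVERT yields NULL on failure, so COUNT() over it counts only
-- the values that successfully convert to each candidate type.
SELECT
    COUNT(*)                                 AS total_rows,
    COUNT(TRY_CONVERT(int, col1))            AS int_ok,
    COUNT(TRY_CONVERT(datetime, col1))       AS datetime_ok,
    COUNT(TRY_CONVERT(decimal(10,2), col1))  AS decimal_ok,
    MAX(LEN(col1))                           AS max_len
FROM staging;
```

If int_ok equals total_rows, every value in the column converts cleanly to int; comparing the counts across types gives the "tightest" type the column supports without losing data.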
russell (Pyro-ma-ni-yak, 5072 Posts)
Posted - 2009-08-18 : 18:12:20
This is the wrong approach altogether. Strongly type everything. If the problem is disparate data sources, create a schema for clients to conform to... and publish it and enforce it.
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-18 : 18:28:09
quote: Originally posted by russell
    This is the wrong approach altogether. Strongly type everything. If the problem is disparate data sources, create a schema for clients to conform to... and publish it and enforce it.

Well, that's a nonstarter. Of course that is the ideal approach, but that's not the reality. The reality is we are dealing with flat files exported from old mainframe systems that had little or no validation on the data, and we are tasked with converting the data into new schemas.

If the clients had strongly typed data in the first place then there wouldn't be a job for us to do :p
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-19 : 13:19:32
Well, this is what I have ended up with. It will only be used on a column-by-column basis. I still have some issues with it though:

CREATE PROCEDURE sp_GetColumnType
    @ColumnName varchar(50),
    @Table varchar(50),
    @Type varchar(10) OUTPUT
WITH RECOMPILE
AS
DECLARE @SQL varchar(max)
SET @SQL = 'CREATE PROCEDURE #sp_GetColType @Result varchar(10) OUTPUT AS
BEGIN TRY
    SELECT CAST(' + @ColumnName + ' AS datetime) FROM ' + @Table + '
    SELECT @Result = ''DateTime''
END TRY
BEGIN CATCH
    BEGIN TRY
        SELECT CAST(' + @ColumnName + ' AS int) FROM ' + @Table + '
        SELECT @Result = ''Int''
    END TRY
    BEGIN CATCH
        BEGIN TRY
            SELECT CAST(' + @ColumnName + ' AS float) FROM ' + @Table + '
            SELECT @Result = ''Float''
        END TRY
        BEGIN CATCH
            SELECT @Result = ''VarChar''
        END CATCH
    END CATCH
END CATCH
RETURN'
SET NOCOUNT ON
EXEC(@SQL)
EXEC #sp_GetColType @Result = @Type OUTPUT
DROP PROC #sp_GetColType
SET NOCOUNT OFF
RETURN

The main problem is that this returns the rows from the SELECT statements, but I thought there was a way within an SP to turn off return of the result sets... It's not pretty, but it was the only thing I could come up with.
Vinnie881 (Master Smack Fu Yak Hacker, 1231 Posts)
Posted - 2009-08-19 : 14:03:24
Malachi, I would agree with russell. Take advantage of the SQL SELECT ... INTO method and strongly type:

select convert(datetime, mycol1) as MyDateCol1,
       convert(int, mycol2) as MyIntCol2,
       ...
into MyNewSqlTable
from ...

Any other way you are going to have a really tough time working with.

Success is 10% Intelligence, 70% Determination, and 22% Stupidity.
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-19 : 14:56:56
I think you guys are missing the question. Correct me if I'm wrong, malachi151, but you will eventually transform all this data to strongly typed permanent tables, right? But you want to develop a process that will "discover" what those types should be based on the data in staging tables?

I actually think it makes sense to stage the data into non-typed tables so that the import is very likely to succeed. Deal with dupes, referential integrity, business rules (and even types) after you have the data in SQL. The alternative would be to perform those functions prior to getting the data into SQL using a different technology.

Be One with the Optimizer
TG
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-19 : 15:07:13
I don't think that staging/transformation could work without intervention. There could be dates, for instance, stored in wacky formats that could convert to int, decimal, float, binary, etc., but unless you know the context of those values you may not be able to tell they are dates. The actual header text may give you a clue, but you couldn't rely on that. Scale and precision could be tricky to figure out as well. The actual values may not be representative of what "thing" they represent.

Be One with the Optimizer
TG
Vinnie881 (Master Smack Fu Yak Hacker, 1231 Posts)
Posted - 2009-08-19 : 15:59:14
I wasn't disputing staging the data; I was disputing using a function/procedure to determine the type of data.

Success is 10% Intelligence, 70% Determination, and 22% Stupidity.
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-19 : 16:40:57
quote: Originally posted by TG
    I think you guys are missing the question. Correct me if I'm wrong, malachi151, but you will eventually transform all this data to strongly typed permanent tables, right? But you want to develop a process that will "discover" what those types should be based on the data in staging tables? I think it makes sense actually to stage the data into non-typed tables so that the import is very likely to succeed.

Correct. This is all going into strongly typed tables eventually; this is just the first stage of mapping and discovery. We already know that much of the data is bad. Trying to convert it directly will fail every time.

We have a defined schema that we are moving the data to. What we want to see is where the existing data won't directly convert over. For example, if we have an address field that is defined as nvarchar(50), but we see that in their data they have addresses as long as 62 chars; or if we have a field that is an integer, but the field that maps to it has chars in it, like an ID number column; or a field that we have as decimal(4,2), but there are values that have more significant digits; or if they have invalid dates (some sources just used text fields to collect date info, etc.).

So, doing a direct convert isn't even a possibility.

The process we use is to move everything into tables where every column is a varchar(1000), then profile the data there, do mapping to our schema, and clean up the data, much of which gets either normalized or denormalized in the process, and much of which gets converted into derived fields using formulas to derive new values from the supplied data (everything from mathematical derivations to stuff like breaking whole names into first, middle, last, etc.).

What we want to be able to easily see are the discrepancies between the source data and the target schema, i.e. XYZ source column is mapped to ABC destination column, which is an integer type, but XYZ column contains character data, etc.

What I have now basically works; I was just wondering if there was some other, possibly better, way to do it.
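[Editor's note: the discrepancy checks described in this post can be expressed as plain aggregate queries against the staging table. A sketch of that idea — the staging table and column names here are hypothetical, and the LIKE patterns are deliberately simple first-pass tests:]

```sql
-- Check 1: values too long for a target nvarchar(50) address column.
-- Anything returned over 50 needs attention before conversion.
SELECT MAX(LEN(address)) AS max_address_len
FROM staging_source;

-- Check 2: values in a column mapped to an integer ID that contain
-- non-digit characters (LIKE with a negated character class).
SELECT id_number
FROM staging_source
WHERE id_number LIKE '%[^0-9]%';

-- Check 3: values mapped to decimal(4,2) with too many digits overall,
-- or with more than 2 digits after the decimal point.
SELECT amount
FROM staging_source
WHERE LEN(REPLACE(amount, '.', '')) > 4
   OR CHARINDEX('.', REVERSE(amount)) > 3;
```

Each check returns the offending values rather than just a type guess, which is what is needed to decide whether to widen the target column or clean the source data.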
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-19 : 17:31:24
Couple observations:

What you've just described sounds different than what the SP posted above does. Sounds like you want to check the data (column by column, table by table, row by row) for any values that either won't convert to the target type, or where you may lose some precision or be truncated. So if your target type was, say, decimal(10,2) but one of your values was 2.345, then you want to see that value to decide if you need to change your precision or just allow it to be rounded to 2 decimals. Whereas the SP above would just return its best guess at a datatype based on all the values in the table. Then if your resulting type was different than the target table type you would investigate further?

As written, that SP's output will only be based off of the last value it encountered. In other words, in a 5-million-row table, @Result will be set 5 million times and only the result from the last value evaluated will be returned - so that doesn't sound like a good solution.

Be One with the Optimizer
TG
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-20 : 09:30:48
quote: Originally posted by TG
    As written, that SP's output will only be based off of the last value it encountered. In other words, in a 5-million-row table, @Result will be set 5 million times and only the result from the last value evaluated will be returned - so that doesn't sound like a good solution.

It shouldn't be based off the last value; it should be based on any value encountered. If it's a column with date values, but some are wrong, once it hits a wrong value an exception will be thrown, going to the CATCH; it will try and fail to convert to int, then try and fail to convert to float, then it will report back a varchar.

I could test it out on a column with invalid values but a valid last row to see if what you are saying is correct, but that isn't how I understood it to work, and the one I posted is just a preliminary test one that isn't as detailed on the types, etc.
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-20 : 10:51:12
I doubt this is the behavior you want. In this case I called the (test) SP you posted above with a column of INTs with one datetime thrown in there.

create table junk (i varchar(500))
insert junk
select '2' union all
select '2' union all
select '2' union all
select '2' union all
select '2009-12-31' union all
select '2' union all
select '2'

declare @t varchar(10)
exec sp_GetColumnType
    @ColumnName = 'i',
    @Table = 'junk',
    @Type = @t OUTPUT
select @t
drop table junk

OUTPUT:
----------------------------------
2
2
2
2
----------------------
2
2
2
2
----------
VarChar

I also don't think it is necessary to create, call, and drop an SP inside an SP. There are better ways, but the question is: what output would you like to see for this example? What output would you like to see for a table with, say, 200 dates, 200 ints, and 200 varchars?

Be One with the Optimizer
TG
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-20 : 11:09:35
Perhaps what you're looking for is an analysis of value types by column for a table, i.e. something like:

[<TableName>]
ColumnName  datetimeCount  IntCount  moneyCount  varcharCount
Col1        0              0         500000      10
Col2        500010         0         0           0
Col3        0              0         0           500010

What I usually do on large data projects is create the staging table(s) like you have, but then write big ole transformation scripts where, table by table, column by column, I do the transformations. As I discover outliers, poorly formed values, orphaned rows, etc., I script in the appropriate fixes as I go. It is time consuming but, I believe, necessary. If I need to save off data that I couldn't fix for some product analysis and later integration, then so be it.

Be One with the Optimizer
TG
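[Editor's note: the per-column counts TG describes can also be gathered in one scan of the table with conditional aggregation, instead of one dynamic query per candidate type. A sketch using the `junk` test table from earlier in the thread — note that ISDATE and the LIKE patterns are loose first-pass tests, not final type decisions:]

```sql
-- One scan of the staging table produces all the counts for one column.
-- ISDATE is locale-sensitive and the digit test rejects signs and
-- decimal points, so treat these as a rough profile.
SELECT
    'i'                                                 AS ColumnName,
    SUM(CASE WHEN ISDATE(i) = 1 THEN 1 ELSE 0 END)      AS datetimeCount,
    SUM(CASE WHEN i NOT LIKE '%[^0-9]%'
              AND i <> '' THEN 1 ELSE 0 END)            AS intCount,
    SUM(CASE WHEN i LIKE '%[a-z]%' THEN 1 ELSE 0 END)   AS varcharCount
FROM junk;
```

On a 100-column, 5-million-row staging table this still means one full scan per table rather than three dynamic queries per column, which matters at the scale described in the original question.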
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-20 : 11:37:24
Yep, VarChar is what I expect to see in the example given.

This isn't for detailed analysis; it's for a first pass. Lets us see immediately which columns match up okay and which ones don't.

If there is a way to do this without creating the temp SP, I'm all for it :)

What I really want to do is turn off the result set so that the rows from the SELECTs are not returned at all. I thought there was a way to do that, but I dunno.

Your later suggestion is good too; my concern is just that it would take too long. How would a result like that be produced, with a cursor? I suppose we could let something like that run overnight...

Edit: Maybe a better way to think of what my code does is that it tells you the most strongly typed column you can put the data into without losing any data.
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-20 : 13:45:45
>>Yep, VarChar is what I expect to see in the example given.

Not sure why, if there are only datetimes and ints in the sample table?

>>If there is a way to do this without creating the temp SP, I'm all for it :)

Here is one way that just uses your existing logic:

use tempdb
if object_id('sp_GetColumnType') > 0
    drop proc sp_GetColumnType
go
CREATE PROCEDURE sp_GetColumnType
    @ColumnName varchar(50),
    @Table varchar(50),
    @Type varchar(10) OUTPUT
AS
DECLARE @SQL nvarchar(max)
       ,@p nvarchar(max)
       ,@Result varchar(10)

BEGIN TRY
    set @sql = N'select @result = ''datetime'' from ' + @table
             + ' where convert(datetime, ' + @columnName + ') = convert(datetime, ' + @columnName + ')'
    set @p = N'@result varchar(10) output'
    exec sp_executesql @sql, @p, @result = @result output
END TRY
BEGIN CATCH
    BEGIN TRY
        set @sql = N'select @result = ''int'' from ' + @table
                 + ' where convert(int, ' + @columnName + ') = convert(int, ' + @columnName + ')'
        set @p = N'@result varchar(10) output'
        exec sp_executesql @sql, @p, @result = @result output
    END TRY
    BEGIN CATCH
        BEGIN TRY
            set @sql = N'select @result = ''float'' from ' + @table
                     + ' where convert(float, ' + @columnName + ') = convert(float, ' + @columnName + ')'
            set @p = N'@result varchar(10) output'
            exec sp_executesql @sql, @p, @result = @result output
        END TRY
        BEGIN CATCH
            set @result = 'varchar'
        END CATCH
    END CATCH
END CATCH
set @type = @result
go

create table junk (i varchar(500))
set nocount on
insert junk
select '2' union all
select '2' union all
select '2' union all
select '2' union all
select '2009-12-31' union all
select '2' union all
select '2'
set nocount off

declare @t varchar(10)
exec sp_GetColumnType
    @ColumnName = 'i',
    @Table = 'junk',
    @Type = @t OUTPUT
select @t [Type]
drop table junk
drop proc sp_GetColumnType

OUTPUT:
Type
----------
varchar

>>Your later suggestion is good too, my concern is just that it would take too long.
>>How would a result like that be produced, with a cursor? I suppose we could let something like that run overnight...

Here is one way to do that:

EDIT: You obviously would need to beef up the logic to determine the types, but this is the idea. END_EDIT

create table junk (i varchar(500))
set nocount on
insert junk
select '2' union all
select '2' union all
select '2' union all
select '2' union all
select '2009-12-31' union all
select '2' union all
select '2'
set nocount off
go
set nocount on
declare @table varchar(50)
set @table = 'junk'

create table #t (columnName varchar(50), datetimecount int, intcount int, varcharCount int)

declare @col varchar(50)
       ,@i int
       ,@Count int
       ,@sql nvarchar(2000)
       ,@p nvarchar(2000)

select @i = min(ordinal_position) from information_schema.columns where table_name = @table
while @i is not null
begin
    select @col = column_name
    from information_schema.columns
    where table_name = @table and ordinal_position = @i

    --get datetime count
    set @sql = N'select @c = count(*) from ' + @table + ' where isDate([' + @col + ']) = 1'
    set @p = N'@c int output'
    exec sp_executesql @sql, @p, @c = @count output
    insert #t (ColumnName, datetimecount) values (@col, @count)

    --get int count
    select @sql = N'select @c = count(*) from ' + @table + ' where [' + @col + '] not like ''%[^0-9]%'''
          ,@p = N'@c int output'
    exec sp_executesql @sql, @p, @c = @count output
    update #t set intcount = @count where columnName = @col

    --get varchar count
    select @sql = N'select @c = count(*) from ' + @table + ' where [' + @col + '] like ''%[a-z]%'''
          ,@p = N'@c int output'
    exec sp_executesql @sql, @p, @c = @count output
    update #t set varcharcount = @count where columnName = @col

    select @i = min(ordinal_position)
    from information_schema.columns
    where table_name = @table and ordinal_position > @i
end

select @table [Table]
select * from #t
drop table #t
go
drop table junk

OUTPUT:
Table
--------------------------------------------------
junk

columnName  datetimecount  intcount  varcharCount
----------  -------------  --------  ------------
i           1              6         0

>>Maybe a better way to think of what my code does is that it tells you the most strongly typed column you can put the data into without losing any data.

I don't really like that idea. You should base the types on what the values represent, not on what the source data looks like.

Be One with the Optimizer
TG