malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-18 : 18:07:25
What I'm trying to do is develop a process for importing large amounts of data from various sources (flat files, Excel, etc.) into staging SQL tables where all of the columns will be varchars, and then to profile that data and determine what type of data is in each column, i.e. is the data numeric, dates, strings, etc.

I know how to do this on a small scale, but I'm not sure that any of my approaches will scale to the large data sets we will be working with, so is there some kind of standard process or solution for this?

For example, if we import a comma-delimited file with headers and just have SQL build the table automatically with everything as varchar(1000), and the data set has 100 columns and 5 million rows, is there a way to quickly determine the type of data in each column? For example: column one contains all values that can be converted to int, column two contains chars so will be varchar with a max length of 53, column three contains all values that can be converted to datetime, column four contains decimal values, etc., etc.?
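[Editor's note: on SQL Server 2012 and later, TRY_CONVERT returns NULL instead of raising an error on a failed conversion, so this kind of per-column profiling can be done in a single scan with no TRY/CATCH tricks. A minimal sketch of that idea — the staging table and column name here are hypothetical:]

```sql
-- Profile one varchar staging column in a single pass (SQL Server 2012+).
-- TRY_CONVERT yields NULL on failure, so COUNT() over it counts only
-- the values that successfully convert to each candidate type.
SELECT
    COUNT(*)                                 AS total_rows,
    COUNT(TRY_CONVERT(int, col1))            AS int_ok,
    COUNT(TRY_CONVERT(datetime, col1))       AS datetime_ok,
    COUNT(TRY_CONVERT(decimal(10,2), col1))  AS decimal_ok,
    MAX(LEN(col1))                           AS max_len
FROM staging;
```

If int_ok equals total_rows, every value in the column converts cleanly to int; comparing the counts across types gives the "tightest" type the column supports without losing data.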
russell (Pyro-ma-ni-yak, 5072 Posts)
Posted - 2009-08-18 : 18:12:20
This is the wrong approach altogether. Strongly type everything. If the problem is disparate data sources, create a schema for clients to conform to... and publish it and enforce it.
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-18 : 18:28:09
quote: Originally posted by russell
    This is the wrong approach altogether. Strongly type everything. If the problem is disparate data sources, create a schema for clients to conform to... and publish it and enforce it.

Well, that's a nonstarter. Of course that is the ideal approach, but that's not the reality. The reality is we are dealing with flat files exported from old mainframe systems that had little or no validation on the data, and we are tasked with converting the data into new schemas.

If the clients had strongly typed data in the first place then there wouldn't be a job for us to do :p
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-19 : 13:19:32
Well, this is what I have ended up with. It will only be used on a column-by-column basis. I still have some issues with it though:

CREATE PROCEDURE sp_GetColumnType
    @ColumnName varchar(50),
    @Table varchar(50),
    @Type varchar(10) OUTPUT
WITH RECOMPILE
AS
DECLARE @SQL varchar(max)
SET @SQL = 'CREATE PROCEDURE #sp_GetColType @Result varchar(10) OUTPUT AS
BEGIN TRY
    SELECT CAST(' + @ColumnName + ' AS datetime) FROM ' + @Table + '
    SELECT @Result = ''DateTime''
END TRY
BEGIN CATCH
    BEGIN TRY
        SELECT CAST(' + @ColumnName + ' AS int) FROM ' + @Table + '
        SELECT @Result = ''Int''
    END TRY
    BEGIN CATCH
        BEGIN TRY
            SELECT CAST(' + @ColumnName + ' AS float) FROM ' + @Table + '
            SELECT @Result = ''Float''
        END TRY
        BEGIN CATCH
            SELECT @Result = ''VarChar''
        END CATCH
    END CATCH
END CATCH
RETURN'
SET NOCOUNT ON
EXEC(@SQL)
EXEC #sp_GetColType @Result = @Type OUTPUT
DROP PROC #sp_GetColType
SET NOCOUNT OFF
RETURN

The main problem is that this returns the rows from the SELECT statements, but I thought there was a way within an SP to turn off return of the result sets... It's not pretty, but it was the only thing I could come up with.
Vinnie881 (Master Smack Fu Yak Hacker, 1231 Posts)
Posted - 2009-08-19 : 14:03:24
Malachi, I would agree with russell. Take advantage of the SQL SELECT ... INTO method and strongly type:

select convert(datetime, mycol1) as MyDateCol1,
       convert(int, mycol2) as MyIntCol2,
       ...
into MyNewSqlTable
from ...

Any other way you are going to have a really tough time working with.

Success is 10% Intelligence, 70% Determination, and 22% Stupidity.
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-19 : 14:56:56
I think you guys are missing the question. Correct me if I'm wrong, malachi151, but you will eventually transform all this data to strongly typed permanent tables, right? But you want to develop a process that will "discover" what those types should be based on the data in staging tables?

I actually think it makes sense to stage the data into non-typed tables so that the import is very likely to succeed. Deal with dupes, referential integrity, business rules (and even types) after you have the data in SQL. The alternative would be to perform those functions prior to getting the data into SQL using a different technology.

Be One with the Optimizer
TG
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-19 : 15:07:13
I don't think that staging/transformation could work without intervention. There could be dates, for instance, stored in wacky formats that could convert to int, decimal, float, binary, etc., but unless you know the context of those values you may not be able to tell they are dates. The actual header text may give you a clue, but you couldn't rely on that. Scale and precision could be tricky to figure out as well. The actual values may not be representative of what "thing" they represent.

Be One with the Optimizer
TG
Vinnie881 (Master Smack Fu Yak Hacker, 1231 Posts)
Posted - 2009-08-19 : 15:59:14
I wasn't disputing staging the data; I was disputing using a function/procedure to determine the type of data.

Success is 10% Intelligence, 70% Determination, and 22% Stupidity.
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-19 : 16:40:57
quote: Originally posted by TG
    I think you guys are missing the question. Correct me if I'm wrong, malachi151, but you will eventually transform all this data to strongly typed permanent tables, right? But you want to develop a process that will "discover" what those types should be based on the data in staging tables? I think it makes sense actually to stage the data into non-typed tables so that the import is very likely to succeed.

Correct. This is all going into strongly typed tables eventually; this is just the first stage of mapping and discovery. We already know that much of the data is bad. Trying to convert it directly will fail every time.

We have a defined schema that we are moving the data to. What we want to see is where the existing data won't directly convert over. For example, if we have an address field that is defined as nvarchar(50), but we see that in their data they have addresses as long as 62 chars; or if we have a field that is an integer, but the field that maps to it has chars in it, like an ID number column; or a field that we have as decimal(4,2), but there are values that have more significant digits; or if they have invalid dates (some sources just used text fields to collect date info, etc.).

So, doing a direct convert isn't even a possibility.

The process we use is to move everything into tables where every column is a varchar(1000), then profile the data there, do mapping to our schema, and clean up the data, much of which gets either normalized or denormalized in the process, and much of which gets converted into derived fields using formulas to derive new values from the supplied data (everything from mathematical derivations to stuff like breaking whole names into first, middle, last, etc.).

What we want to be able to easily see are the discrepancies between the source data and the target schema, i.e. XYZ source column is mapped to ABC destination column, which is an integer type, but XYZ column contains character data, etc.

What I have now basically works; I was just wondering if there was some other, possibly better, way to do it.
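[Editor's note: the discrepancy checks described in this post can be expressed as plain aggregate queries against the staging table. A sketch of that idea — the staging table and column names here are hypothetical, and the LIKE patterns are deliberately simple first-pass tests:]

```sql
-- Check 1: values too long for a target nvarchar(50) address column.
-- Anything returned over 50 needs attention before conversion.
SELECT MAX(LEN(address)) AS max_address_len
FROM staging_source;

-- Check 2: values in a column mapped to an integer ID that contain
-- non-digit characters (LIKE with a negated character class).
SELECT id_number
FROM staging_source
WHERE id_number LIKE '%[^0-9]%';

-- Check 3: values mapped to decimal(4,2) with too many digits overall,
-- or with more than 2 digits after the decimal point.
SELECT amount
FROM staging_source
WHERE LEN(REPLACE(amount, '.', '')) > 4
   OR CHARINDEX('.', REVERSE(amount)) > 3;
```

Each check returns the offending values rather than just a type guess, which is what is needed to decide whether to widen the target column or clean the source data.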
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-19 : 17:31:24
Couple observations:

What you've just described sounds different than what the SP posted above does. Sounds like you want to check the data (column by column, table by table, row by row) for any values that either won't convert to the target type, or where you may lose some precision or be truncated. So if your target type was, say, decimal(10,2) but one of your values was 2.345, then you want to see that value to decide if you need to change your precision or just allow it to be rounded to 2 decimals. Whereas the SP above would just return its best guess at a datatype based on all the values in the table. Then if your resulting type was different than the target table type you would investigate further?

As written, that SP's output will only be based off of the last value it encountered. In other words, in a 5-million-row table, @Result will be set 5 million times and only the result from the last value evaluated will be returned - so that doesn't sound like a good solution.

Be One with the Optimizer
TG
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-20 : 09:30:48
quote: Originally posted by TG
    As written, that SP's output will only be based off of the last value it encountered. In other words, in a 5-million-row table, @Result will be set 5 million times and only the result from the last value evaluated will be returned - so that doesn't sound like a good solution.

It shouldn't be based off the last value; it should be based on any value encountered. If it's a column with date values, but some are wrong, once it hits a wrong value an exception will be thrown, going to the CATCH; it will try and fail to convert to int, then try and fail to convert to float, then it will report back a varchar.

I could test it out on a column with invalid values but a valid last row to see if what you are saying is correct, but that isn't how I understood it to work, and the one I posted is just a preliminary test one that isn't as detailed on the types, etc.
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-20 : 10:51:12
I doubt this is the behavior you want. In this case I called the (test) SP you posted above with a column of INTs with one datetime thrown in there.

create table junk (i varchar(500))
insert junk
select '2' union all
select '2' union all
select '2' union all
select '2' union all
select '2009-12-31' union all
select '2' union all
select '2'

declare @t varchar(10)
exec sp_GetColumnType
    @ColumnName = 'i',
    @Table = 'junk',
    @Type = @t OUTPUT
select @t
drop table junk

OUTPUT:
----------------------------------
2
2
2
2
----------------------
2
2
2
2
----------
VarChar

I also don't think it is necessary to create, call, and drop an SP inside an SP. There are better ways, but the question is: what output would you like to see for this example? What output would you like to see for a table with, say, 200 dates, 200 ints, and 200 varchars?

Be One with the Optimizer
TG
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-20 : 11:09:35
Perhaps what you're looking for is an analysis of value types by column for a table, i.e. something like:

[<TableName>]
ColumnName  datetimeCount  IntCount  moneyCount  varcharCount
Col1        0              0         500000      10
Col2        500010         0         0           0
Col3        0              0         0           500010

What I usually do on large data projects is create the staging table(s) like you have, but then write big ole transformation scripts where, table by table, column by column, I do the transformations. As I discover outliers, poorly formed values, orphaned rows, etc., I script in the appropriate fixes as I go. It is time consuming but, I believe, necessary. If I need to save off data that I couldn't fix for some product analysis and later integration, then so be it.

Be One with the Optimizer
TG
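[Editor's note: the per-column counts TG describes can also be gathered in one scan of the table with conditional aggregation, instead of one dynamic query per candidate type. A sketch using the `junk` test table from earlier in the thread — note that ISDATE and the LIKE patterns are loose first-pass tests, not final type decisions:]

```sql
-- One scan of the staging table produces all the counts for one column.
-- ISDATE is locale-sensitive and the digit test rejects signs and
-- decimal points, so treat these as a rough profile.
SELECT
    'i'                                                 AS ColumnName,
    SUM(CASE WHEN ISDATE(i) = 1 THEN 1 ELSE 0 END)      AS datetimeCount,
    SUM(CASE WHEN i NOT LIKE '%[^0-9]%'
              AND i <> '' THEN 1 ELSE 0 END)            AS intCount,
    SUM(CASE WHEN i LIKE '%[a-z]%' THEN 1 ELSE 0 END)   AS varcharCount
FROM junk;
```

On a 100-column, 5-million-row staging table this still means one full scan per table rather than three dynamic queries per column, which matters at the scale described in the original question.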
malachi151 (Posting Yak Master, 152 Posts)
Posted - 2009-08-20 : 11:37:24
Yep, VarChar is what I expect to see in the example given.

This isn't for detailed analysis; it's for a first pass. Lets us see immediately which columns match up okay and which ones don't.

If there is a way to do this without creating the temp SP, I'm all for it :)

What I really want to do is turn off the result set so that the rows from the SELECTs are not returned at all. I thought there was a way to do that, but I dunno.

Your later suggestion is good too; my concern is just that it would take too long. How would a result like that be produced, with a cursor? I suppose we could let something like that run overnight...

Edit: Maybe a better way to think of what my code does is that it tells you the most strongly typed column you can put the data into without losing any data.
TG (Master Smack Fu Yak Hacker, 6065 Posts)
Posted - 2009-08-20 : 13:45:45
>>Yep, VarChar is what I expect to see in the example given.

Not sure why, if there are only datetimes and ints in the sample table?

>>If there is a way to do this without creating the temp SP, I'm all for it :)

Here is one way that just uses your existing logic:

use tempdb
if object_id('sp_GetColumnType') > 0
    drop proc sp_GetColumnType
go
CREATE PROCEDURE sp_GetColumnType
    @ColumnName varchar(50),
    @Table varchar(50),
    @Type varchar(10) OUTPUT
AS
DECLARE @SQL nvarchar(max)
       ,@p nvarchar(max)
       ,@Result varchar(10)

BEGIN TRY
    set @sql = N'select @result = ''datetime'' from ' + @table
             + ' where convert(datetime, ' + @columnName + ') = convert(datetime, ' + @columnName + ')'
    set @p = N'@result varchar(10) output'
    exec sp_executesql @sql, @p, @result = @result output
END TRY
BEGIN CATCH
    BEGIN TRY
        set @sql = N'select @result = ''int'' from ' + @table
                 + ' where convert(int, ' + @columnName + ') = convert(int, ' + @columnName + ')'
        set @p = N'@result varchar(10) output'
        exec sp_executesql @sql, @p, @result = @result output
    END TRY
    BEGIN CATCH
        BEGIN TRY
            set @sql = N'select @result = ''float'' from ' + @table
                     + ' where convert(float, ' + @columnName + ') = convert(float, ' + @columnName + ')'
            set @p = N'@result varchar(10) output'
            exec sp_executesql @sql, @p, @result = @result output
        END TRY
        BEGIN CATCH
            set @result = 'varchar'
        END CATCH
    END CATCH
END CATCH
set @type = @result
go

create table junk (i varchar(500))
set nocount on
insert junk
select '2' union all
select '2' union all
select '2' union all
select '2' union all
select '2009-12-31' union all
select '2' union all
select '2'
set nocount off

declare @t varchar(10)
exec sp_GetColumnType
    @ColumnName = 'i',
    @Table = 'junk',
    @Type = @t OUTPUT
select @t [Type]
drop table junk
drop proc sp_GetColumnType

OUTPUT:
Type
----------
varchar

>>Your later suggestion is good too, my concern is just that it would take too long.
>>How would a result like that be produced, with a cursor? I suppose we could let something like that run overnight...

Here is one way to do that:

EDIT: You obviously would need to beef up the logic to determine the types, but this is the idea. END_EDIT

create table junk (i varchar(500))
set nocount on
insert junk
select '2' union all
select '2' union all
select '2' union all
select '2' union all
select '2009-12-31' union all
select '2' union all
select '2'
set nocount off
go
set nocount on
declare @table varchar(50)
set @table = 'junk'

create table #t (columnName varchar(50), datetimecount int, intcount int, varcharCount int)

declare @col varchar(50)
       ,@i int
       ,@Count int
       ,@sql nvarchar(2000)
       ,@p nvarchar(2000)

select @i = min(ordinal_position) from information_schema.columns where table_name = @table
while @i is not null
begin
    select @col = column_name
    from information_schema.columns
    where table_name = @table and ordinal_position = @i

    --get datetime count
    set @sql = N'select @c = count(*) from ' + @table + ' where isDate([' + @col + ']) = 1'
    set @p = N'@c int output'
    exec sp_executesql @sql, @p, @c = @count output
    insert #t (ColumnName, datetimecount) values (@col, @count)

    --get int count
    select @sql = N'select @c = count(*) from ' + @table + ' where [' + @col + '] not like ''%[^0-9]%'''
          ,@p = N'@c int output'
    exec sp_executesql @sql, @p, @c = @count output
    update #t set intcount = @count where columnName = @col

    --get varchar count
    select @sql = N'select @c = count(*) from ' + @table + ' where [' + @col + '] like ''%[a-z]%'''
          ,@p = N'@c int output'
    exec sp_executesql @sql, @p, @c = @count output
    update #t set varcharcount = @count where columnName = @col

    select @i = min(ordinal_position)
    from information_schema.columns
    where table_name = @table and ordinal_position > @i
end

select @table [Table]
select * from #t
drop table #t
go
drop table junk

OUTPUT:
Table
--------------------------------------------------
junk

columnName  datetimecount  intcount  varcharCount
----------  -------------  --------  ------------
i           1              6         0

>>Maybe a better way to think of what my code does is that it tells you the most strongly typed column you can put the data into without losing any data.

I don't really like that idea. You should base the types on what the values represent, not on what the source data looks like.

Be One with the Optimizer
TG