Article: Parsing CSV Values Into Multiple Rows

Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

All Forums

Site Related Forums

Article Discussion

Article: Parsing CSV Values Into Multiple Rows

Author

Topic

AskSQLTeam
Ask SQLTeam Question

0 Posts

Posted - 2001-02-14 : 20:59:55

There have been quite a few requests on how to take a string of comma-separated values (CSV), and parse out the values individually. graz's article uses a stored procedure to extract each value into a variable. However, if you have a table with a column of CSV values, and you want to translate them into a normalized table, this technique becomes pretty hairy. You would need to use a cursor to loop through your CSV table and process each row. (A loop within another loop...I'm getting queasy)

Fortunately there's a way to do this using just a regular SELECT. It requires a table of numeric values, but it does the job beautifully; you can INSERT normalized CSV values into a table with one SQL statement!

Article Link.

AjarnMark
SQL Slashing Gunting Master

3246 Posts

Posted - 2002-08-19 : 20:24:26

Rob,

Thanks for a GREAT article! I have referred several people here in the past, but it wasn't until just today that I needed to use the technique myself. In fact, I wrapped some other code around it such that I was able to actually process two correlated CSV arrays and insert their values into a table such that the pairings were together in their own rows. Worked like a charm! Thanks again!

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-09-18 : 14:50:44

Hey Rob, I have spent the majority of the morning working on adapting this to a situation where the separator is variable. I'm stuck ...


declare @csv varchar(100), @sep varchar(5)
select @csv = 'this, , is, a, test', @sep = ', '
select 
    nullif(substring(@sep+@csv+@sep,n+len(@sep),charindex(@sep,@sep+@csv+@sep,n)-n-len(@sep)),'') as word
from
    numbers
where
    n<=len(@sep+@csv+@sep) and
    substring(@sep+@csv+@sep,n-1,1)=@sep and
    charindex(@sep,@sep+@csv+@sep,n)-n-len(@sep)>0

This works great, until I change the seperator!! Any advice?

Jay White
{0}

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-09-18 : 15:14:17

Does this fix it?

declare @csv varchar(100), @sep varchar(5)
select @csv = 'this, , is, a, test', @sep = ', '
select
nullif(substring(@sep+@csv+@sep,n+len(@sep), charindex(@sep,@sep+@csv+@sep,n)-n-len(@sep)),'') as word
from
numbers
where
n<=len(@sep+@csv+@sep) and
substring(@sep+@csv+@sep,n-1,len(@sep))=@sep and
charindex(@sep,@sep+@csv+@sep,n)-n-len(@sep)>0

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-09-18 : 15:27:10

quote:

LEN (T-SQL)
Returns the number of characters, rather than the number of bytes, of the given string expression, excluding trailing blanks.

...so...


declare @csv varchar(100), @sep varchar(5)
select @csv = 'thishellhellishellahelltest', @sep = 'hell'
select 
    nullif(substring(@sep+@csv+@sep,n,charindex(@sep,@sep+@csv+@sep,n)-n),'') as word
from
    numbers
where
    n<=len(@sep+@csv+@sep) and
    substring(@sep+@csv+@sep,n-datalength(@sep),datalength(@sep))=@sep and
    charindex(@sep,@sep+@csv+@sep,n)-n>0

I think the only thing that needs to change (from the original article) is the where filter that picks the n. The line the changes is red above.

Pardon me, while I go whittle my keyboard into the shape of a spike and fall on it ... datalength ... eff me

Jay White
{0}

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-09-18 : 15:42:01

Oooooooooooooops, yeah, that sounds like that might've been the problem.

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-09-20 : 12:54:13

I did some testing between Rob's method in this article and Graz's sp_parsearray proc, and got some interesting results that I though I might share ...

First off, I create a table of random length (up to 10 characters) strings. I then wrote a proc that uses this COALESCE method for building a csv string out of a variable number of random records from my strings table. I then converted my code snippet above (adapted from Rob's article) and Graz'a proc into UDFs for the testing.

For the testing, I ran each function in a while loop while @i<=800 and created the csv out of @i random elements from strings. I then passed that csv to my two parser functions and captured execution time metrics (simple set @t = getdate() before the execution and captured datediff(ms,@t,getdate()) after it) ... I ran through this loop several times to get a decent population.

First things first. On the average, Rob's method is a) a good bit faster and b) more consistant ....


method | Avg(ms/element) | Stdev
----------------------------------
Rob    | 0.87404         | 0.92330
Graz   | 2.05991         | 1.98419

One thing to note, though. The couple of outliers in the testing of Rob's method were really out there. For example, when testing a csv of 684 random elements (of random length), Graz's method scored {470,480,490,503,523} and Rob's scored {200,210,220,220,1240}. There are a handful of these cases (not enough to really effect STDEV, but interesting nontheless) ...

All tests run on a Dell laptop, XP Pro, SQL 2Ksp2, 1.7Gig P4 . . .
Scripts available on request ...

Jay White
{0}

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-09-20 : 13:29:17

If I had to guess, I'd say that the sequence table might cache differently from one execution to the next. Some of the CSVs you're parsing might require an extra data page or two to be retrieved...especially if that table spans more than one extent. You did clear the data cache between each run, right????

(DBCC DROPCLEANBUFFERS)

graz's code has a huge benefit because it works with scalar, singular values and doesn't require any I/O or caching beyond whatever the optimizer might do in creating work tables, and that's highly unlikely. There could also be subtle differences relating to PatIndex (graz) and CharIndex (mine) performance. Have you tried clearing the procedure cache between each run too?

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-09-20 : 13:51:20

The sequence table used is a real table, and not a temp table/table variable generated on each run ...

I'm running it again (just 3 iterations this time) clearing both the data and proc cache before each call to each function. I don't think it will change the results, but I'll post back in about 30 minutes ...

Jay White
{0}

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-09-20 : 14:10:13

quote:
The sequence table used is a real table, and not a temp table/table variable generated on each run ...

Wouldn't matter, it would depend entirely on how the table is physically stored. Especially if it spanned multiple extents, a set of values needed to process one run may exist on one page/extent, and another run may require rows on another page/extent not previously read into cache, and would need some extra I/O to retrieve it.

Again, that's just a guess, but it's the kind of thing that can make the same code and the same data run differently on different machines.

Page47
Master Smack Fu Yak Hacker

2878 Posts

Posted - 2002-09-20 : 15:01:00

Rob you were right ...

With clearing data and proc buffers between each run ...


method | Avg(ms/element) | Stdev
----------------------------------
Rob    | 2.47138         | 12.9338
Graz   | 1.45296         | 7.08893

I am curious as to how and why Graz's method improved ...

Jay White
{0}

robvolk
Most Valuable Yak

15732 Posts

Posted - 2002-09-20 : 15:21:34

The earlier, better results for my method benefited from data caching.

Possibly, graz's code was hampered earlier by an old(er) execution plan kept in cache; recompiling each time might've helped, but again I think data caching played the biggest role here. And yes, the convential wisdom that "recompiles are always bad" isn't always true.

And if you created a new set of data, physical storage differences could completely skew the results compared to the earlier run.

Don't you just HATE how that physical implementation shit keeps f---ing up a perfectly good theory?

BTW, you have reached the segment of the performance evaluation process I lovingly call the WTF??? Stage. I have JOYOUS (not!) memories of running the same friggin' code HUNDREDS of times and getting wildly different results each time. The only solution I found to be reliable was DBCC DROPCLEANBUFFERS and FREEPROCCACHE and FLUSHPROCINDB before each run. Since all of these were undocumented until I got Ken Henderson's

The Guru's Guide to Transact-SQL

you can understand a little better why I exalt Ken above all other SQL Server authors.

Without him I'd be an even bigger babbling drooling idiot than I am today (shut up Femenella!)

nguyen
Starting Member

8 Posts

Posted - 2003-08-26 : 11:54:36

I tinkered with the code and used derived tables, to come up with a slightly cleaner version. Here Quotes contains only one row. The derived table determines s (position of the starting comma) and e (position of the ending comma) for each word. The rows are filtered by where the ending comma position is greater than the starting comma position.

select
word=substring(phrase,s,(e-s-1))
from
(
SELECT id
,phrase
,s=CharIndex(',' ,','+Phrase+',', ID)
,e=CharIndex(',' ,','+Phrase+',', ID+1)
FROM Tally,Quotes
where CharIndex(',' ,','+Phrase+',', ID) < CharIndex(',' ,','+Phrase+',', ID+1)
) as a

AndrewGail
Starting Member

3 Posts

Posted - 2004-03-26 : 09:51:33

Hi,

This site has quickly become my favourite SQL related site, and for good reason. I've used your code found in this article to great effect before, but I've spent all day trying to rework it to achieve a slightly different goal.

I understand that putting things in context helps, so excuse the long winded post.

I've an old SQL Server database who's main table stores links and link titles (amoungst other things). The links/titles are stored as follows:

LINKS
-------
http://www.bbc.co.uk,http://www.sqlteam.com,http://www.google.com

LINKTITLES
------------
BBC Online,SQLTeam,Google

I'm upgrading the web applications that use these fields and at the same time I want to put these two fields into their own table:

Table: Links
--------------
UIN - int - {Unique Identifier}
Parent - int - {reference to UIN from main table}
Title - nvarchar - {the link title}
Link - nvarchar - {the link itself}

When I run your code against a single column (e.g. linktitles) it all works perfectly returning 1033 rows.

I then tried a modification of your code:


SELECT UIN, 
NullIf(SubString(',' + LINKTITLES + ',' , ID , CharIndex(',' , ',' + LINKTITLES + ',' , ID) - ID) , '') AS Title, NullIf(SubString(',' + LINKS + ',' , ID , CharIndex(',' , ',' + LINKS + ',' , ID) - ID) , '') AS Link
FROM #Tally, linksTemp
WHERE (ID <= Len(',' + LINKTITLES + ',') AND SubString(',' + LINKTITLES + ',' , ID - 1, 1) = ',') AND (ID <= Len(',' + LINKS + ',') AND SubString(',' + LINKS + ',' , ID - 1, 1) = ',')
AND CharIndex(',' , ',' + LINKTITLES + ',' , ID) - ID > 0
ORDER BY UIN

It works. But its only returning 630 rows. I'm only getting one link/title per UIN.

Could anyone shed some light on where I'm going wrong here?

Thanks,
Andrew

ESquared
Starting Member

4 Posts

Posted - 2005-08-29 : 00:06:32

Heck, I know the original article is more than four years old... My apologies!

I see a way to simplify some of the expressions, by removing unnecessary concatenation and arithmetic:

-- Include empty strings as nulls, for direct comparison
SELECT
Author,
Word = NullIf(SubString(Phrase, ID, CharIndex(',', Phrase + ',', ID) - ID), '')
FROM
Tally T
INNER JOIN Quotes Q ON T.ID <= Len(Q.Phrase) AND Substring(',' + Q.Phrase, T.ID, 1) = ','

--Don't include empty strings (changes in blue)
SELECT
Author,
Word = SubString(Phrase, ID, CharIndex(',', Phrase + ',', ID) - ID)
FROM
Tally T
INNER JOIN Quotes Q ON T.ID <= Len(Q.Phrase) AND Substring(',' + Q.Phrase, T.ID, 2) LIKE ',[^,]'

ESquared
Starting Member

4 Posts

Posted - 2005-08-29 : 00:34:47

CREATE TABLE Quotes (Author varchar(100), Phrase varchar(7000))
INSERT Quotes SELECT 'Shakespeare', 'A,rose,by,any,other,name,smells,just,as,sweet'
UNION SELECT 'Kipling', 'Across,the,valley,of,death,rode,the,six,hundred'
UNION SELECT 'Coleridge', 'In,Xanadu,did,Kubla,Khan,...,,,,damn,I,forgot,the,rest,of,it'
UNION SELECT 'Descartes', 'I,think,therefore,I,am'
UNION SELECT 'Volk', 'I,think,therefore,I,need,another,beer'
UNION SELECT 'Feldman', 'No,it''s,pronounced,I,gor'
UNION SELECT 'Simpson', 'Mmmmmm,donuts'
UNION SELECT 'Fudd', 'Be,vewwy,vewwy,quiet,I''m,hunting,wabbits'

ESquared
Starting Member

4 Posts

Posted - 2005-09-09 : 17:53:26

CREATE TABLE Tally (ID int identity(1,1))

INSERT Tally DEFAULT VALUES
WHILE SCOPE_IDENTITY() < 8000 INSERT Tally DEFAULT VALUES

ashish_khanolkar
Starting Member

1 Post

Posted - 2006-10-03 : 18:57:23

This solution for 'Parsing CSV values into Multiple rows is excellent but I want to use this in SYBASE.

The only problem is CHARINDEX.

In SYBASE the CHARINDEX syntax is CHARINDEX(expression1, expression2).

How can I use this query in SYBASE?

bryonpierce
Starting Member

1 Post

Posted - 2008-04-18 : 13:16:16

--I loved this solution but here is a one executable script to see
--this in action using table variable...have fun :0

DECLARE @Quotes TABLE (Author varchar(100), Phrase varchar(7000))
INSERT INTO @Quotes SELECT 'Shakespeare', 'A,rose,by,any,other,name,smells,just,as,sweet'
UNION SELECT 'Kipling', 'Across,the,valley,of,death,rode,the,six,hundred'
UNION SELECT 'Coleridge', 'In,Xanadu,did,Kubla,Khan,...,,,,damn,I,forgot,the,rest,of,it'
UNION SELECT 'Descartes', 'I,think,therefore,I,am'
UNION SELECT 'Volk', 'I,think,therefore,I,need,another,beer'
UNION SELECT 'Feldman', 'No,it''s,pronounced,I,gor'
UNION SELECT 'Simpson', 'Mmmmmm,donuts'
UNION SELECT 'Fudd', 'Be,vewwy,vewwy,quiet,I''m,hunting,wabbits'

DECLARE @Tally TABLE (ID int identity(1,1))
INSERT @Tally DEFAULT VALUES
WHILE SCOPE_IDENTITY() < 8000 INSERT @Tally DEFAULT VALUES

DECLARE @OnlyWords TABLE (Author varchar(100), WORD varchar (100),Phrase varchar(7000))
INSERT INTO @OnlyWords SELECT Author,
NullIf(SubString(',' + Phrase + ',' , ID , CharIndex(',' , ',' + Phrase + ',' , ID) - ID) , '') AS Word ,
Phrase
FROM @Tally, @Quotes
WHERE ID <= Len(',' + Phrase + ',') AND SubString(',' + Phrase + ',' , ID - 1, 1) = ','

--select top 2 * from @Quotes
--select top 3 * from @Tally
select * from @OnlyWords
order by Author,Word,Phrase

Bryon Pierce
bryonpierce at geezmailz

kamalee
Starting Member

1 Post

Posted - 2009-03-10 : 12:46:10

Hi,

i implemented the code exactly as you show on your example, but the output only present a single value...

create table ParseWords1 (ID integer primary key, Location varchar(50))

insert into parsewords1 (ID, Location) values (1,'Toronto,Ontorio,Canada')
insert into parsewords1 (ID, Location) values (2,'Boston,Massachusetts,USA')
insert into parsewords1 (ID, Location) values (3,'Vancouver,British Columbia,Canada')

SELECT
NullIf(SubString(',' + Location + ',' , ID , CharIndex(',' , ',' + Location + ',' , ID) - ID) , '') AS Localidades
FROM parsewords1
WHERE ID <= Len(',' + Location + ',') AND SubString(',' + Location + ',' , ID - 1, 1) = ','
AND CharIndex(',' , ',' + Location + ',' , ID) - ID > 0

the output:

| Localidades
1| Boston

i'm using sql server 2005

kamalee

psycotech
Starting Member

5 Posts

Posted - 2009-04-27 : 01:54:24

Here is the nastiest SQL code i have ever Writen.
It takes your CSV file as a parameter and you can well insert the outcome to wherever you like.

I post this to perhaps save someone else the trouble of writing it.
If anyone who needs it sees any potential improvement, all is welcome.
Note some of the other solutions don't take into account double quotes and commas in the fields, this one does.

CREATE TABLE #TEST(
Line int,
The varchar(255))

DECLARE @err int
INSERT INTO #TEST

/* Execute dbo.Mailing_CSV_Read @CSV the file you want, 'Table Columns you need from csv'*/
EXEC @err = dbo.Mailing_CSV_Read 'CSVFILE Goes HERE', 'Line,The'

SELECT * FROM #TEST

CREATE PROCEDURE dbo.Mailing_CSV_Read
@CSV varchar(max),
@ColumnsReturn varchar(max)
AS

CREATE TABLE #Temp(
Line int,
Col int,
Val varchar(255))

DECLARE @Cols varchar(max)
DECLARE @Row varchar(max)
DECLARE @Col varchar(255)
DECLARE @RowStart int
DECLARE @RowEnd int
DECLARE @ColStart int
DECLARE @ColEnd int
DECLARE @Len int
DECLARE @LineINDX int
DECLARE @ColINDX int

/* Remove this sample data*/
SET @CSV = 'The,f,l,p,c,add,add
Ms ,Katrina,Rendell,,Somersault Group,"Level 2, 93 Pacific Highway ",
Mr ,Paul,Bird,,"Three, "", and the "" Plus",3/40 Querrin Street ,
Ms.,Dorothy ,Illing,,Three Plus,3/40 Querrin Street ,
Prof,Sandra ,Harding,Vice Chancellor & President,James Cook University ,,
Emeritus Prof,Colin ,Power AM,,Eidos Institute,4 Carnegie Street ,
Prof,Neil,Dempster,Prof in Education,Faculty of Education,Mount Gravatt Campus ,Griffith University
Ms.,Elizabeth,Jameson,Director,Board Matters ,P O Box 377 ,
Dr.,Cherrell ,Hirst A.O.,Acting CEO ,QBF,P O Box 2076 ,
Mr,John B,Reid AO ,School of Strategy & Entrepreneurship,Australian School of Business ,Level 5 Bldg E12,University of NSW
Mr,Christopher ,Thorn,Principal,Goldman Sachs JB Were ,Philanthropic Services ,GPO Box 2050
Mr,Jeremy,Madin,Headmaster,Cranbrook School,5 Victoria Road ,'

SET @RowEnd = 1
SET @RowStart = 0
SET @LineINDX = 0

WHILE @RowEnd > 0
BEGIN
SET @RowEnd = CHARINDEX(CHAR(13), @CSV, @RowStart)
SET @Len = @RowEnd - @RowStart
IF @RowEnd < 1 SET @Len = 99999
SET @Row = SUBSTRING(@CSV, @RowStart, @Len)

SET @ColEnd = 1
SET @ColStart = 0
SET @ColINDX = 0
WHILE @ColEnd > 0
BEGIN
SET @ColEnd = CHARINDEX(',', @Row, @ColStart)
SET @Len = @ColEnd - @ColStart
IF @ColEnd < 1 SET @Len = 99999
SET @Col = SUBSTRING(@Row, @ColStart, @Len)
IF LEFT(@Col,1) = '"'
BEGIN
SET @ColEnd = CHARINDEX('"', @Row, @ColStart + 1)
WHILE (SUBSTRING(@Row, @ColEnd + 1, 1) = '"')
SET @ColEnd = CHARINDEX('"', @Row, @ColEnd + 2)
SET @Len = @ColEnd - @ColStart - 2
IF @ColEnd < 1 SET @Len = 99999
SET @Col = REPLACE(SUBSTRING(@Row, @ColStart + 1, @Len), '""', '"')
END
INSERT INTO #Temp
VALUES (@LineINDX, @ColINDX, @Col)
SET @ColStart = @ColEnd + 1
SET @ColINDX = @ColINDX + 1
END
SET @RowStart = @RowEnd + 2
SET @LineINDX = @LineINDX + 1
END

DECLARE @columns varchar(max)
DECLARE @SelectColumns varchar(max)

SET @ColEnd = 1
SET @ColStart = 0
SET @ColINDX = 0
WHILE @ColEnd > 0
BEGIN
SET @ColEnd = CHARINDEX(',', @ColumnsReturn, @ColStart)
SET @Len = @ColEnd - @ColStart
IF @ColEnd < 1 SET @Len = 99999
SET @Col = SUBSTRING(@ColumnsReturn, @ColStart, @Len)
IF (@Col = 'Line')
SET @SelectColumns = COALESCE(@SelectColumns + ',[Line]', '[Line]')
ELSE BEGIN
SELECT @ColINDX = Col
FROM #Temp
WHERE Line = 0
AND Val = @Col
IF (@Col is NULL)
RETURN 0
SET @SelectColumns = COALESCE(@SelectColumns + ',[' + cast(@ColINDX as varchar) + '] as [' + @Col + ']',
'[' + cast(@ColINDX as varchar)+ '] as [' + @Col + ']')
SET @columns = COALESCE(@columns + ',[' + cast(@ColINDX as varchar) + ']', '[' + cast(@ColINDX as varchar)+ ']')
END
SET @ColStart = @ColEnd + 1
END

DELETE FROM #Temp
WHERE Line = 0

DECLARE @Query VARCHAR(max)

SET @Query = '
SELECT ' + @SelectColumns +
'FROM #temp
PIVOT
(
MIN(Val)
FOR [Col]
IN (' + @columns + ')
)
AS p'

EXEC(@Query)

Next Page

Subscribe to SQLTeam.com

SQLTeam.com Articles via RSS

SQLTeam.com Weblog via RSS

- Advertisement -

Resources