 How to extract data from HTML file?

Sun Foster
Aged Yak Warrior

515 Posts

Posted - 2007-09-26 : 06:32:39
I need to get data out of a table in which one column stores an entire HTML file, including all tags such as <html>, <body>, and so on.
How to extract data from HTML file?

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 2007-09-26 : 07:11:14
See http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=89973



E 12°55'05.25"
N 56°04'39.16"

delly
Starting Member

1 Post

Posted - 2007-09-26 : 07:22:00
I had the same problem and found this code, which creates a function that strips out the HTML tags. I hope it does what you want.


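-- Use ALTER FUNCTION instead if the function already exists
CREATE FUNCTION [dbo].[fn_StripHTMLTags]
(@Dirty varchar(4000))
RETURNS varchar(4000)
AS
BEGIN
    DECLARE @Start int,
            @End int,
            @Length int

    -- Repeat while a '<' is still followed somewhere by a '>'
    WHILE CHARINDEX('<', @Dirty) > 0 AND CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty)) > 0
    BEGIN
        -- Locate the first tag: from the first '<' to the next '>'
        SELECT @Start = CHARINDEX('<', @Dirty),
               @End = CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty))
        SELECT @Length = (@End - @Start) + 1

        -- Cut the tag out, keeping the text on either side of it
        IF @Length > 0
        BEGIN
            SELECT @Dirty = STUFF(@Dirty, @Start, @Length, '')
        END
    END

    RETURN @Dirty
END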
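
A short usage sketch for the function above. The table dbo.Pages and its columns PageID and HtmlBody are hypothetical names chosen for illustration; substitute your own. Because the parameter is varchar(4000), the explicit CAST silently truncates longer documents, so on SQL Server 2005 you may prefer to change the function to varchar(max).

```sql
-- Quick check against a literal string
SELECT dbo.fn_StripHTMLTags('<html><body><p>Hello <b>world</b></p></body></html>') AS CleanText;
-- returns: Hello world

-- Applied to a column (dbo.Pages, PageID and HtmlBody are assumed names)
SELECT PageID,
       dbo.fn_StripHTMLTags(CAST(HtmlBody AS varchar(4000))) AS CleanText
FROM dbo.Pages;
```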


Kristen
Test

22859 Posts

Posted - 2007-09-26 : 07:42:15
I think you need to do this with some sort of client-side process.

Perl has an HTML object library which will parse the HTML and present the tags as a hierarchical array.

Any sort of pattern matching for <xxx>yyy</xxx> is going to be patchy at best unless the HTML is very clean (i.e. it NEVER has any errors caused by sloppy developers using only IE to test their HTML), and it will probably struggle with any HTML that is not pretty simple, too.
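
To make "patchy" concrete, here are a few inputs where simple '<' ... '>' matching goes wrong. This is only an illustration using the fn_StripHTMLTags function posted earlier in this thread; the sample strings are made up.

```sql
-- A literal '<' in the text is treated as the start of a tag,
-- so everything up to the next '>' is lost.
SELECT dbo.fn_StripHTMLTags('<p>if a < b then</p>') AS Result;
-- returns: 'if a ' (the rest of the sentence is gone)

-- Only the tags are removed, so script (or style) content stays in the output.
SELECT dbo.fn_StripHTMLTags('<script>var x = 1;</script>Hi') AS Result;
-- returns: 'var x = 1;Hi'

-- Character entities are not decoded.
SELECT dbo.fn_StripHTMLTags('<p>Fish &amp; chips</p>') AS Result;
-- returns: 'Fish &amp; chips'
```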

But delly's function is definitely worth a look. You could perhaps strip everything before the appropriate <TABLE> tag (hopefully it has an ID or NAME that is unique) and everything after the </TABLE>, then change all the </TD> to TAB, or somesuch, and remove all the <TD>. Then similarly change all </TR> to CR+LF and remove all the <TR>, and you might have a delimited list, assuming there are no other embedded TABs and CR/LFs. You could REPLACE those first to get rid of them.
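
The steps above could be sketched roughly as follows. This is a minimal sketch, not a robust parser: the sample HTML and the id "prices" are invented for illustration, it relies on delly's fn_StripHTMLTags, it assumes a case-insensitive collation so that '</td>' also matches '</TD>', and it will pick the wrong </table> if tables are nested.

```sql
-- Hypothetical sample; in practice @Html would be read from the column holding the page.
DECLARE @Html varchar(4000), @Start int, @End int;
SET @Html = '<html><body><h1>Report</h1>'
          + '<table id="prices"><tr><td>Apple</td><td>1.20</td></tr>'
          + '<tr><td>Pear</td><td>0.95</td></tr></table>'
          + '<p>Footer</p></body></html>';

-- 1. Remove embedded TABs and CR/LFs first so they cannot be mistaken for delimiters.
SET @Html = REPLACE(REPLACE(REPLACE(@Html, CHAR(9), ' '), CHAR(13), ' '), CHAR(10), ' ');

-- 2. Find the chosen <table ...> tag and the first </table> after it.
SET @Start = CHARINDEX('<table id="prices"', @Html);
SET @End   = CHARINDEX('</table>', @Html, @Start);

IF @Start > 0 AND @End > 0
BEGIN
    -- Keep only the table, dropping everything before it and after it.
    SET @Html = SUBSTRING(@Html, @Start, @End - @Start);

    -- 3. Turn cell ends into TABs and row ends into CR+LF.
    SET @Html = REPLACE(@Html, '</td>', CHAR(9));
    SET @Html = REPLACE(@Html, '</tr>', CHAR(13) + CHAR(10));

    -- 4. Remove the remaining tags (<table>, <tr>, <td>, ...).
    SET @Html = dbo.fn_StripHTMLTags(@Html);
END

SELECT @Html AS DelimitedText;
```

Each row in the result ends with a trailing TAB, because every </td> (including the last one in a row) becomes a delimiter; trim it if your import step objects. If the table is not found, the text is returned with only the whitespace cleanup applied.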

Kristen