SQL Server Forums

 How to extract data from HTML file?

Sun Foster
Aged Yak Warrior

515 Posts

Posted - 09/26/2007 :  06:32:39
I need data from a table in which one column stores an entire HTML file, including all tags such as <html>, <body>...
How can I extract the data from the HTML?

SwePeso
Patron Saint of Lost Yaks

Sweden
30265 Posts

Posted - 09/26/2007 :  07:11:14
See http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=89973



E 12°55'05.25"
N 56°04'39.16"

delly
Starting Member

1 Posts

Posted - 09/26/2007 :  07:22:00
I had the same problem, and found this code, which creates a function that strips out the HTML tags. Hope it does what you want.


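-- Use ALTER FUNCTION instead if the function already exists
CREATE FUNCTION [dbo].[fn_StripHTMLTags]
(@Dirty varchar(4000))
RETURNS varchar(4000)
AS
BEGIN
    DECLARE @Start int,
            @End int,
            @Length int

    -- Loop while the string still holds a '<' followed somewhere by a '>'
    WHILE CHARINDEX('<', @Dirty) > 0 AND CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty)) > 0
    BEGIN
        -- Locate the first '<' and the first '>' after it
        SELECT @Start = CHARINDEX('<', @Dirty),
               @End = CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty))
        SELECT @Length = (@End - @Start) + 1
        IF @Length > 0
        BEGIN
            -- Cut the tag out, including both angle brackets
            SELECT @Dirty = STUFF(@Dirty, @Start, @Length, '')
        END
    END

    RETURN @Dirty
END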
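
For what it's worth, calling the function might look something like the sketch below. The table dbo.WebPages and its columns PageID and HtmlBody are made-up names for illustration, not anything from the original question. Note that the parameter is varchar(4000), so longer values get cut off on the way in, and that the function only removes the tags themselves: entities such as &amp; are left as they are, and the text inside <script> or <style> blocks stays behind.

```sql
-- Hypothetical table and column names, for illustration only.
SELECT  p.PageID,
        dbo.fn_StripHTMLTags(p.HtmlBody) AS PlainText
FROM    dbo.WebPages AS p;

-- Quick check against a literal value:
SELECT dbo.fn_StripHTMLTags('<html><body><p>Hello <b>world</b></p></body></html>') AS PlainText;
-- returns: Hello world
```

A stray '<' in the text itself (say "a < b") will be treated as the start of a tag, so everything up to the next '>' disappears with it.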


Kristen
Test

United Kingdom
22415 Posts

Posted - 09/26/2007 :  07:42:15
I think you need to do this with some sort of client-side process.

Perl has an HTML object library which will parse the HTML and present the tags as a hierarchical array.

Any sort of pattern matching for <xxx>yyy</xxx> is going to be patchy at best unless the HTML is very clean (i.e. it NEVER has any errors caused by sloppy developers using only IE to test their HTML), and it will probably struggle with any HTML that is not pretty simple, too.

But delly's function is definitely worth a look. You could perhaps strip everything before the appropriate <TABLE> tag (hopefully it has an ID or NAME that is unique) and everything after its </TABLE>. Then change all the </TD> to TAB, or some such, and remove all the <TD>. Similarly, change all the </TR> to CR+LF and remove all the <TR>. You might then have a delimited list, assuming there are no other embedded TABs and CR/LFs, but you could REPLACE those first to get rid of them.
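
A rough sketch of those steps, written for SQL Server 2005 and reusing delly's dbo.fn_StripHTMLTags for the leftover tags. The sample HTML and the id="prices" attribute are invented for the example. It also assumes a case-insensitive collation (so </td> matches </TD> as well), no nested tables, and a table fragment that fits in varchar(4000).

```sql
DECLARE @html varchar(4000), @rows varchar(4000), @startPos int, @endPos int;

-- Invented sample document; in practice this would come from the HTML column.
SET @html = '<html><body><h1>Report</h1>'
          + '<table id="prices"><tr><td>Apple</td><td>1.20</td></tr>'
          + '<tr><td>Pear</td><td>0.95</td></tr></table>'
          + '<p>footer</p></body></html>';

-- 1. Find the wanted <table ...> tag and the first </table> after it.
SET @startPos = CHARINDEX('<table id="prices"', @html);
SET @endPos   = CHARINDEX('</table>', @html, @startPos);

IF @startPos > 0 AND @endPos > @startPos
BEGIN
    -- Keep only the table fragment; everything before and after is dropped.
    SET @rows = SUBSTRING(@html, @startPos, @endPos - @startPos);

    -- 2. Get rid of TABs and CR/LFs already in the data so they
    --    cannot be mistaken for delimiters later.
    SET @rows = REPLACE(@rows, CHAR(9), ' ');
    SET @rows = REPLACE(@rows, CHAR(13), '');
    SET @rows = REPLACE(@rows, CHAR(10), '');

    -- 3. End of cell becomes TAB, end of row becomes CR+LF.
    SET @rows = REPLACE(@rows, '</td>', CHAR(9));
    SET @rows = REPLACE(@rows, '</tr>', CHAR(13) + CHAR(10));

    -- 4. Remove whatever tags are left (<table ...>, <tr>, <td>).
    SET @rows = dbo.fn_StripHTMLTags(@rows);

    SELECT @rows AS DelimitedRows;
END
```

Each row comes out with a trailing TAB, because the last </td> in the row is replaced as well. Attributes in a different order, extra whitespace inside the tags, or <th> cells would all need more handling, which is the "patchy at best" problem mentioned above.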

Kristen
SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC