 How to extract data from HTML file?

Sun Foster
Aged Yak Warrior

515 Posts

Posted - 09/26/2007 : 06:32:39
I need data from a table in which one column stores an entire HTML file, including all tags such as <html>, <body>...
How can I extract data from the HTML file?

SwePeso
Patron Saint of Lost Yaks

30421 Posts

Posted - 09/26/2007 : 07:11:14

E 12°55'05.25"
N 56°04'39.16"

delly
Starting Member

1 Posts

Posted - 09/26/2007 : 07:22:00
I had the same problem and found this code, which creates a function that strips out the HTML tags. I hope it does what you want.

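-- Use CREATE FUNCTION instead if the function does not exist yet
ALTER FUNCTION [dbo].[fn_StripHTMLTags]
(@Dirty varchar(4000))
RETURNS varchar(4000)
AS
BEGIN
    DECLARE @Start int,
            @End int,
            @Length int

    -- Loop while a '<' followed somewhere later by a '>' remains
    WHILE CHARINDEX('<', @Dirty) > 0 AND CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty)) > 0
    BEGIN
        SELECT @Start = CHARINDEX('<', @Dirty),
               @End = CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty))
        SELECT @Length = (@End - @Start) + 1
        -- Cut out the tag, from '<' through '>'
        IF @Length > 0
            SELECT @Dirty = STUFF(@Dirty, @Start, @Length, '')
    END

    RETURN @Dirty
END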



United Kingdom
22859 Posts

Posted - 09/26/2007 : 07:42:15
I think you need to do this with some sort of client-side process.

Perl has an HTML object library which will parse the HTML and present the tags as a hierarchical array.
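For anyone not using Perl, the same client-side idea can be sketched with the HTMLParser class from Python's standard library. This is only a rough illustration, not a finished tool: the TextExtractor class, the extract_text helper and the sample HTML are made up for this example, and the sketch assumes you only want the visible text, so the contents of <script> and <style> are skipped.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical helper: collects text found between tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []       # text fragments in document order
        self._skip_depth = 0   # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        # The parser reports tag names in lower case.
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html):
    """Return the non-empty text fragments of an HTML string as a list."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.chunks


if __name__ == "__main__":
    sample = "<html><body><h1>Title</h1><p>Hello <b>world</b></p></body></html>"
    print(extract_text(sample))
```

In practice the HTML column would be read from the table by the client program and each value passed to extract_text, with the result written back or exported as needed.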

Any sort of pattern matching for <xxx>yyy</xxx> is going to be patchy at best unless the HTML is very clean (i.e. it NEVER has any errors caused by sloppy developers using IE to test their HTML) and also pretty simple.

But delly's function is definitely worth a look. You could perhaps strip everything before the appropriate <TABLE> tag (hopefully it has an ID or NAME that is unique) and everything after the </TABLE>, then change all the </TD> to TAB, or some such, and remove all the <TD>. Similarly, change all </TR> to CR+LF and remove all the <TR>, and then you might have a delimited list ... assuming no other embedded TABs and CR/LFs, but you could REPLACE those first to get rid of them.
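The replace-based steps above can be tried out client-side before committing to a T-SQL version. Below is a minimal Python sketch under several assumptions: the function name table_to_delimited is invented for this example, the table is located by a quoted id attribute, there are no nested tables, and the markup is reasonably clean. It is meant to show the order of the steps, not to be a robust parser.

```python
import re


def table_to_delimited(html, table_id):
    """Hypothetical helper: return the rows of one <table> as lists of cell text."""
    # Find the opening <table ...> tag with the given id (assumed unique and quoted).
    open_tag = re.search(
        r'<table\b[^>]*\sid\s*=\s*["\']' + re.escape(table_id) + r'["\'][^>]*>',
        html, re.IGNORECASE)
    if open_tag is None:
        return []
    # Drop everything before the tag and everything from the next </table> on.
    rest = html[open_tag.end():]
    close_tag = re.search(r'</table\s*>', rest, re.IGNORECASE)
    if close_tag is None:
        return []
    body = rest[:close_tag.start()]

    # Get rid of embedded TABs and CR/LFs first so they cannot act as delimiters.
    body = re.sub(r'[\t\r\n]+', ' ', body)
    # </TD> (and </TH>) become TAB, </TR> becomes CR+LF.
    body = re.sub(r'</t[dh]\s*>', '\t', body, flags=re.IGNORECASE)
    body = re.sub(r'</tr\s*>', '\r\n', body, flags=re.IGNORECASE)
    # Remove every remaining tag (<TD>, <TR>, and anything inside the cells).
    body = re.sub(r'<[^>]+>', '', body)

    rows = []
    for line in body.split('\r\n'):
        cells = [cell.strip() for cell in line.split('\t')]
        # Each row ends with a TAB, which leaves one empty trailing field.
        if cells and cells[-1] == '':
            cells.pop()
        if cells:
            rows.append(cells)
    return rows


if __name__ == "__main__":
    page = ('<html><body><p>intro</p><table id="prices">'
            '<tr><td>Apple</td><td>1.20</td></tr>\n'
            '<tr><td>Pear</td><td>0.80</td></tr></table></body></html>')
    for row in table_to_delimited(page, "prices"):
        print('\t'.join(row))
```

The same sequence (trim to the table, neutralise stray TABs and line breaks, swap closing tags for delimiters, strip what is left) maps onto CHARINDEX, SUBSTRING and REPLACE calls if it has to stay inside SQL Server.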

SQL Server Forums © 2000-2009 SQLTeam Publishing, LLC