Sun Foster
Aged Yak Warrior
515 Posts |
Posted - 2007-09-26 : 06:32:39
|
I need data from a table in which one column stores an entire HTML file, including all tags such as <html>, <body>... How can I extract the data from the HTML? |
|
SwePeso
Patron Saint of Lost Yaks
30421 Posts |
|
delly
Starting Member
1 Post |
Posted - 2007-09-26 : 07:22:00
|
Had the same problem, but found this code to create a function to clean up the HTML tags. Hope it does what you want.

    -- Use CREATE FUNCTION instead of ALTER FUNCTION if the function does not exist yet.
    ALTER FUNCTION [dbo].[fn_StripHTMLTags] (@Dirty varchar(4000))
    RETURNS varchar(4000)
    AS
    BEGIN
        DECLARE @Start int, @End int, @Length int

        -- Keep going while there is a '<' with a matching '>' somewhere after it
        WHILE CHARINDEX('<', @Dirty) > 0 AND CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty)) > 0
        BEGIN
            SELECT @Start = CHARINDEX('<', @Dirty),
                   @End = CHARINDEX('>', @Dirty, CHARINDEX('<', @Dirty))
            SELECT @Length = (@End - @Start) + 1

            IF @Length > 0
            BEGIN
                -- Cut out the tag, from '<' through '>'
                SELECT @Dirty = STUFF(@Dirty, @Start, @Length, '')
            END
        END

        RETURN @Dirty
    END |
|
|
Kristen
Test
22859 Posts |
Posted - 2007-09-26 : 07:42:15
|
I think you need to do this with some sort of client-side process.

Perl has an HTML object library which will parse the HTML and present the tags as a hierarchical array.

Any sort of pattern matching for <xxx>yyy</xxx> is going to be patchy at best, unless the HTML is very clean (i.e. NEVER has any errors caused by sloppy developers using IE to test their HTML), and probably for any HTML that is not pretty simple too.

But delly's function is definitely worth a look. You could perhaps strip everything down to the appropriate <TABLE> tag (hopefully it has an ID or NAME that is unique), and then everything after the </TABLE>, and then change all the </TD> to TAB, or somesuch, and remove all the <TD>. Then similarly change all </TR> to CR+LF and remove all the <TR>, and then you might have a delimited list ... assuming no other embedded TABs and CR/LFs, but you could REPLACE those first to get rid of them.

Kristen |
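For anyone wanting to try the client-side route, here is a rough sketch of the idea above in Python rather than Perl, using the standard library's html.parser. It is only one possible way to do it, not something from the posts above: the names TableExtractor and table_to_delimited and the table id "prices" are made up for illustration, and it assumes the target table carries a unique id attribute. Cells become TAB-separated, rows CR+LF-separated, and embedded TABs and CR/LFs inside a cell are collapsed to single spaces first.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Hypothetical helper: collect the cell text of one <table>, chosen by id."""

    def __init__(self, table_id):
        super().__init__(convert_charrefs=True)
        self.table_id = table_id
        self.depth = 0        # 0 = outside the target table, 1 = inside, >1 = nested table
        self.in_cell = False  # True once a <td>/<th> of the target table has opened
        self.rows = []        # list of rows, each a list of raw cell strings

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <TABLE ID=...> works too.
        if tag == "table":
            if self.depth > 0:
                self.depth += 1
            elif dict(attrs).get("id") == self.table_id:
                self.depth = 1
        elif self.depth == 1:
            if tag == "tr":
                self.rows.append([])
            elif tag in ("td", "th"):
                if not self.rows:      # tolerate a cell with no enclosing <tr>
                    self.rows.append([])
                self.rows[-1].append("")
                self.in_cell = True

    def handle_endtag(self, tag):
        if tag == "table" and self.depth > 0:
            self.depth -= 1
        elif tag in ("td", "th") and self.depth == 1:
            self.in_cell = False

    def handle_data(self, data):
        # Text of a nested table is simply appended to the enclosing cell.
        if self.in_cell and self.depth >= 1:
            self.rows[-1][-1] += data


def table_to_delimited(html, table_id):
    """Return the table as TAB-delimited cells, one row per CR+LF-terminated line."""
    parser = TableExtractor(table_id)
    parser.feed(html)
    parser.close()
    # Collapse embedded TABs and CR/LFs (any whitespace run) to a single space.
    lines = ["\t".join(" ".join(cell.split()) for cell in row) for row in parser.rows]
    return "\r\n".join(lines)


# Made-up sample; the second row deliberately omits its closing </TD> tags.
SAMPLE = (
    '<html><body><p>intro</p><TABLE id="prices">'
    "<TR><TD>Widget</TD><TD>1.50</TD></TR>"
    "<TR><TD>Gadget &amp; co<TD>2.75\n</TR>"
    "</TABLE><p>after</p></body></html>"
)
delimited = table_to_delimited(SAMPLE, "prices")
```

The resulting string could then be written back to a staging table or a file for bulk import. Because the parser tracks tags rather than matching patterns, a missing </TD> like the one in the sample does not break the row, which is the main reason to prefer this over REPLACE-style string surgery when the HTML is untidy.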