Please start any new threads on our new site at https://forums.sqlteam.com. We've got lots of great SQL Server experts to answer whatever question you can come up with.

 All Forums
 SQL Server 2005 Forums
 Transact-SQL (2005)
 Parsing words?

Author  Topic 

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-20 : 14:24:48
Hello All,

I've got a table with id's and terms (about 1m records), the average word count per term is three (this is my guess based on rough visual look, I never checked). I would like to take all those terms and beak them down into words, so the result should be id|word (about 3m?).

That way I can create fast (indexed) search engine. So, is there a tool (or query) to the task? (search tips are welcome as I have no idea what to search for)

georgev
Posting Yak Master

122 Posts

Posted - 2007-11-20 : 14:55:51
Have you thought about looking into FULL TEXT INDEXING on this table?


George
<3Engaged!
Go to Top of Page

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-20 : 15:25:18
Sure, but didn't like it.

Tested same data (about 30K records) with both methods, the concept of breaking the terms into words and searching indexed data works much faster then the FULL TEXT INDEX method, it shows even better when searching multiple for words. Though it might be because I don't know how to fine tune (or even rough tune ) the FULL TEXT INDEX.

I just need a tool to keep it updated.
Go to Top of Page

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-21 : 15:51:23
Last trial (if I'm on the wrong track ... please let me know)
Go to Top of Page

Vinnie881
Master Smack Fu Yak Hacker

1231 Posts

Posted - 2007-11-21 : 17:05:10
Create a function, if you post some sample data with table structures I'll help you with it.

Go to Top of Page

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-21 : 20:32:28
quote:
Originally posted by Vinnie881

Create a function, if you post some sample data with table structures I'll help you with it.


Thanks. I found something online ([url]http://www.thescripts.com/forum/thread81852.html[/url]) and modified it a bit. It's kind of working (the results are there but with an error message: 'Msg 245, Level 16, State 1, Line 16 - Conversion failed when converting the varchar value '*' to data type int.', any ideas?), however, I would like to modify it further more, I would like to split the text at any none alphanumeric characters (I know that I can use RegEx but don't know how). This is what I've got:

CREATE PROCEDURE SPLITIT(@id AS INTEGER, @sentence AS VARCHAR(1024)) AS
BEGIN
DECLARE @Position AS INTEGER, @Str AS VARCHAR(50)

SET @Position = CHARINDEX(' ', @sentence)

IF(@Position = 0)
INSERT INTO aaaTmpWordTable (id, Word) VALUES (@id, @sentence)
ELSE
BEGIN
SET @Str = LEFT(@sentence, @Position-1)
INSERT INTO aaaTmpWordTable (id, Word) VALUES (@id, @Str)
SET @Str = RIGHT(@sentence, LEN(@sentence) - @Position)
EXEC SPLITIT @id, @Str
END
END

Go
DROP TABLE aaaTmpWordTable
Go
CREATE TABLE aaaTmpWordTable (id INTEGER, Word NVARCHAR(1024))

DECLARE @PCStatus VARCHAR
DECLARE @TmpId INTEGER
DECLARE @TmpWords VARCHAR(1024)

DECLARE PCcursor CURSOR FOR
SELECT id, term
FROM table
OPEN PCcursor
/* read the 1st row */
SET @PCStatus = 0
WHILE @PCStatus = 0 BEGIN
FETCH NEXT FROM PCcursor INTO @TmpId, @TmpWords
SET @PCStatus = @@Fetch_status
IF @PCstatus <> 0 BREAK
EXEC SPLITIT @TmpId, @TmpWords
END
CLOSE PcCursor
DEALLOCATE PCcursor
Go to Top of Page

dataguru1971
Master Smack Fu Yak Hacker

1464 Posts

Posted - 2007-11-21 : 20:37:07
That sample code splits the words that are separated by a space. Separating them by any non alpha numeric character would be a bit more involved.

Sample data and desired results is key.




Poor planning on your part does not constitute an emergency on my part.

Go to Top of Page

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-21 : 21:08:08
quote:
Originally posted by dataguru1971

That sample code splits the words that are separated by a space. Separating them by any non alpha numeric character would be a bit more involved.

Sample data and desired results is key.


Sample data is irrelevant as there are more then 1m records for different subjects in different languages . Desired results however are any group of alphanumeric character, so all characters like '.', '$', '_', '-', '/', '*' etc. should be considered as spaces. Also, there should be some exceptions (the only one that comes to my mind right now is: 's).
Go to Top of Page

Vinnie881
Master Smack Fu Yak Hacker

1231 Posts

Posted - 2007-11-21 : 21:51:52
Keep in mind how this forum is so people who may know how to resolve your issue will help you. When you post your table structure, sample data and desired results, it allows someone to quickly come in and write the code needed to resolve your issue. When that is not provided, it takes longer because it makes who ever is trying to help you need to waste time to generate their own data set, so please understand it is not irrelevent. It's not that I or other people can't figure out what you want from what you have explained, it's that it takes more time to write the code.


You do not need your exact table structure, you just need something so it gives something to work off of.

I'm sure people will be more than happy to assist once you provide that information. If you want to do it yourself, here is what I would do:

Write a function to split the text string into words and insert them into a table. You should probably split the words using the ASCII function so you can say if ASCII between 64 and 122 or Ascii between 48 and 57.

That give you all A-Z a-z 0-9 as valid charectors, if it's not that then split. You will need to factor in if you want to include "'" or anything else as valid. Like I said I would write this for you if you provided the table structure and some sample data.
Go to Top of Page

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-21 : 22:08:15
quote:
Originally posted by Vinnie881

Keep in mind how this forum is so people who may know how to resolve your issue will help you. When you post your table structure, sample data and desired results, it allows someone to quickly come in and write the code needed to resolve your issue. When that is not provided, it takes longer because it makes who ever is trying to help you need to waste time to generate their own data set, so please understand it is not irrelevent. It's not that I or other people can't figure out what you want from what you have explained, it's that it takes more time to write the code.


You do not need your exact table structure, you just need something so it gives something to work off of.

I'm sure people will be more than happy to assist once you provide that information. If you want to do it yourself, here is what I would do:

Write a function to split the text string into words and insert them into a table. You should probably split the words using the ASCII function so you can say if ASCII between 64 and 122 or Ascii between 48 and 57.

That give you all A-Z a-z 0-9 as valid charectors, if it's not that then split. You will need to factor in if you want to include "'" or anything else as valid. Like I said I would write this for you if you provided the table structure and some sample data.



Thanks for the info. I assumed that giving sample data wouldn't cover much as the table is very large and there are many variations. I'll try to work with what you suggested and if further assistance needed I'll keep you guys updated.
Go to Top of Page

medtech26
Posting Yak Master

169 Posts

Posted - 2007-11-26 : 19:22:31
Update: (didn't work on it till now ... )

Solve this issue by coping the data into a temporary table, then replacing all unwanted alphanumeric characters with spaces and send to the recursive function. Works great.

Thanks All
Go to Top of Page
   

- Advertisement -