| Author |
Topic |
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-20 : 14:24:48
|
| Hello All,I've got a table with id's and terms (about 1m records), the average word count per term is three (this is my guess based on rough visual look, I never checked). I would like to take all those terms and beak them down into words, so the result should be id|word (about 3m?).That way I can create fast (indexed) search engine. So, is there a tool (or query) to the task? (search tips are welcome as I have no idea what to search for) |
|
|
georgev
Posting Yak Master
122 Posts |
Posted - 2007-11-20 : 14:55:51
|
Have you thought about looking into FULL TEXT INDEXING on this table? George<3Engaged! |
 |
|
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-20 : 15:25:18
|
Sure, but didn't like it.Tested same data (about 30K records) with both methods, the concept of breaking the terms into words and searching indexed data works much faster then the FULL TEXT INDEX method, it shows even better when searching multiple for words. Though it might be because I don't know how to fine tune (or even rough tune ) the FULL TEXT INDEX.I just need a tool to keep it updated. |
 |
|
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-21 : 15:51:23
|
| Last trial (if I'm on the wrong track ... please let me know) |
 |
|
|
Vinnie881
Master Smack Fu Yak Hacker
1231 Posts |
Posted - 2007-11-21 : 17:05:10
|
| Create a function, if you post some sample data with table structures I'll help you with it. |
 |
|
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-21 : 20:32:28
|
quote: Originally posted by Vinnie881 Create a function, if you post some sample data with table structures I'll help you with it.
Thanks. I found something online ([url]http://www.thescripts.com/forum/thread81852.html[/url]) and modified it a bit. It's kind of working (the results are there but with an error message: 'Msg 245, Level 16, State 1, Line 16 - Conversion failed when converting the varchar value '*' to data type int.', any ideas?), however, I would like to modify it further more, I would like to split the text at any none alphanumeric characters (I know that I can use RegEx but don't know how). This is what I've got:CREATE PROCEDURE SPLITIT(@id AS INTEGER, @sentence AS VARCHAR(1024)) ASBEGIN DECLARE @Position AS INTEGER, @Str AS VARCHAR(50) SET @Position = CHARINDEX(' ', @sentence) IF(@Position = 0) INSERT INTO aaaTmpWordTable (id, Word) VALUES (@id, @sentence) ELSE BEGIN SET @Str = LEFT(@sentence, @Position-1) INSERT INTO aaaTmpWordTable (id, Word) VALUES (@id, @Str) SET @Str = RIGHT(@sentence, LEN(@sentence) - @Position) EXEC SPLITIT @id, @Str ENDENDGo DROP TABLE aaaTmpWordTableGo CREATE TABLE aaaTmpWordTable (id INTEGER, Word NVARCHAR(1024))DECLARE @PCStatus VARCHARDECLARE @TmpId INTEGERDECLARE @TmpWords VARCHAR(1024)DECLARE PCcursor CURSOR FOR SELECT id, term FROM table OPEN PCcursor/* read the 1st row */ SET @PCStatus = 0 WHILE @PCStatus = 0 BEGIN FETCH NEXT FROM PCcursor INTO @TmpId, @TmpWords SET @PCStatus = @@Fetch_status IF @PCstatus <> 0 BREAK EXEC SPLITIT @TmpId, @TmpWords ENDCLOSE PcCursorDEALLOCATE PCcursor |
 |
|
|
dataguru1971
Master Smack Fu Yak Hacker
1464 Posts |
Posted - 2007-11-21 : 20:37:07
|
That sample code splits the words that are separated by a space. Separating them by any non alpha numeric character would be a bit more involved.Sample data and desired results is key. Poor planning on your part does not constitute an emergency on my part. |
 |
|
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-21 : 21:08:08
|
quote: Originally posted by dataguru1971 That sample code splits the words that are separated by a space. Separating them by any non alpha numeric character would be a bit more involved.Sample data and desired results is key.
Sample data is irrelevant as there are more then 1m records for different subjects in different languages . Desired results however are any group of alphanumeric character, so all characters like '.', '$', '_', '-', '/', '*' etc. should be considered as spaces. Also, there should be some exceptions (the only one that comes to my mind right now is: 's). |
 |
|
|
Vinnie881
Master Smack Fu Yak Hacker
1231 Posts |
Posted - 2007-11-21 : 21:51:52
|
| Keep in mind how this forum is so people who may know how to resolve your issue will help you. When you post your table structure, sample data and desired results, it allows someone to quickly come in and write the code needed to resolve your issue. When that is not provided, it takes longer because it makes who ever is trying to help you need to waste time to generate their own data set, so please understand it is not irrelevent. It's not that I or other people can't figure out what you want from what you have explained, it's that it takes more time to write the code.You do not need your exact table structure, you just need something so it gives something to work off of.I'm sure people will be more than happy to assist once you provide that information. If you want to do it yourself, here is what I would do:Write a function to split the text string into words and insert them into a table. You should probably split the words using the ASCII function so you can say if ASCII between 64 and 122 or Ascii between 48 and 57.That give you all A-Z a-z 0-9 as valid charectors, if it's not that then split. You will need to factor in if you want to include "'" or anything else as valid. Like I said I would write this for you if you provided the table structure and some sample data. |
 |
|
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-21 : 22:08:15
|
quote: Originally posted by Vinnie881 Keep in mind how this forum is so people who may know how to resolve your issue will help you. When you post your table structure, sample data and desired results, it allows someone to quickly come in and write the code needed to resolve your issue. When that is not provided, it takes longer because it makes who ever is trying to help you need to waste time to generate their own data set, so please understand it is not irrelevent. It's not that I or other people can't figure out what you want from what you have explained, it's that it takes more time to write the code.You do not need your exact table structure, you just need something so it gives something to work off of.I'm sure people will be more than happy to assist once you provide that information. If you want to do it yourself, here is what I would do:Write a function to split the text string into words and insert them into a table. You should probably split the words using the ASCII function so you can say if ASCII between 64 and 122 or Ascii between 48 and 57.That give you all A-Z a-z 0-9 as valid charectors, if it's not that then split. You will need to factor in if you want to include "'" or anything else as valid. Like I said I would write this for you if you provided the table structure and some sample data.
Thanks for the info. I assumed that giving sample data wouldn't cover much as the table is very large and there are many variations. I'll try to work with what you suggested and if further assistance needed I'll keep you guys updated. |
 |
|
|
medtech26
Posting Yak Master
169 Posts |
Posted - 2007-11-26 : 19:22:31
|
Update: (didn't work on it till now ... )Solve this issue by coping the data into a temporary table, then replacing all unwanted alphanumeric characters with spaces and send to the recursive function. Works great.Thanks All |
 |
|
|
|