Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL Puzzle 12b: Tokenization 10

Status
Not open for further replies.

vongrunt

Programmer
Mar 8, 2004
4,863
HR
There are two tables:
Code:
create table TestStrings( PK int primary key, string varchar(255) null )
insert into TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into TestStrings values (2, NULL )
insert into TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into TestStrings values (5, 'Blah!')
insert into TestStrings values (6, ':)')
insert into TestStrings values (7, '** snort **')

create table NoiseWords ( word varchar(16) )
insert into NoiseWords values ( 'the' )
insert into NoiseWords values ( 'do' )
insert into NoiseWords values ( 'is' )
insert into NoiseWords values ( 'on' )
create unique nonclustered index IX_NoiseWords on NoiseWord( word )


Objective

For all rows in TestStrings table, extract words into ordered set:
Code:
PK Pos Word
--.---.------
 1   1 Even 
 1   2 broken 
 1   3 clock 
 1   4 right 
 1   5 two 
 1   6 times 
 1   7 day
 1   8 accident
 2   1 Why
 ....
--.---.------

Tokenization rules are:

- words are composed from 0-9, A-Z, a-z and _ (underscore) characters
- all other characters are considered word delimiters
- words listed in NoiseWords table must be ignored
- in addition, all tokens shorter than 2 characters (for example 'a' or 'I') are considered noise words
- do not return empty tokens (""),
- return NULL when input string is NULL (see 2nd test string)

Everything must be returned in one set, with words properly enumerated (column Pos)


Rules & restrictions

Everything is allowed - SQL2000, 2005 - except calling external programs or going .NET/CLR.

There are no time limits - you can shoot immediately.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
My turn to award some purples and special thanks to maswien for the regex pointer. I wasn't aware that this was available (limited though it is - it dropped my code's time from over 15 minutes to 40 or so seconds).


Since no one has responded to my "bug" challenge:

My original code would ignore the last word in a sentence if the sentence didn't end with an invalid character.

SQLDenis' code doesn't handle gaps in the PK field in #TestStrings - it will simply repeat the list of valid words from the last actual PK sentence until the next PK occurs.

[vampire][bat]
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top