There are two tables:
Code:
create table TestStrings( PK int primary key, string varchar(255) null )
insert into TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into TestStrings values (2, NULL )
insert into TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into TestStrings values (5, 'Blah!')
insert into TestStrings values (6, ':)')
insert into TestStrings values (7, '** snort **')
create table NoiseWords ( word varchar(16) )
insert into NoiseWords values ( 'the' )
insert into NoiseWords values ( 'do' )
insert into NoiseWords values ( 'is' )
insert into NoiseWords values ( 'on' )
create unique nonclustered index IX_NoiseWords on NoiseWords( word )
Objective
For all rows in the TestStrings table, extract the words into an ordered set:
Code:
PK Pos Word
--.---.------
1 1 Even
1 2 broken
1 3 clock
1 4 right
1 5 two
1 6 times
1 7 day
1 8 accident
3 1 Why
....
--.---.------
Tokenization rules are:
- words are composed from 0-9, A-Z, a-z and _ (underscore) characters
- all other characters are considered word delimiters
- words listed in NoiseWords table must be ignored
- in addition, all tokens shorter than 2 characters (for example 'a' or 'I') are considered noise words
- do not return empty tokens ("")
- return NULL when input string is NULL (see 2nd test string)
Everything must be returned in one result set, with the words of each string numbered in order (column Pos)
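For anyone who wants to check their result set, here is a rough reference sketch of the rules above in Python. It is not a solution (the puzzle asks for T-SQL), only an executable restatement of the rules. The names `tokenize`, `TEST_STRINGS` and `NOISE_WORDS` are made up for this sketch. Three things are assumptions, not stated in the rules: a NULL string gives a single row with NULL Pos and Word, a string with no valid tokens gives no rows at all, and noise words are matched case-insensitively.

```python
import re

# Sample data copied from the TestStrings and NoiseWords tables in the post.
TEST_STRINGS = [
    (1, "Even a broken clock is right two times a day....on accident."),
    (2, None),
    (3, "Why did the multi-threaded chicken cross the road? other To side. get the"),
    (4, "Please do not look into laser with remaining eyeball!"),
    (5, "Blah!"),
    (6, ":)"),
    (7, "** snort **"),
]
NOISE_WORDS = {"the", "do", "is", "on"}

# Words are runs of 0-9, A-Z, a-z and underscore; anything else delimits.
WORD_RE = re.compile(r"[0-9A-Za-z_]+")


def tokenize(pk, text, noise=NOISE_WORDS):
    """Return (PK, Pos, Word) rows for one input string."""
    if text is None:
        # Assumption: NULL input yields one row with NULL Pos and NULL Word.
        return [(pk, None, None)]
    words = [
        w for w in WORD_RE.findall(text)  # findall never yields empty tokens
        # Assumption: noise-word comparison ignores case.
        if len(w) >= 2 and w.lower() not in noise
    ]
    # Pos numbers the surviving words only, starting at 1 for each string.
    # Assumption: a string with no surviving words produces no rows.
    return [(pk, pos, w) for pos, w in enumerate(words, start=1)]


rows = [row for pk, text in TEST_STRINGS for row in tokenize(pk, text)]
for row in rows:
    print(row)
```

For PK 1 this gives the same eight rows as the sample output above (Even, broken, clock, right, two, times, day, accident).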
Rules & restrictions
Everything is allowed - SQL2000, 2005 - except calling external programs or going .NET/CLR.
There are no time limits - you can shoot immediately.
------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why