Tek-Tips is the largest IT community on the Internet today!

Members share and learn making Tek-Tips Forums the best source of peer-reviewed technical information on the Internet!

  • Congratulations Chriss Miller on being selected by the Tek-Tips community for having the most helpful posts in the forums last week. Way to Go!

SQL Puzzle 12b: Tokenization 10

Status
Not open for further replies.

vongrunt

Programmer
Mar 8, 2004
4,863
HR
There are two tables:
Code:
create table TestStrings( PK int primary key, string varchar(255) null )
insert into TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into TestStrings values (2, NULL )
insert into TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into TestStrings values (5, 'Blah!')
insert into TestStrings values (6, ':)')
insert into TestStrings values (7, '** snort **')

create table NoiseWords ( word varchar(16) )
insert into NoiseWords values ( 'the' )
insert into NoiseWords values ( 'do' )
insert into NoiseWords values ( 'is' )
insert into NoiseWords values ( 'on' )
create unique nonclustered index IX_NoiseWords on NoiseWord( word )


Objective

For all rows in TestStrings table, extract words into ordered set:
Code:
PK Pos Word
--.---.------
 1   1 Even 
 1   2 broken 
 1   3 clock 
 1   4 right 
 1   5 two 
 1   6 times 
 1   7 day
 1   8 accident
 2   1 Why
 ....
--.---.------

Tokenization rules are:

- words are composed from 0-9, A-Z, a-z and _ (underscore) characters
- all other characters are considered word delimiters
- words listed in NoiseWords table must be ignored
- in addition, all tokens shorter than 2 characters (for example 'a' or 'I') are considered noise words
- do not return empty tokens (""),
- return NULL when input string is NULL (see 2nd test string)

Everything must be returned in one set, with words properly enumerated (column Pos)


Rules & restrictions

Everything is allowed - SQL2000, 2005 - except calling external programs or going .NET/CLR.

There are no time limits - you can shoot immediately.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Small correction:
Code:
PK Pos Word
--.------.------
 1   1    Even
 1   2    broken
 1   3    clock
 1   4    right
 1   5    two
 1   6    times
 1   7    day
 1   8    accident
 [!]2   1    NULL
 3[/!]   1    Why
 ....
--.------.------

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I smell numbers table... [smile].

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
> extra point if I give this?

I'd say this feature is trivial to implement compared to rest of the problem.

If code works = purple thing.

Once we collect several code examples (aka: when George wakes up :p) we'll do testing on larger tables - 10,000 phrases or so. Who made the fastest code, makes next puzzle.

> you want multi-threaded as 2 words right?

Yup.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I left 6 out
You should be able to hit F5 and the whole thing should run in 1 shot since I used all temp tables
BTW on SQL server 2005 of course
Code:
create table #TestStrings( PK int primary key, string varchar(255) null )
insert into #TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into #TestStrings values (2, NULL )
insert into #TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into #TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into #TestStrings values (5, 'Blah!')
insert into #TestStrings values (6, ':)')
insert into #TestStrings values (7, '** snort **')

create table #NoiseWords ( word varchar(16) )
insert into #NoiseWords values ( 'the' )
insert into #NoiseWords values ( 'do' )
insert into #NoiseWords values ( 'is' )
insert into #NoiseWords values ( 'on' )
--create unique nonclustered index IX_#NoiseWords on NoiseWord( word )


-- Create our Pivot table ** do this only once-- populate it with 1000 rows
CREATE TABLE #NumberPivot (NumberID INT PRIMARY KEY)
DECLARE @intLoopCounter INT
SELECT @intLoopCounter =0
WHILE @intLoopCounter <=999 BEGIN
INSERT INTO #NumberPivot
VALUES (@intLoopCounter)
SELECT @intLoopCounter = @intLoopCounter +1
END
GO 


create table #HoldStuff(PK int,PosInString int,Word varchar(50))

declare @Maxid int,@loopID int
DECLARE @chvGroupNumbers VARCHAR(1000)
select @loopID =1,@Maxid =Max(pk) from #TestStrings
while @loopID <= @Maxid
begin

	--This is it
	SELECT @chvGroupNumbers =replace(replace(replace(replace(replace(string,'.',' '),'*',''),':',''),'?',''),'-',' ') from #TestStrings
	where PK =@loopID
	print '@chvGroupNumbers   === ' + @chvGroupNumbers
	If nullif(rtrim(@chvGroupNumbers),'') is null
	begin
	Insert into #HoldStuff
	select @loopID,1,NULL
	end
	else
	begin
	Insert into #HoldStuff
	SELECT @loopID,NumberID,SUBSTRING(' ' + @chvGroupNumbers + ' ', NumberID + 1,
	CHARINDEX(' ', ' ' + @chvGroupNumbers + ' ', NumberID + 1) - NumberID -1)AS Value
	FROM #NumberPivot n
	WHERE NumberID <= LEN(' ' + @chvGroupNumbers + ' ') - 1
	AND SUBSTRING(' ' + @chvGroupNumbers + ' ', NumberID, 1) = ' '
	and SUBSTRING(' ' + @chvGroupNumbers + ' ', NumberID + 1,
	CHARINDEX(' ', ' ' + @chvGroupNumbers + ' ', NumberID + 1) - NumberID -1) <> '.'
	AND SUBSTRING(' ' + @chvGroupNumbers + ' ', NumberID + 1,
	CHARINDEX(' ', ' ' + @chvGroupNumbers + ' ', NumberID + 1) - NumberID -1) not in(
		select word from #NoiseWords)
	and len(SUBSTRING(' ' + @chvGroupNumbers + ' ', NumberID + 1,
	CHARINDEX(' ', ' ' + @chvGroupNumbers + ' ', NumberID + 1) - NumberID -1)) >1
	end
	set @loopID =@loopID +1
end

select pk,ROW_NUMBER() OVER(PARTITION BY PK ORDER BY PosInString ) as 'pos ',Word,PosInString
  from #HoldStuff
order by 1

drop table #HoldStuff,#TestStrings,#NumberPivot,#NoiseWords

Denis The SQL Menace
SQL blog:
Personal Blog:
 
Other than

- 'eyeball[!]![/!]'/'Blah[!]![/!]'
- 'snort' has PosInString=2 instead of 4

... works fine.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
replace this
SELECT @chvGroupNumbers =replace(replace(
replace(replace(replace(string,'.',' '),'*',''),':',''),
'?',''),'-',' ') from #TestStrings
with
SELECT @chvGroupNumbers =replace(replace(
replace(replace(replace(replace(string,'.',' '),'*',''),
':',''),'?',''),'-',' '),'!','') from #TestStrings

and it should be fine

;-)

Denis The SQL Menace
SQL blog:
Personal Blog:
 
OK, this code ain't speed demon but works correctly. It's interesting how many things can be done with numbers table.

Let's see what other ppl have to show...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
My attempt - I've handled NULL in sentence 2 and for sentence 6 I've output <No Valid Words>

Code:
set nocount on

create table #TestStrings( PK int primary key, string varchar(255) null )
insert into #TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into #TestStrings values (2, NULL )
insert into #TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into #TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into #TestStrings values (5, 'Blah!')
insert into #TestStrings values (6, ':)')
insert into #TestStrings values (7, '** snort **')

create table #NoiseWords ( word varchar(16) )
insert into #NoiseWords values ( 'the' )
insert into #NoiseWords values ( 'do' )
insert into #NoiseWords values ( 'is' )
insert into #NoiseWords values ( 'on' )

--Build a list of valid charachaters
create table #ValidChars (ValidChar char(1))
declare @vc int
set @vc = 65
while @vc < 65 + 26
begin
	insert into #ValidChars values (char(@vc))
	set @vc = @vc + 1
end
set @vc = 97
while @vc < 97 + 26
begin
	insert into #ValidChars values (char(@vc))
	set @vc = @vc + 1
end
set @vc = 48
while @vc < 48 + 10
begin
	insert into #ValidChars values (char(@vc))
	set @vc = @vc + 1
end
insert into #ValidChars values ('_')

--Get the individual words
create table #WordsList (sentence int, pos int, string varchar(255) null ) 
declare @chars varchar(255)
declare @string varchar(255)
declare @pos int
declare @maxsentence int
declare @thissentence int
declare @charsinsentence int
declare @thischar int

set @thissentence = 1
set @maxsentence = (select max(PK) from #TestStrings)

while @thissentence <= @maxsentence
begin
	set @pos = 0
	set @string = (select string from #TestStrings where pk = @thissentence)
	set @charsinsentence = len(@string)
	set @thischar = 1
	set @chars = ''
	--insert NULL if necessary
	if @string is null
	begin
		insert into #WordsList (sentence, pos, string) values (@thissentence, 1, @string)
	end
	else
	while @thischar <= @charsinsentence
	begin
		if substring(@string, @thischar, 1) IN (SELECT ValidChar from #ValidChars)
		begin
			set @chars = @chars + substring(@string, @thischar, 1)
			set @thischar = @thischar + 1
			print @chars
		end
		else
		begin
			if len(@chars) > 2
			begin
				set @pos = @pos + 1
				if @chars not in (select word from #NoiseWords)
					insert into #WordsList (sentence, pos, string) values (@thissentence, @pos, @chars)
			end
			set @chars = ''
			set @thischar = @thischar + 1
		end
	end
	if (select count(*) from #WordsList where sentence = @thissentence) = 0 
					insert into #WordsList (sentence, pos, string) values (@thissentence, 1, '<No Valid Words>')
		
	set @thissentence = @thissentence + 1
end

select * from #WordsList

drop table #TestStrings
drop table #NoiseWords
drop table #ValidChars
drop table #WordsList

set nocount off

[vampire][bat]
 
Pos column has gaps caused by noise words (cross=7, road=9). Everything else works fine. Btw. good idea with '<No Valid Words>'...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Sorry about that.

Code:
while @thissentence <= @maxsentence
begin
	set @pos = 0
	set @string = (select string from #TestStrings where pk = @thissentence)
	set @charsinsentence = len(@string)
	set @thischar = 1
	set @chars = ''
	--insert NULL if necessary
	if @string is null
	begin
		insert into #WordsList (sentence, pos, string) values (@thissentence, 1, @string)
	end
	else
	while @thischar <= @charsinsentence
	begin
		if substring(@string, @thischar, 1) IN (SELECT ValidChar from #ValidChars)
		begin
			set @chars = @chars + substring(@string, @thischar, 1)
			set @thischar = @thischar + 1
		end
		else
		begin
			if len(@chars) > 2
			begin
				set @pos = @pos + 1
				if @chars not in (select word from #NoiseWords)
					insert into #WordsList (sentence, pos, string) values (@thissentence, @pos, @chars)
				[b]else
					set @pos = @pos - 1[/b]			
			end
			set @chars = ''
			set @thischar = @thischar + 1
		end
	end
	if (select count(*) from #WordsList where sentence = @thissentence) = 0 
					insert into #WordsList (sentence, pos, string) values (@thissentence, 1, '<No Valid Words>')
		
	set @thissentence = @thissentence + 1
end

should do the trick. I've also got rid of the print statement - it was there while I was testing.

[vampire][bat]
 
OK.

Any proposal how to do benchmarking? Methodology, how many rows etc... ?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Slight change to add the additional column that Denis suggested (i.e. each word's position within the string).

I've also set the position fields for <No Valid Words> records to 0 (I thought it made more sense than the 1 I originally used).

Code:
set nocount on

create table #TestStrings( PK int primary key, string varchar(255) null )
insert into #TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into #TestStrings values (2, NULL )
insert into #TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into #TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into #TestStrings values (5, 'Blah!')
insert into #TestStrings values (6, ':)')
insert into #TestStrings values (7, '** snort **')

create table #NoiseWords ( word varchar(16) )
insert into #NoiseWords values ( 'the' )
insert into #NoiseWords values ( 'do' )
insert into #NoiseWords values ( 'is' )
insert into #NoiseWords values ( 'on' )

--Build a list of valid charachaters
create table #ValidChars (ValidChar char(1))
declare @vc int
set @vc = 65
while @vc < 65 + 26
begin
	insert into #ValidChars values (char(@vc))
	set @vc = @vc + 1
end
set @vc = 97
while @vc < 97 + 26
begin
	insert into #ValidChars values (char(@vc))
	set @vc = @vc + 1
end
set @vc = 48
while @vc < 48 + 10
begin
	insert into #ValidChars values (char(@vc))
	set @vc = @vc + 1
end
insert into #ValidChars values ('_')

--Get the individual words
create table #WordsList (sentence int, pos int, posinstring int, string varchar(255) null ) 
declare @chars varchar(255)
declare @string varchar(255)
declare @pos int
declare @maxsentence int
declare @thissentence int
declare @charsinsentence int
declare @thischar int
declare @wordstart int

set @thissentence = 1
set @maxsentence = (select max(PK) from #TestStrings)

while @thissentence <= @maxsentence
begin
	set @pos = 0
	set @wordstart = 1
	set @string = (select string from #TestStrings where pk = @thissentence)
	set @charsinsentence = len(@string)
	set @thischar = 1
	set @chars = ''
	--insert NULL if necessary
	if @string is null
	begin
		insert into #WordsList values (@thissentence, 1, 1, @string)
	end
	else
	while @thischar <= @charsinsentence
	begin
		if substring(@string, @thischar, 1) IN (SELECT ValidChar from #ValidChars)
		begin
			set @chars = @chars + substring(@string, @thischar, 1)
			set @thischar = @thischar + 1
		end
		else
		begin
			if len(@chars) > 2
			begin
				set @pos = @pos + 1
				if @chars not in (select word from #NoiseWords)
				begin
					insert into #WordsList values (@thissentence, @pos, @wordstart, @chars)
					set @wordstart = @wordstart + @thischar       
				end
				else
        	set @pos = @pos - 1   
			end
			set @chars = ''
			set @thischar = @thischar + 1
			set @wordstart = @thischar			
		end
	end
	if (select count(*) from #WordsList where sentence = @thissentence) = 0 
					insert into #WordsList  values (@thissentence, 0, 0, '<No Valid Words>')
	set @thissentence = @thissentence + 1
end

select * from #WordsList

drop table #TestStrings
drop table #NoiseWords
drop table #ValidChars
drop table #WordsList

set nocount off

[vampire][bat]
 
OK, starfest is over. Here comes performance test.

This multiplies sample data 12 times... 7*2^12 = 7*4096 = 28672 rows. Let it be on tempdb :E. Numbers table is also created here; in reality this table is often permanent and rarely created on demand:
Code:
create table #TestStrings( PK int primary key, string varchar(255) null )
insert into #TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into #TestStrings values (2, NULL )
insert into #TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into #TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into #TestStrings values (5, 'Blah!')
insert into #TestStrings values (6, ':)')
insert into #TestStrings values (7, '** snort **')

declare @i int; set @i = 1
declare @lastPK int; select @lastPK = max(PK) from #TestStrings
while @i <= 12
begin
	insert into #TestStrings select @LastPK + PK, string from #TestStrings
	set @lastPK = @lastPK + @@rowcount
	set @i = @i + 1
end

create table #NoiseWords ( word varchar(16) )
insert into #NoiseWords values ( 'the' )
insert into #NoiseWords values ( 'do' )
insert into #NoiseWords values ( 'is' )
insert into #NoiseWords values ( 'on' )


CREATE TABLE #NumberPivot (NumberID INT PRIMARY KEY)
DECLARE @intLoopCounter INT
SELECT @intLoopCounter =0
WHILE @intLoopCounter <=999 
BEGIN
	INSERT INTO #NumberPivot VALUES (@intLoopCounter)
	SELECT @intLoopCounter = @intLoopCounter +1
END

Now testing candidates turned into stored procedures. I tried to make reply shorter... you fill in what is missing (see lines in bold):
Code:
-------
create procedure tokenize_sqldenis
as
set nocount on

[b]copy & paste code here[/b]
...
-- NOTE: row_number() commented so that people can test code on SQL2000 
select pk, -- , ROW_NUMBER() OVER(PARTITION BY PK ORDER BY PosInString ) as 'pos '
	Word,PosInString
from #HoldStuff
order by 1, 3

drop table #HoldStuff --,#TestStrings,#NumberPivot,#NoiseWords
go

-------
create proc tokenize_earthandfire
as
set nocount on

[b]copy & paste code here[/b]

...
select * from #WordsList

-- drop table #TestStrings
-- drop table #NoiseWords
drop table #ValidChars
drop table #WordsList

go

And exec times are (AMD64/3200)
- tokenize_sqldenis: 01:32.9
- tokenize_earthandfire: 04:36.9

Plz post your performance results (and hardware config) here.

[snail] Btw. Do you want to make it faster? In a new puzzle perhaps?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Running in QA with extended teststrings table:

Denis:
[tt]First run 1:06
Second run 0:50[/tt]

Me:
[tt]got bored waiting after 15 mins and aborted query[/tt]


spec:
Laptop with SQL2000 developer edition, 1.1Ghz, 512Mb Ram and hard disk in dire need of defragging.

I think Denis wins the speed stakes (but only just [smile])


As an aside, I've been experimenting with my code to allow for for both hyphenated words and words with an apostrophe in them to be acceptable. I added a couple of sentences to the teststrings table to provide for some additional testing and came up with interesting results.

Both Denis' and my solution work (albeit at different speeds) on the test strings provided by vongrunt, but both have a bug in them. A different bug in each one.

So, a bonus star for the first person to spot both bugs


[vampire][bat]
 

not exactly right, just an idea...

Code:
/* two temporary tables */

select * into tempstrings from teststrings 

create table results 
 ( seq int identity(1,1), pk int, string varchar(255))


/**/
update tempstrings 
 set string = rtrim(ltrim(string))

insert into results (pk, string)
  select pk, left(string, charindex(' ', string, 1) - 1 ) from tempstrings 
 where charindex(' ', string, 1) > 1 or string is null
 
update tempstrings 
 set string = 
  right(string, len(string) - charindex(' ', string, 1))
   where charindex(' ', string, 1) > 1

while @@rowcount > 0 
begin

update tempstrings 
 set string = ltrim(string)

 insert into results 
  select pk, 
   left(string, charindex(' ', string, 1) - 1 ) 
  from tempstrings 
   where charindex(' ', string, 1) > 1
 
 update tempstrings 
  set string = 
   right(string, len(string) - charindex(' ', string, 1))
   where charindex(' ', string, 1) > 1

end

/* could be any thing depends on the rule */
delete results 
 where string like '%[^a-z_]%' and string not like '%[a-z]%'

delete results 
 where string in ( select word from noisewords)

/* reorder the results */

select identity(int, 1, 1) as seq, t0.pk, t0.string 
 into final_reaults 
 from results t0 order by pk, seq asc
 
Status
Not open for further replies.

Part and Inventory Search

Sponsor

Back
Top