SQL Puzzle 12b: Tokenization 10

Status
Not open for further replies.

vongrunt

Programmer
Mar 8, 2004
4,863
HR
There are two tables:
Code:
create table TestStrings( PK int primary key, string varchar(255) null )
insert into TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into TestStrings values (2, NULL )
insert into TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into TestStrings values (4, 'Please do not look into laser with remaining eyeball!')
insert into TestStrings values (5, 'Blah!')
insert into TestStrings values (6, ':)')
insert into TestStrings values (7, '** snort **')

create table NoiseWords ( word varchar(16) )
insert into NoiseWords values ( 'the' )
insert into NoiseWords values ( 'do' )
insert into NoiseWords values ( 'is' )
insert into NoiseWords values ( 'on' )
create unique nonclustered index IX_NoiseWords on NoiseWords( word )


Objective

For all rows in TestStrings table, extract words into ordered set:
Code:
PK Pos Word
--.---.------
 1   1 Even 
 1   2 broken 
 1   3 clock 
 1   4 right 
 1   5 two 
 1   6 times 
 1   7 day
 1   8 accident
 2   1 Why
 ....
--.---.------

Tokenization rules are:

- words are composed of 0-9, A-Z, a-z and _ (underscore) characters
- all other characters are considered word delimiters
- words listed in NoiseWords table must be ignored
- in addition, all tokens shorter than 2 characters (for example 'a' or 'I') are considered noise words
- do not return empty tokens ("")
- return NULL when input string is NULL (see 2nd test string)

Everything must be returned in one set, with words properly enumerated (column Pos)
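
As a quick illustration of the first two rules, the word/delimiter distinction boils down to a single character class, which T-SQL can test with PATINDEX (a sketch only, not part of the puzzle; the sample text is taken from test string 1):
Code:
-- Sketch: find the first delimiter (any character outside 0-9, A-Z, a-z, _)
declare @s varchar(255)
set @s = 'day....on accident.'
select patindex('%[^A-Za-z0-9_]%', @s) as FirstDelimiter  -- 4: the first '.'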


Rules & restrictions

Everything is allowed - SQL2000, 2005 - except calling external programs or going .NET/CLR.

There are no time limits - you can shoot immediately.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
This is my extended version to handle apostrophes in words and hyphenated words.


Code:
set nocount on

create table #TestStrings( PK int primary key, string varchar(255) null )
insert into #TestStrings values (1, 'Even a broken clock is right two times a day....on accident.')
insert into #TestStrings values (2, NULL )
insert into #TestStrings values (3, 'Why did the multi-threaded chicken cross the road? other To side. get the')
insert into #TestStrings values (4, 'Please do not look into laser with remaining eyeballs!')
insert into #TestStrings values (5, 'Blah!')
insert into #TestStrings values (6, ':)')
insert into #TestStrings values (7, '** snort **')
--a couple of extra sentences added to test hyphenation and apostrophes
--these two records have been changed from my original so that they don't show the bugs
--  referred to in my previous post
insert into #TestStrings values (8, 'Here''s an apostrophe and an ignored--hyphen--and an unattached - hyphen.')
insert into #TestStrings values (9, 'This has a '' quote symbol not attached to a word.')


create table #NoiseWords ( word varchar(16) )
insert into #NoiseWords values ( 'the' )
insert into #NoiseWords values ( 'do' )
insert into #NoiseWords values ( 'is' )
insert into #NoiseWords values ( 'on' )
--I've added 'and' to the list
insert into #NoiseWords values ( 'and') 

--Build a list of valid characters
create table #ValidChars (ValidChar char(1), IsAlpha bit)
declare @vc int
set @vc = 65
while @vc < 65 + 26
begin
  insert into #ValidChars values (char(@vc), 1)
  set @vc = @vc + 1
end
set @vc = 97
while @vc < 97 + 26
begin
  insert into #ValidChars values (char(@vc), 1)
  set @vc = @vc + 1
end
set @vc = 48
while @vc < 48 + 10
begin
  insert into #ValidChars values (char(@vc), 0)
  set @vc = @vc + 1
end
insert into #ValidChars values ('_', 0)

--Get the individual words
create table #WordsList (sentence int, pos int, posinstring int, string varchar(255) null ) 
declare @chars varchar(255)
declare @string varchar(255)
declare @pos int
declare @maxsentence int
declare @thissentence int
declare @charsinsentence int
declare @thischar int
declare @wordstart int
declare @processed bit

--control variable to determine what to do with missing sentences
--i.e. those for which there is no record in #TestStrings (pk 8, and 10,11,12,13,14)
--this is a different situation from a record existing but having no valid characters
declare @showmissing bit
set @showmissing = 1

set @thissentence = 1
set @maxsentence = (select max(PK) from #TestStrings)

while @thissentence <= @maxsentence
begin
  set @pos = 0
  set @wordstart = 1
  set @string = (select string from #TestStrings where pk = @thissentence)
  set @charsinsentence = len(@string)
  set @thischar = 1
  set @chars = ''
	--insert NULL as necessary
	if @string is null
  	insert into #WordsList values (@thissentence, 1, 1, null)
  else
	begin	--not needed but makes alignment neater
  	while @thischar <= @charsinsentence
  	begin
			set @processed = 0
	  	if substring(@string, @thischar, 1) IN (select ValidChar from #ValidChars)
				or @thischar > 1 and @thischar < @charsinsentence and 
							(substring(@string, @thischar, 1) = '-' or substring(@string, @thischar, 1) = '''') and 
							substring(@string, @thischar - 1, 1) IN 
							(select ValidChar from #ValidChars where IsAlpha = 1) and
							substring(@string, @thischar +1, 1) IN 
							(select ValidChar from #ValidChars where IsAlpha = 1)
    	begin
      	set @chars = @chars + substring(@string, @thischar, 1)
      	set @thischar = @thischar + 1
				set @processed = 1
    	end
    	if @processed = 0 or @thischar > @charsinsentence
			begin
    		if len(@chars) > 2
      	begin
        	set @pos = @pos + 1
        	if @chars not in (select word from #NoiseWords)
	     			insert into #WordsList values (@thissentence, @pos, @wordstart, @chars)
        	else
						set @pos = @pos - 1   
      	end
      	set @chars = ''
      	set @thischar = @thischar + 1
      	set @wordstart = @thischar
    	end
  	end
  	if (select count(*) from #WordsList where sentence = @thissentence) = 0 
	  	insert into #WordsList  values (@thissentence, 0, 0, '<No Valid Words>')
	end	--matches the begin above for alignment purposes
	set @thissentence = @thissentence + 1  
end

select * from #WordsList

drop table #TestStrings
drop table #NoiseWords
drop table #ValidChars
drop table #WordsList

set nocount off

[vampire][bat]
 
> So, a bonus star for the first person to spot both bugs

I'll keep my mouth shut on this one [noevil]

maswien: the string column is not always well tokenized ("multi-threaded", "side.", "day....on" etc.) and the seq column keeps sort order but doesn't provide enumeration (1, 2, 3...).

The code itself should be a lot faster than everything else so far... first run finished in 15.0 seconds, second in 9.1s.

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I'm revising mine based on maswien's regex (I wasn't aware that this was available in TSQL). I'm down to 42 seconds for the large recordset (with my bug). Just got to get rid of that!!

[vampire][bat]
 
This machine is an IBM ThinkPad Centrino 1.86 GHz, 1GB RAM

exec tokenize_sqldenis: 11 seconds first run; 6 seconds second run with the PRINT statement taken out

exec tokenize_earthandfire: 5 minutes 12 seconds first run, 4 minutes 45 seconds second run

So it looks like the slower the computer is, the bigger the time difference between the two procs.
Maybe I'll run it at work tomorrow on the 8-CPU 4GB mini-monster.



Denis The SQL Menace
SQL blog:
Personal Blog:
 
That regex thing is what I originally had in mind (see 1st and 2nd tokenization rule).

AFAIK it is possible to do everything without looping of any kind (SQL2000 and above - numbers table, SQL2005 - UDF and CROSS APPLY).

> This machine is a IBM Thinkpad Centrino 1.86 Ghz, 1GB RAM

Weird... are you absolutely sure about the data (there should be circa 29,000 rows in #TestStrings)?

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
Yup, that's the correct count.

Looks like PRINT greatly affects results (AMD64/3200, 1GB RAM, 2x36GB SCSI, no RAID):
- original code: 00:59.2, 00:25.7, 01:32.0
- without PRINT: 12.9, 10.8, 10.8

On another machine the difference almost does not exist (Xeon 3.06GHz, 2GB RAM, RAID5 array):
- original code: 00:16.1, 00:16.0, 00:16.1
- without PRINT: 00:16.2, 00:15.8, 00:15.9

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
I've not set up the SP, but I'm down to 46 seconds for mine on my machine (including trapping both bugs)

[vampire][bat]
 
Hm... doing everything set-based (no explicit loops) is a bit tricky because of:

- ranking per group (Pos column) - SQL2000 only of course,
- exceptions (string is NULL or string has no words)

And after all optimizations some 40% of exec time went on noise word checks (with LEFT OUTER JOIN, IN() was slower).
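
The noise-word check he benchmarked against could be sketched like this (a fragment only; R is a hypothetical name for the derived set of candidate tokens — in the actual code below that role is played by the nested subqueries):
Code:
-- Sketch (assumes R(PK, strpos, word) holds candidate tokens):
select R.PK, R.strpos, R.word
from R
left outer join #NoiseWords NW on NW.word = R.word
where NW.word is null   -- no match => not a noise word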

Numbers table:
Code:
create table Numbers( N int primary key )
insert into Numbers values (1)

while @@rowcount < 512
	insert into Numbers select N+(select max(N) from Numbers) from Numbers

And actual code (SQL2000)
Code:
create proc tokenize_vongrunt
as
set nocount on

-- this assumes noise word list is not too long
declare @nwlist varchar(2000); set @nwlist = '-'
select @nwlist = @nwlist+word+'-' from #NoiseWords

select identity(int, 1, 1) as rowno, PK, strpos, word
into #blah
from
(	select PK, strpos, word
	from
	(	select PK, strpos, left(substr, patindex('%[^A-Za-z0-9_]%', substr+' ')-1) as word
		from
		(	select TS.PK, N.N as strpos, substring( string + ' ', N.N, 255) as substr
			from #teststrings TS
			inner join Numbers N on N.N between 1 and Len(TS.string)
			where substring(' '+TS.string, N.N, 2) like '[^A-Za-z0-9_][A-Za-z0-9_]'
		) T
	) R
	where len(word) >= 2
		and charindex('-'+word+'-', @nwlist) = 0
union all
	select PK, 1, NULL
	from #TestStrings
	where string is null 
	-- uncomment line below to return NULLs for rows with no (valid) words (c) earthandfire
	-- or string not like '%[A-Za-z0-9_]%'
) X
order by PK, rowno

select A.PK, A.rowNo-B.minrow+1 as Pos, A.strpos, A.word
from #blah A
inner join
(	select PK, min(rowno) as minrow
	from #blah
	group by PK
) B on A.PK=B.PK
order by A.PK, pos

drop table #blah

go

Exec times:

- AMD64/3200: 6.0, 6.2, 6.0
- Xeon 3.06GHz: 6.5, 6.6, 6.4

I'm curious to see how UDF + OUTER|CROSS APPLY (SQL2005) stands against all code we posted...

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 
vongrunt, yours runs at 17 - 20 seconds, Denis' is running in about 30 - 35 seconds and I've got mine down to 40 - 45 seconds (just over 42 seconds average in 10 runs) with both bugs handled.

I've positioned the timer code after the build of #TestStrings and #NoiseWords. Initially I just used QA's display, which is not so accurate.

[vampire][bat]
 
Vongrunt said:
(aka: when George wakes up :p)

...

I'm curious to see how UDF + OUTER|CROSS APPLY (SQL2005) stands against all code we posted...

Well, I can tell you that the UDF stuff is pretty ugly. I needed to use real tables instead of the temp table stuff, and with the large data set, it took about 2 1/2 minutes without accommodating the NULLs for String. [sad]

I'm gonna work on this a little more.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Is 5 seconds good?

I'm running this on a toaster. To jack up the speed a little, I put the toaster in the microwave. [wink]

I have to make another modification or 2, so I'll post the code soon.

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
earthandfire

Before testing this code, put your machine in the microwave. It helps. I thought that putting the microwave in the hot tub would help, but sadly, it seems to slow things down a little. [sad]

Anyway, here's the code.
Code:
Alter Procedure tokenize_gmmastros
As
Set NOCOUNT ON
Declare @Temp Table(PK Integer Primary Key, String VarChar(1000))
Declare @Noise Table(RowId Integer Identity(1,1), Word VarChar(100))


Insert Into @Noise(Word) 
Select Word 
From NoiseWords

-- I took the replace stuff from Denis's code.  Thanks.
Insert Into @Temp(PK, String)
Select PK, LTrim(RTrim(replace(replace(
replace(replace(replace(replace(String,'.',' '),'*',''),
':',''),'?',''),'-',' '),'!','')))
From TestStrings

-- Get rid of noise words
Declare @i Integer
Declare @Max Integer
Declare @Word VarChar(100)

Select 	@i = 1,
		@Max = Max(RowId)
From	@Noise

While @i <= @Max
	Begin
		Select	@Word = Word
		From	@Noise
		Where	Rowid = @i

		Update 	@Temp
		Set 	String = Replace(String, ' ' + @Word + ' ', ' ')
		Set @i = @i + 1

		Update	@Temp
		Set		String = Left(String, Len(String) - Len(@Word))
		Where	Right(String, Len(@Word)) = @Word

		Update	@Temp
		Set		String = Right(String, Len(String) - Len(@Word))
		Where	Left(String, Len(@Word)) = @Word

	End

Declare @Out Table(PK Integer, Pos Integer, Word VarChar(100) Primary Key (PK, POS))

Insert Into @Out(PK, Pos, Word) Select PK, 1, NULL From @Temp Where String Is Null

Delete From @Temp Where Len(IsNull(String, '')) < 2

Set @i = 1
While	(Select Max(Len(String)) From @Temp) > 0
	Begin
		Update @Temp Set String = LTrim(RTrim(String))

		-- Insert the first word of the sentence
		-- into the output table.
		Insert	Into @Out(PK, Pos, Word)
		Select 	PK, @i, Left(String, Case When PatIndex('%[^A-Za-z0-9_]%', string) = 0 Then Len(String) Else PatIndex('%[^A-Za-z0-9_]%', string) - 1 End)
		From	@Temp
		Where   Case 	When PatIndex('%[^A-Za-z0-9_]%', string) = 0 
						Then Len(String) 
						Else PatIndex('%[^A-Za-z0-9_]%', string) - 1 
						End >= 2
		
		-- Remove the word we just added to the output
		Update 	T
		Set		T.String = LTrim(Right(T.String, Len(T.String) - Len(O.Word)))
		From	@Temp T
				Inner Join @Out O
					On T.PK = O.PK
		Where	O.Pos = @i
		
		-- Out damn space, out!
		Update @Temp Set String = LTrim(String)

		-- Get rid of 1 letter words
		Update 	@Temp
		Set		String = Right(String, Len(String)-2)
		Where	String Like '[A-Za-z0-9_] %'

		-- Remove records from the temp table
		-- that we don't have to look at anymore.
		Delete
		From	@Temp
		Where	Case 	When PatIndex('%[^A-Za-z0-9_]%', string) = 0 
						Then Len(String) 
						Else PatIndex('%[^A-Za-z0-9_]%', string) - 1 End < 3
				And Len(String) < 3


		Set @i = @i + 1

	End

Select * from @Out Order By PK, Pos

go

Declare @S DateTime
Set @S = GetDate()

exec tokenize_gmmastros

Select DateDiff(Millisecond, @S, GetDate())

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
I've run it five times; it takes between 20 and 25 seconds, a bit slower than vongrunt's - and you don't handle sentence 6. Disgraceful [wink]

[vampire][bat]
 
OK, I know that it's slower than vongrunt's. His runs in 4.5 seconds on my puter.

Could you please explain, "you don't handle sentence 6"?

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
I was only joking. Sentence 6 returns an empty string (after removing the symbols), so I added <No Valid Words> and vongrunt added an option for returning NULL. Both you and Denis skip the sentence.

[vampire][bat]
 
Here's a version that should give a better execution time.

Code:
Alter Procedure tokenize_gmmastros
As
Set NOCOUNT ON
Declare @Temp 
Table	(
		PK Integer Primary Key, 
		String VarChar(1000)
		)
Declare @Noise Table(RowId Integer Identity(1,1), Word VarChar(100))

Insert 
Into 	@Noise(Word) 
Select 	Word 
From 	#NoiseWords

-- I took the replace stuff from Denis's code.  Thanks.
Insert Into @Temp(PK, String)
Select PK, LTrim(RTrim(replace(replace(
replace(replace(replace(replace(String,'.',' '),'*',''),
':',''),'?',''),'-',' '),'!','')))
From #TestStrings

-- Get rid of noise words
Declare @i Integer
Declare @Max Integer
Declare @Word VarChar(100)

Select 	@i = 1,
		@Max = Max(RowId)
From	@Noise

While @i <= @Max
	Begin
		Select	@Word = Word
		From	@Noise
		Where	Rowid = @i

		Update 	@Temp
		Set 	String = Replace(String, ' ' + @Word + ' ', ' ')
		Where	String Like '% ' + @Word + ' %'
		Set @i = @i + 1

		Update	@Temp
		Set		String = Left(String, Len(String) - Len(@Word))
		Where	String Like '% ' + @Word

		Update	@Temp
		Set		String = Right(String, Len(String) - Len(@Word))
		Where	String Like @Word + ' %'
	End

Declare @Out Table(PK Integer, Pos Integer, Word VarChar(100) Primary Key (PK, POS))

Insert Into @Out(PK, Pos, Word) Select PK, 1, NULL From @Temp Where String Is Null

Delete From @Temp Where Len(IsNull(String, '')) < 2

While Exists(Select 1 From @Temp Where PatIndex('% [A-Za-z0-9_] %', String) > 1)
	Update	@Temp
	Set		String = Left(String, PatIndex('% [A-Za-z0-9_] %', String)) +
			Right(String, Len(String) - PatIndex('% [A-Za-z0-9_] %', String)-2)
	Where	String Like '% [A-Za-z0-9_] %'

Set @i = 1
While	(Select Max(Len(String)) From @Temp) > 0
	Begin

		-- Insert the first word of the sentence
		-- into the output table.
		Insert	Into @Out(PK, Pos, Word)
		Select 	PK, 
				@i, 
				Left(String, Case 	When CharIndex(' ', string) = 0 
									Then Len(String) 
									Else CharIndex(' ', string) - 1 
									End)
		From	@Temp
		
		-- Remove the word we just added to the output
		Delete	@Temp
		Where	CharIndex(' ', String) = 0

		Update 	@Temp
		Set		String = LTrim(
								Right(String, Len(String) - Case When CharIndex(' ', string) = 0 
																 Then Len(String) 
																 Else CharIndex(' ', string) - 1 End)
								)
	
		Set @i = @i + 1

	End

Select * from @Out Order By PK, Pos

-George

Strong and bitter words indicate a weak cause. - Fortune cookie wisdom
 
Yer code runs 11.2, 9.8, 9.7 and 9.7 seconds on my box-that-doesn't-like-PRINT-statement.

Btw, I tried APPLY() yesterday. Naturally it is still implicitly row-based - for each row in the base table the server calls the UDF, merges the result and that's it. So... APPLY() is faster than a WHILE loop - but not by much - 8.8 seconds with all hacks. Good thing: evaluating a table-valued function for each row was not possible in SQL2000. Bad thing: any WHERE clause over the applied data slows the query down a lot (almost 4x in this case - 32.0s).
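
For reference, the APPLY shape described there would look roughly like this - a sketch only, assuming a table-valued UDF dbo.Tokenize(@s) returning (pos, word); the function name is hypothetical, since no such UDF was posted in this thread:
Code:
-- Sketch: dbo.Tokenize is a hypothetical table-valued UDF (SQL2005+)
select TS.PK, T.pos, T.word
from #TestStrings TS
cross apply dbo.Tokenize(TS.string) T   -- use OUTER APPLY to keep NULL strings
order by TS.PK, T.pos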

And damn, that exec plan in Management Studio looks butt-ugly. What happened with ol' icons? [soapbox]

------
Theory: everybody knows everything, nothing works
Practice: everything works, nobody knows why

[banghead]
 