
Linguistic Stemming Question


rja (Programmer)
Joined: Jun 1, 2000 | Messages: 4 | Location: US
Working on a "simplified" program to normalize and count English words, picking up all those with the same stem (removing -ed, -ing, -s suffixes).

It works well for "standard" data, but not so well for "exceptions" -- words like "house", "vet", etc. Any suggestions?

Here's what I've got so far:

#-----------------------------------------------------------
# Purpose: This program takes a text file as input, removes
# the past tense ("ed"), plural ("s"), and participle ("ing")
# suffixes, and prints out a list of the word stems/types
# that appear in the text, along with a count of how many
# tokens there are of each type.
#-----------------------------------------------------------

$numberOfWords = 0;                        # no words yet

print "** Name of file to tokenise: ";     # ask for a file
$filename = <STDIN>;                       # get the file name
chomp($filename);

open (FILE, $filename) or                  # open the file
    die "Error: could not open file \"$filename\"\n";

while (<FILE>) {
    @tokens = split /\s/;                  # split the line on whitespace

    foreach $token (@tokens) {             # for each word...
        $token =~ s/[,.!?]$//;             # remove trailing punctuation
        $token =~ tr/A-Z/a-z/;             # normalise to lower case
        $token =~ s/s$|ed$|ing$//;         # remove suffixes
        ++$numberOfWords;                  # increment the token count
        ++$numberOfTokens{$token};         # increment the count for this type
    }
}

close(FILE);                               # close the file

# Sort the type counts and print them out:

foreach $type (sort keys %numberOfTokens) {
    print "$type\t$numberOfTokens{$type}\n";
}

$numberOfTypes = keys %numberOfTokens;     # get the number of types

print "** File contains $numberOfWords tokens over $numberOfTypes types.\n";

#-End-------------------------------------------------------
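To show what I mean, here's a quick throwaway test using the same substitution as above, run over the problem words:

foreach $word (qw(house housed housing houses vet vets vetted vetting)) {
    ($stem = $word) =~ s/s$|ed$|ing$//;    # same suffix removal as the main loop
    print "$word -> $stem\n";
}

# house -> house    housed -> hous    housing -> hous    houses -> house
# vet   -> vet      vets   -> vet     vetted -> vett     vetting -> vett

So "house"/"houses" and "housed"/"housing" come out under two different stems, and "vetted"/"vetting" leave a doubled consonant behind.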
 
Could I suggest that your line

@tokens = split /\s/;

be written as

@tokens = split /\s+/;

This deals with runs of whitespace; otherwise a string like " and  and" (with a doubled space) gives you empty tokens as well as the real ones, instead of two counts for the single token "and".

Your line

$token =~ s/s$|ed$|ing$//;   # Remove suffixes

could be extended slightly, like this maybe:

if ($token =~ /s$/) {
    $token =~ s/s$// unless UsuallyEndsWith($token, 's');
} elsif ($token =~ /ed$/) {
    ...

with UsuallyEndsWith defined as

sub UsuallyEndsWith($$) {
    my ($token, $suffix) = @_;
    if ($token =~ /^miss$|^bass$|other exceptions/) {
        1;
    } else {
        0;
    }
}

Interesting problem -- there's a good example of this kind of analysis in the Spring issue of The Perl Journal.

Mike
michael.j.lacey@ntlworld.com
Cargill's Corporate Web Site
Please -- Don't send me email questions without posting them in Tek-Tips as well. Better yet -- Post the question in Tek-Tips and send me a note saying "Have a look at so-and-so in the thingy forum would you?"
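Just to flesh that idea out, here's a rough, self-contained sketch of how the pieces might fit together. The exception lists inside UsuallyEndsWith are made-up placeholders (you'd fill them with real entries), and it reads from STDIN rather than prompting for a filename:

#!/usr/bin/perl
use strict;
use warnings;

# True if $token normally ends in $suffix, i.e. the apparent suffix is
# really part of the stem and should NOT be stripped.  These word lists
# are illustrative only, not a real exception dictionary.
sub UsuallyEndsWith {
    my ($token, $suffix) = @_;
    my %keep = (
        's'   => qr/^(?:miss|bass|gas|lens)$/,
        'ed'  => qr/^(?:bed|red|shed|seed)$/,
        'ing' => qr/^(?:king|ring|thing|spring)$/,
    );
    return $token =~ $keep{$suffix};
}

my %numberOfTokens;
my $numberOfWords = 0;

while (my $line = <STDIN>) {
    foreach my $token (split /\s+/, $line) {     # /\s+/ copes with runs of spaces
        $token =~ s/[,.!?]$//;                   # strip trailing punctuation
        $token = lc $token;                      # normalise to lower case
        next if $token eq '';                    # skip empty leftovers

        foreach my $suffix ('ing', 'ed', 's') {  # longest suffix first
            if ($token =~ /$suffix$/ && !UsuallyEndsWith($token, $suffix)) {
                $token =~ s/$suffix$//;
                last;                            # strip at most one suffix
            }
        }
        ++$numberOfWords;
        ++$numberOfTokens{$token};
    }
}

foreach my $type (sort keys %numberOfTokens) {
    print "$type\t$numberOfTokens{$type}\n";
}
my $numberOfTypes = keys %numberOfTokens;
print "** $numberOfWords tokens over $numberOfTypes types.\n";

You'd run it as something like perl stem.pl < mytext.txt. The hard part is still deciding what belongs in those exception lists -- a published stemmer such as the Porter algorithm carries a much bigger rule set than this.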
 