
Linguistic Stemming Question


rja

Programmer
Jun 1, 2000
4
US
Working on a "simplified" program to normalize and count English words, picking up all those with the same stem (removing -ed, -ing, and -s suffixes).

It works well for "standard" data, but not so well for "exceptions", words like "house", "vet", etc. Any suggestions?

Here's what I've got so far:

#-----------------------------------------------------------
# Purpose: This program takes a text file as input, removes
# the past-tense ("ed"), plural ("s"), and participle ("ing")
# suffixes, and prints out a list of the word stems/types
# that appear in the text, along with a count of how many
# tokens there are of each type.
#-----------------------------------------------------------

$numberOfWords = 0;                      # no words yet

print "** Name of file to tokenise: ";   # ask for a file
$filename = <STDIN>;                     # get the file name
chomp($filename);

open (FILE, $filename) or                # open the file
    die "Error: could not open file \"$filename\"\n";

while (<FILE>) {
    @tokens = split /\s/;                # split the line

    foreach $token (@tokens) {           # for each word...
        $token =~ s/[,.!?]$//;           # remove trailing punctuation
        $token =~ tr/A-Z/a-z/;           # normalise to lower case
        $token =~ s/s$|ed$|ing$//;       # remove suffixes
        ++$numberOfWords;                # increment the token count
        ++$numberOfTokens{$token};       # increment the count for this type
    }
}

close(FILE);                             # close the file

# Sort the hash of type counts and print them out:

foreach $type (sort keys %numberOfTokens) {
    print "$type\t$numberOfTokens{$type}\n";
}

$numberOfTypes = keys %numberOfTokens;   # number of distinct types

print "** File contains $numberOfWords tokens over $numberOfTypes types.\n";

#-End-------------------------------------------------------
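One rough way to handle the examples mentioned above ("house" losing its final e once "-ed"/"-ing" are stripped, "vet" picking up a doubled t in "vetted"/"vetting") is to post-process the stripped stem. The sketch below is only an illustration of that idea under assumed heuristics (a tiny restore-the-e list, collapsing a doubled final consonant, skipping very short words); it is not a complete stemmer and will still misfire on plenty of English words.

my %restore_e = map { $_ => 1 } qw(hous believ mov danc);   # illustrative list only

sub crude_stem {
    my ($token) = @_;

    $token =~ s/[,.!?]$//;                    # remove trailing punctuation
    $token =~ tr/A-Z/a-z/;                    # normalise to lower case

    return $token if length($token) <= 3;     # leave very short words alone

    if ($token =~ s/(?:ing|ed)$//) {
        $token =~ s/([bdfglmnprt])\1$/$1/;    # "vetted"/"vetting" -> "vett" -> "vet"
        $token .= 'e' if $restore_e{$token};  # "housing"/"housed" -> "hous" -> "house"
    } else {
        $token =~ s/s$//;                     # plural: "houses" -> "house", "vets" -> "vet"
    }

    return $token;
}

If something like this were dropped into the loop above, the three substitutions on $token would become a single $token = crude_stem($token); a proper algorithm such as Porter's stemmer handles these cases far more systematically.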
 
Could I suggest that your line

@tokens = split /\s/;

be written as

@tokens = split /\s+/;

This would deal with multiple whitespace characters; otherwise " and  and" gives you extra empty tokens instead of just two counts of the single token "and".

Your line

$token =~ s/s$|ed$|ing$//;   # remove suffixes

could be extended slightly, like this maybe:

if ($token =~ /s$/) {
    $token =~ s/s$// unless UsuallyEndsWith($token, 's');
} elsif ($token =~ /ed$/) {
....

with UsuallyEndsWith defined as

sub UsuallyEndsWith {
    my ($token, $suffix) = @_;
    if ($token =~ /^miss$|^bass$|other exceptions/) {
        1;
    } else {
        0;
    }
}

Interesting problem; there's a good example of this kind of analysis in the Spring issue of The Perl Journal.

Mike
michael.j.lacey@ntlworld.com
Cargill's Corporate Web Site
Please -- Don't send me email questions without posting them in Tek-Tips as well. Better yet -- Post the question in Tek-Tips and send me a note saying "Have a look at so-and-so in the thingy forum would you?"
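For reference, here is one way the partial sketch above might be fleshed out into something runnable. The helper names and the handful of exception words are placeholders chosen for illustration; they are not from the post above and are nowhere near a complete exception list.

my %keeps_suffix = (
    s   => { miss => 1, bass => 1, gas  => 1 },
    ed  => { bed  => 1, red  => 1, seed => 1 },
    ing => { king => 1, ring => 1, sing => 1 },
);

sub usually_ends_with {
    my ($token, $suffix) = @_;
    return exists $keeps_suffix{$suffix}{$token};
}

sub strip_suffix {
    my ($token) = @_;
    foreach my $suffix ('ing', 'ed', 's') {
        if ($token =~ /\Q$suffix\E$/ && !usually_ends_with($token, $suffix)) {
            $token =~ s/\Q$suffix\E$//;
            last;                            # strip at most one suffix
        }
    }
    return $token;
}

In the tokenising loop of the original program, the single substitution line would then become $token = strip_suffix($token);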
 
