
Linguistic Stemming Question


rja (Programmer)
Joined: Jun 1, 2000 | Messages: 4 | Location: US
Working on a "simplified" program to normalize and count English words, picking up all those with the same stem (removing -ed, -ing, -s suffixes).

It works well for "standard" data, but not so well for "exceptions" -- words like "house", "vet", etc. Any suggestions?

Here's what I've got so far:

#-----------------------------------------------------------
# Purpose: This program takes a text file as input, removes
# the past tense ("ed"), plural ("s"), and participle ("ing")
# suffixes, and prints out a list of the word stems/types
# that appear in the text, along with a count of how many
# tokens there are of each type.
#-----------------------------------------------------------

$numberOfWords = 0;                        # no words yet

print "** Name of file to tokenise: ";     # ask for a file
$filename = <STDIN>;                       # get the file name
chomp($filename);

open (FILE, $filename) or                  # open the file
    die "Error: could not open file \"$filename\"\n";

while (<FILE>) {
    @tokens = split /\s/;                  # split the line on whitespace

    foreach $token (@tokens) {             # for each word...
        $token =~ s/[,.!?]$//;             # remove trailing punctuation
        $token =~ tr/A-Z/a-z/;             # normalise to lower case
        $token =~ s/s$|ed$|ing$//;         # remove suffixes
        ++$numberOfWords;                  # increment the token count
        ++$numberOfTokens{$token};         # increment the count for this type
    }
}

close(FILE);                               # close the file

# Sort the type counts and print them out:

foreach $type (sort keys %numberOfTokens) {
    print "$type\t$numberOfTokens{$type}\n";
}

$numberOfTypes = keys %numberOfTokens;     # get the number of types

print "** File contains $numberOfWords tokens over $numberOfTypes types.\n";

#-End-------------------------------------------------------
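To show what I mean, here's a quick throwaway test using the same substitution as above, run over the problem words:

foreach $word (qw(house housed housing houses vet vets vetted vetting)) {
    ($stem = $word) =~ s/s$|ed$|ing$//;    # same suffix removal as the main loop
    print "$word -> $stem\n";
}

# house -> house    housed -> hous    housing -> hous    houses -> house
# vet   -> vet      vets   -> vet     vetted -> vett     vetting -> vett

So "house"/"houses" and "housed"/"housing" come out under two different stems, and "vetted"/"vetting" leave a doubled consonant behind.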
 
Could I suggest that your line

@tokens = split /\s/;

be written as

@tokens = split /\s+/;

This deals with runs of whitespace; otherwise a string like " and  and" (with a doubled space) gives you empty tokens as well as the real ones, instead of two counts for the single token "and".

Your line

$token =~ s/s$|ed$|ing$//;   # Remove suffixes

could be extended slightly, like this maybe:

if ($token =~ /s$/) {
    $token =~ s/s$// unless UsuallyEndsWith($token, 's');
} elsif ($token =~ /ed$/) {
    ...

with UsuallyEndsWith defined as

sub UsuallyEndsWith($$) {
    my ($token, $suffix) = @_;
    if ($token =~ /^miss$|^bass$|other exceptions/) {
        1;
    } else {
        0;
    }
}

Interesting problem -- there's a good example of this kind of analysis in the Spring issue of The Perl Journal.

Mike
michael.j.lacey@ntlworld.com
Cargill's Corporate Web Site
Please -- Don't send me email questions without posting them in Tek-Tips as well. Better yet -- Post the question in Tek-Tips and send me a note saying "Have a look at so-and-so in the thingy forum would you?"
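Just to flesh that idea out, here's a rough, self-contained sketch of how the pieces might fit together. The exception lists inside UsuallyEndsWith are made-up placeholders (you'd fill them with real entries), and it reads from STDIN rather than prompting for a filename:

#!/usr/bin/perl
use strict;
use warnings;

# True if $token normally ends in $suffix, i.e. the apparent suffix is
# really part of the stem and should NOT be stripped.  These word lists
# are illustrative only, not a real exception dictionary.
sub UsuallyEndsWith {
    my ($token, $suffix) = @_;
    my %keep = (
        's'   => qr/^(?:miss|bass|gas|lens)$/,
        'ed'  => qr/^(?:bed|red|shed|seed)$/,
        'ing' => qr/^(?:king|ring|thing|spring)$/,
    );
    return $token =~ $keep{$suffix};
}

my %numberOfTokens;
my $numberOfWords = 0;

while (my $line = <STDIN>) {
    foreach my $token (split /\s+/, $line) {     # /\s+/ copes with runs of spaces
        $token =~ s/[,.!?]$//;                   # strip trailing punctuation
        $token = lc $token;                      # normalise to lower case
        next if $token eq '';                    # skip empty leftovers

        foreach my $suffix ('ing', 'ed', 's') {  # longest suffix first
            if ($token =~ /$suffix$/ && !UsuallyEndsWith($token, $suffix)) {
                $token =~ s/$suffix$//;
                last;                            # strip at most one suffix
            }
        }
        ++$numberOfWords;
        ++$numberOfTokens{$token};
    }
}

foreach my $type (sort keys %numberOfTokens) {
    print "$type\t$numberOfTokens{$type}\n";
}
my $numberOfTypes = keys %numberOfTokens;
print "** $numberOfWords tokens over $numberOfTypes types.\n";

You'd run it as something like perl stem.pl < mytext.txt. The hard part is still deciding what belongs in those exception lists -- a published stemmer such as the Porter algorithm carries a much bigger rule set than this.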
 