Linguistic Stemming Question

Working on a "simplified" program to normalize and count English words, picking up all those with the same stem (removing -ed, -ing, -s suffixes).  

It works well for "standard" data, but not so well for "exceptions", words like "house", "vet", etc.  Any suggestions?

Here's what I've got so far:

# Purpose: This program takes a text file as input, removes the
# past tense ("ed"), plural ("s"), and participle ("ing") suffixes,
# and prints out a list of the word stems/types that appear in the
# text, along with a count of how many tokens there are of each type.

$numberOfWords = 0; # no words yet.

print "** Name of file to tokenise: "; # Ask for a file
chomp($filename = <STDIN>); # get the file name, minus the newline

open (FILE,$filename) or # Open file
    die "Error: could not open file \"$filename\"\n";

while (<FILE>) {
    @tokens = split /\s/; # Split the line

    foreach $token (@tokens) { # For each word...
        $token =~ s/[,.!?]$//; # Remove punct
        $token =~ tr/A-Z/a-z/; # Normalise to lc
        $token =~ s/s$|ed$|ing$//; # Remove suffixes
        ++$numberOfWords;               # Increment count.
        ++$numberOfTokens{$token};      # Increment count
    }
}

close(FILE); # Close the file

# sort the array of type counts, and print them out:

foreach $type (sort keys %numberOfTokens) {
    print "$type\t$numberOfTokens{$type}\n";
}

$numberOfTypes = keys %numberOfTokens; # Get the types.

print "** File contains $numberOfWords tokens over $numberOfTypes types.\n";
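
To make the "exceptions" concrete, here is a small sketch that runs the suffix-stripping substitution from the program above over a handful of words. The sub name naive_stem and the word list are only illustrative and are not part of the original program.

```perl
use strict;
use warnings;

# The same single substitution as in the program above, wrapped in a sub
# (naive_stem is a hypothetical name used only for this demonstration).
sub naive_stem {
    my ($token) = @_;
    $token =~ s/s$|ed$|ing$//;   # strip one trailing -s, -ed or -ing
    return $token;
}

# Regular words first, then the kind of words that cause trouble.
for my $word (qw(walks walked walking houses housed housing vets vetted miss bed)) {
    printf "%-8s -> %s\n", $word, naive_stem($word);
}
```

The regular forms all collapse to "walk", but "houses" gives "house" while "housed" and "housing" give "hous", "vetted" gives "vett" rather than "vet", and words that merely happen to end in a suffix are damaged ("miss" becomes "mis", "bed" becomes "b").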


RE: Linguistic Stemming Question

Could I suggest that your line

@tokens = split /\s/;

be written as

@tokens = split /\s+/;

This would deal with multiple whitespace characters. Otherwise a line like "and  and" produces an empty token between the two words (which then gets counted as a type of its own) in addition to the two counts for "and".
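
A quick way to check the difference is to split the same line three ways. This is only a sketch: the sub names are made up for the comparison, and the third variant, split ' ', is Perl's special whitespace mode, which the post above does not mention but which also discards leading whitespace.

```perl
use strict;
use warnings;

# Hypothetical helper subs, one per splitting style.
sub split_single { my ($line) = @_; return split /\s/,  $line; }  # one whitespace char per separator
sub split_runs   { my ($line) = @_; return split /\s+/, $line; }  # a run of whitespace per separator
sub split_awk    { my ($line) = @_; return split ' ',   $line; }  # like /\s+/, but skips leading whitespace

my $line = " and  and";   # leading space, two spaces between the words

for my $case (['/\s/', \&split_single], ['/\s+/', \&split_runs], ["' '", \&split_awk]) {
    my ($label, $splitter) = @$case;
    my @fields = $splitter->($line);
    printf "%-5s gives %d field(s): %s\n", $label, scalar @fields,
        join ', ', map { "'$_'" } @fields;
}
```

With this input, /\s/ yields four fields (two of them empty), /\s+/ yields three (the leading empty field remains), and ' ' yields just the two words.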

Your line

$token =~ s/s$|ed$|ing$//; # Remove suffixes

could be extended slightly, like this maybe.

if($token =~ /s$/){
  $token =~ s/s$// unless UsuallyEndsWith($token,'s');
} elsif($token =~ /ed$/){
  $token =~ s/ed$// unless UsuallyEndsWith($token,'ed');
} # and so on for "ing"

with UsuallyEndsWith defined as

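sub UsuallyEndsWith($$){
  my ($token, $suffix) = @_;
  # add other exceptions to this alternation
  if ($token =~ /^miss$|^bass$/){
    return 1;   # the word normally ends this way, so leave it alone
  } else {
    return 0;
  }
}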
Interesting problem. There's a good example of this kind of analysis in the Spring issue of The Perl Journal.

