I'm working on a "simplified" program to normalize and count English words, grouping together all words that share a stem (by removing the -ed, -ing, and -s suffixes). <br><br>It works well for "standard" data, but not so well for "exceptions", words like "house", "vet", etc. Any suggestions?<br><br>Here's what I've got so far:<br><br>#-----------------------------------------------------------<br># Purpose: This program takes a text file as input,<br># removes the past tense ("ed"), plural ("s"), and participle<br># ("ing") suffixes, and prints out a list of the<br># word stems/types that appear in the text, along with a <br># count of how many <br># tokens there are of each type.<br>#<br>#-----------------------------------------------------------<br><br>$numberOfWords = 0; # no words yet.<br><br>print "** Name of file to tokenise: "; # Ask for a file<br>$filename = <STDIN>; # get the file name<br>chomp($filename); <br><br>open (FILE,$filename) or # Open file <br> die "Error: could not open file \"$filename\"\n"; <br><br>while (<FILE>)
{ <br> @tokens = split ' '; # Split the line on runs of whitespace<br><br> foreach $token (@tokens) { # For each word...<br> $token =~ s/[,.!?]+$//; # Remove trailing punct<br> $token =~ tr/A-Z/a-z/; # Normalise to lc<br> $token =~ s/(?:s|ed|ing)$//; # Remove suffixes <br> next unless length $token; # Skip tokens that are now empty<br> ++$numberOfWords; # Increment token count.<br> ++$numberOfTokens{$token}; # Increment count for this type <br> }<br>}<br><br>close(FILE); # Close the file<br><br><br># sort the types (hash keys), and print each with its count:<br><br>foreach $type (sort keys %numberOfTokens) {<br> print "$type\t$numberOfTokens{$type}\n";<br>}<br><br><br>$numberOfTypes = keys %numberOfTokens; # Count the types.<br><br>print "** File contains $numberOfWords tokens over $numberOfTypes types.\n";<br><br><br>#-End-------------------------------------------------------
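For the "house" and "vet" cases specifically, here is a minimal sketch of one possible approach, not a full stemmer. The sub name `stem` and all of its rules are assumptions made for illustration; none of it comes from the program above. It requires at least three characters before -ed/-ing, collapses a doubled final consonant left behind by those suffixes, and drops a trailing silent "e" so that "house" and "housing" end up with the same stem. The stems are not always real words, and many words will still be mishandled (for example "added" or "speed").

```perl
use strict;
use warnings;

# Hypothetical helper (an assumption for illustration): reduce a word
# to a rough stem using a few hand-written rules.
sub stem {
    my ($word) = @_;
    $word = lc $word;

    # Strip -ed / -ing only when at least three characters precede it,
    # so short words such as "bed" or "sing" are left alone.
    if ($word =~ s/(?<=\w{3})(?:ed|ing)$//) {
        # Collapse a doubled final consonant: "vett" becomes "vet".
        # l, s and z are excluded so "fall" and "miss" stay intact.
        $word =~ s/([bdgmnprt])\1$/$1/;
    }
    else {
        # Strip a plural -s, but not from short words or from "-ss".
        $word =~ s/(?<=\w{2}[^s])s$//;
    }

    # Drop a trailing "e" so "house" and "housed" share one stem.
    $word =~ s/(?<=\w{3})e$//;

    return $word;
}

for my $w (qw(house houses housed housing vet vets vetted vetting)) {
    printf "%-8s -> %s\n", $w, stem($w);
}
```

With these rules, all four "house" forms map to "hous" and all four "vet" forms map to "vet". In the counting loop, the suffix-stripping substitution could be replaced by a call such as `$token = stem($token);`. Truly irregular words would still need an explicit exception list.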