Remove word from phrase, logical OR problem?

Neomalfoy · Oct 27, 2004

The script is partially working. It should read in a list of words in the form word_word (or word_word_word etc. eventually), split each one on the underscore, and leave the phrase out of the new list if a stopword is found in either position. This works for stopwords in the first position, but phrases with stopwords in the second position still show up in the new list. It seems that the logical OR is not working, and I don't understand why it is failing.

Help appreciated

Code:

#!/usr/bin/perl
use strict;

my (@stopW, $freq, $word, $tuple, $word, $sw2, $c, $d);

######################################### Reading Stopwords ###
open(STOPfile,"<./stopwords.txt");
while (<STOPfile>)
{
    chop;
    push(@stopW, $_);
}
close(STOPfile);
###################################################################

my $inputDIR="./";

my ($freq,$word);

for ($tuple=1; $tuple<=2; $tuple++)
{

    open (INfile,"<".$inputDIR."file".$tuple.".txt");
    open (OUTfile,">cleanup".$tuple.".txt");

    while(<INfile>){
        chop;
        ($freq,$word)=split(/:/,$_);

        # leaving single words alone
        if ($tuple==1)
        {
            foreach (@stopW) {
                if (length($word)<=1 ) { next; }
                if ($word eq $_) {
                    $word=$word."*"; last; }
            }
        }

        # $tuple=2, eliminate phrases with a stopword
        # or if any of the words is <=1 (0,1)
        if ($tuple==2)
        {
            ($a,$b) = split("_",$word);
            if ( (length($a)<=1) || (length($b)<=1) ) { next; }
            foreach (@stopW)
            {
                if ( ($a eq $_) || ($b eq $_))
                { $sw2=1; last; }
            }
        }
        if ($sw2==1) { $sw2=0; next; }

        #print "$word\n";

        #print $freq.":".$word."\n";
        printf OUTfile "%s:%s\n", $freq,$word ;
    } # end while

    close INfile;
    close OUTfile;
} # end for tuple

exit(0);

ishnid · Oct 28, 2004

The easiest way to perform stopword removal in Perl is to use a hash (see the example in the Lingua::EN::StopWords documentation).

Assuming your list of stopwords is one-per-line in stopwords.txt, this is how I'd do it:

Code:

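my %stopwords;
open STOP, '<', 'stopwords.txt' or die "Cannot open stopwords.txt: $!";
while (<STOP>) {
   chomp; # don't use chop - too dangerous
   $stopwords{ $_ } = 1;
}
close STOP;

while (my $line = <DATA>) {
   chomp $line; # otherwise the last word keeps its newline and never matches
   print join('_', grep { !$stopwords{ $_ } } split /_/, $line), "\n";
}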
__DATA__
some_sentence_that_contains_a_stopword_or_two
here_is_another_such_sentence
and_yet_another

Neomalfoy · Oct 28, 2004

Hi Ishnid,

Thanks for the tip. It will help when I want to remove all stopwords, but that is not what I want to do in this case. I should have explained better.

I am examining each word and determining whether it's a stopword because, after processing, each phrase has to be in a certain format that allows stopwords only in certain positions, i.e.
one word phrases -> no change
two word phrases -> nonstopword_nonstopword
three word phrases -> nonstopword_stopword?_nonstopword
four word phrases -> nonstopword_stopword?_stopword?_nonstopword
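
The format rules above can be sketched as a single check on the first and last word of each phrase. This is only an illustration, not code from the thread: the stopword list is a hypothetical stand-in for stopwords.txt, and keep_phrase is a made-up name.

```perl
use strict;
use warnings;

# Hypothetical stopword list for illustration; the real one is read from stopwords.txt.
my %stopwords = map { $_ => 1 } qw(the of in on with to and from);

# Keep a phrase only if its first and last words are non-stopwords longer
# than one character. Words in between are not checked, so stopwords are
# allowed there. One-word phrases pass through unchanged.
sub keep_phrase {
    my ($phrase) = @_;
    my @words = split /_/, $phrase;
    return 0 unless @words;          # empty input: nothing to keep
    return 1 if @words == 1;         # one word phrase -> no change
    for my $edge ($words[0], $words[-1]) {
        return 0 if length($edge) <= 1 || $stopwords{$edge};
    }
    return 1;
}

for my $phrase (qw(magnetic_field portion_of the_sun region_of_the total_solar_eclipse)) {
    printf "%-20s %s\n", $phrase, keep_phrase($phrase) ? 'keep' : 'drop';
}
```

Looking stopwords up in a hash avoids the inner foreach over @stopW and the flag variable, and the same sub covers two, three and four word phrases.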

ishnid · Oct 28, 2004

By any chance are you running this on a Windows system?

If so, using chomp instead of chop should solve your problem. chomp removes newlines from the end of strings, whereas chop removes the last character in the string, whatever it is. On Windows systems, newlines are marked by "\r\n" and chop will only remove the "\n", leaving an extra "\r" character at the end of your second word, causing it not to match.

I notice you're using $a and $b as variable names. This is generally not a good idea as these are special variables used in sorting. They'll probably behave as expected most of the time but when they don't, it can take ages to find what the problem is.
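
A minimal sketch of the line-ending point, assuming the input file has "\r\n" line endings and is read on a system that does not translate them (for example a Windows file read on Unix). Note that with the default $/, chomp also removes only the "\n", so in that situation an explicit substitution is one way to strip both characters. strip_eol is a hypothetical helper, and the variable names are just examples of lexicals that avoid $a and $b.

```perl
use strict;
use warnings;

# Hypothetical helper: strip a trailing LF or CRLF and nothing else.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\r?\n\z//;
    return $line;
}

my $raw = "18:earth_s\r\n";     # a line as read from a CRLF file without translation

my $chopped = $raw;
chop $chopped;                  # removes the final character, here "\n"
my $chomped = $raw;
chomp $chomped;                 # removes a trailing "\n" (default $/), "\r" stays
my $clean = strip_eol($raw);    # removes "\r\n" as a unit

printf "chop:      %d chars\n", length $chopped;
printf "chomp:     %d chars\n", length $chomped;
printf "strip_eol: %d chars\n", length $clean;

# Lexical names instead of the sort variables $a and $b.
my ($freq, $phrase)  = split /:/, $clean;
my ($first, $second) = split /_/, $phrase;
print "second word is '$second'\n";
```

With a stray "\r" still attached, the second word is one character longer than it looks, so it passes a length check and fails an eq comparison against the stopword.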

Neomalfoy · Oct 28, 2004

I'm running it on both Windows and Unix. I get the same results in my output files even after changing to chomp and renaming the variables.

Sample output:

As I said earlier, I can filter out stopwords in the first position but can't get it to filter out from the second/third/etc.

original list ----> new list with phrases filtered out

two word phrase
191:of_the ----> 23:magnetic_field
74:the_solar ----> 22:solar_wind
68:in_the ----> 22:sunspot_group
60:the_sun ----> 21:solar_eclipse
42:the_earth ----> 19:white_light
36:from_the ----> 18:earth_s *error
35:on_the ----> 17:portion_of *error
24:the_moon ----> 17:solar_system
23:magnetic_field ----> 17:associated_with *error
21:to_the ----> 12:frequency_spectrum
20:of_solar ----> 12:geomagnetic_storm
19:h_alpha ----> 12:radio_frequency
19:white_light ----> 12:unit_of *error
18:earth_s ----> 11:universal_time
18:and_the ----> 11:equal_to *error

three word phrase
31:of_the_solar ----> 17:portion_of_the *error
29:of_the_sun ----> 12:radio_frequency_spectrum
17:the_earth_s ----> 11:total_solar_eclipse
17:portion_of_the ----> 10:region_of_the *error
16:of_the_earth ----> 10:bipolar_sunspot_group
13:the_solar_atmosphere ----> 10:sunspot_group_with *error
12:of_the_moon ----> 9:layer_of_the *error
12:of_the_radio ----> 9:surface_of_the *error
12:the_solar_wind ----> 9:earth_s_surface
12:that_portion_of ----> 8:frequency_spectrum_from *error
12:radio_frequency_spectrum ----> 8:group_with_penumbra

The code I'm using to check three word phrase:

if ($tuple==3)
{
    ($as,$bs,$cs) = split("_",$word);
    if (length($as)<=1 || length($bs)==0 || length($cs)<=1 ) { next; }
    foreach (@stopW){
        my $stopWord=$_;
        if ( ($as eq $stopWord) || ($cs eq $stopWord) )
        { $sw3=1; last; }
    }
}
if ($sw3==1) { $sw3=0; next; }

Neomalfoy · Oct 29, 2004

I gave up using split and used a regular expression instead, and it's working now.
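
The regular expression itself was not posted. As a rough sketch of what such an approach might look like, assuming input lines of the form freq:word_word, a pattern that captures only digits and word characters discards any trailing "\r", "\n" or spaces on its own. parse_line is a hypothetical name, not something from the thread.

```perl
use strict;
use warnings;

# Hypothetical parser: returns the frequency followed by the words of the
# phrase, or an empty list if the line does not look like "freq:word_word".
sub parse_line {
    my ($line) = @_;
    # \w+ matches letters, digits and "_" only; \s*$ absorbs "\r\n" and spaces.
    my ($freq, $phrase) = $line =~ /^(\d+):(\w+)\s*$/ or return;
    return ($freq, split /_/, $phrase);
}

my ($freq, @words) = parse_line("17:portion_of\r\n");
print "freq=$freq words=@words\n";
print "last word has ", length($words[-1]), " characters\n";
```

Because the captured words never include line-ending characters, the eq comparison against the stopword list behaves the same for the last word as for the first.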
